Re: sentence detector newline behavior

2014-01-26 Thread Miller, Timothy

On 01/26/2014 09:59 AM, Jörn Kottmann wrote:
>
> The evaluation should ignore white space. I have now committed my fix;
> it would be nice if you could test it.
>
> There might still be something wrong. In my test data I replaced all
> question marks with white spaces, and the result is slightly worse
> than with the original data.
>
> Jörn
Yes, this fixes the whitespace sentence issue, but the evaluation issue
remains. I believe the problem is in SentenceSampleStream, where in the
following block the whitespace trim happens before the newline escape tag
is replaced with the \n character. So test sentences that ended with that
tag will be one character longer than they should be.

>   sentence = sentence.trim();
>   sentence = replaceNewLineEscapeTags(sentence);
>   sentencesString.append(sentence);
>   int end = sentencesString.length();
>   sentenceSpans.add(new Span(begin, end));
>   sentencesString.append(' ');
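
The ordering effect described above can be reproduced in isolation. The sketch below is not the OpenNLP code: the class and method names are hypothetical, and the literal "<LF>" is only an assumed stand-in for the escape tag. It shows that trimming first leaves the substituted \n at the end of the sentence, while replacing first lets trim() remove it. Swapping the two calls is one possible reordering, not necessarily the fix that was committed.

```java
public class TrimOrderDemo {
    // Assumed stand-in for the newline escape tag (hypothetical value).
    static final String TAG = "<LF>";

    // Simplified stand-in for replaceNewLineEscapeTags.
    static String replaceTags(String s) {
        return s.replace(TAG, "\n");
    }

    // Order from the quoted block: trim, then replace the tag.
    // A trailing tag becomes a trailing '\n' that is never trimmed.
    static int lengthTrimFirst(String raw) {
        return replaceTags(raw.trim()).length();
    }

    // Alternative order: replace the tag, then trim.
    // The trailing '\n' is whitespace, so trim() removes it.
    static int lengthReplaceFirst(String raw) {
        return replaceTags(raw).trim().length();
    }

    public static void main(String[] args) {
        String raw = "A sentence." + TAG;
        System.out.println(lengthTrimFirst(raw));    // 12: span end is one too far
        System.out.println(lengthReplaceFirst(raw)); // 11
    }
}
```

With trim-first, the span end computed from the buffer length lands one character past the final period, which matches the off-by-one suspected in the evaluation.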



Re: sentence detector newline behavior

2014-01-26 Thread Jörn Kottmann

On 01/25/2014 10:03 PM, Miller, Timothy wrote:

> On 01/25/2014 12:24 PM, Jörn Kottmann wrote:
>> The code which computes the spans tries to remove white space from them.
>> Removing the white space from a whitespace-only sentence causes
>> the exception you are seeing. Which response would you expect from
>> the sentence detector? Should a whitespace-only sentence be returned?
>
> I would say no.
>
>> If a sentence is terminated by a new line, should the newline
>> char be included in the sentence span or not?
>
> I would also say no.
>
> I made a quick patch for this issue -- now it runs but scores really
> poorly compared to my model file (30 vs 75 or so). I suspect something
> is wrong with the evaluation, the spans being slightly off somehow.
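
The two answers above (no whitespace-only sentences, no trailing newline in the span) can be sketched as a small span-trimming helper. This is an illustration only, not the OpenNLP implementation: the class name, the method names, and the use of int[] {begin, end} pairs in place of a Span type are all assumptions made to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.List;

public class SpanTrimDemo {
    // Shrinks [begin, end) so it excludes leading and trailing whitespace.
    // Returns null when the span contains only whitespace.
    static int[] trimSpan(CharSequence text, int begin, int end) {
        while (begin < end && Character.isWhitespace(text.charAt(begin))) begin++;
        while (end > begin && Character.isWhitespace(text.charAt(end - 1))) end--;
        return begin < end ? new int[] {begin, end} : null;
    }

    // Trims every raw span and drops the whitespace-only ones
    // instead of failing on them.
    static List<int[]> trimAll(CharSequence text, int[][] raw) {
        List<int[]> out = new ArrayList<>();
        for (int[] s : raw) {
            int[] t = trimSpan(text, s[0], s[1]);
            if (t != null) out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "First one.\n   \nSecond one.\n";
        int[][] raw = { {0, 11}, {11, 15}, {15, 27} };
        for (int[] s : trimAll(text, raw)) {
            System.out.println(s[0] + "-" + s[1] + ": " + text.substring(s[0], s[1]));
        }
    }
}
```

Here the middle span covers only spaces and a newline, so it is dropped, and the two remaining spans end at the final period rather than at the newline.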


The evaluation should ignore white space. I have now committed my fix;
it would be nice if you could test it.

There might still be something wrong. In my test data I replaced all
question marks with white spaces, and the result is slightly worse
than with the original data.

Jörn