Re: sentence detector newline behavior
On 01/26/2014 09:59 AM, Jörn Kottmann wrote:
>
> The evaluation should ignore white spaces. I have now committed my fix; it
> would be nice if you could test it.
>
> There might still be something wrong. In my test data I replaced all
> question marks with white spaces, and the result
> is slightly worse than with the original data.
>
> Jörn

Yes, this fixes the whitespace sentence issue, but the evaluation issue remains. I believe the problem is in SentenceSampleStream, where in the following block the whitespace trim happens before the newline escape tag is replaced with the \n character. So test sentences that ended with the escape tag will be one character longer than they should be.

> sentence = sentence.trim();
> sentence = replaceNewLineEscapeTags(sentence);
> sentencesString.append(sentence);
> int end = sentencesString.length();
> sentenceSpans.add(new Span(begin, end));
> sentencesString.append(' ');
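To make the ordering problem concrete, here is a minimal standalone sketch, not the actual OpenNLP code. It assumes the escape tag is the literal string "<LF>" and uses a stand-in replaceNewLineEscapeTags helper; the class and method names are hypothetical. It compares trimming before the replacement (the order in the block quoted above) with trimming after it.

```java
public class TrimOrderSketch {

    // Stand-in for the real helper. ASSUMPTION: the escape tag is the
    // literal "<LF>" and maps to a single '\n' character.
    static String replaceNewLineEscapeTags(String s) {
        return s.replace("<LF>", "\n");
    }

    // Order used in the quoted block: trim first, then replace.
    // The '\n' produced by the replacement survives at the end.
    static String trimThenReplace(String sentence) {
        sentence = sentence.trim();
        return replaceNewLineEscapeTags(sentence);
    }

    // Swapped order: replace first, then trim.
    // String.trim() strips the trailing '\n' along with other whitespace.
    static String replaceThenTrim(String sentence) {
        sentence = replaceNewLineEscapeTags(sentence);
        return sentence.trim();
    }

    public static void main(String[] args) {
        String raw = "Hello world.<LF>";
        System.out.println(trimThenReplace(raw).length()); // one longer
        System.out.println(replaceThenTrim(raw).length());
    }
}
```

With trim first, the span end computed from the appended text includes the trailing newline, which would explain reference spans that are one character too long.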
Re: sentence detector newline behavior
On 01/25/2014 10:03 PM, Miller, Timothy wrote:
> On 01/25/2014 12:24 PM, Jörn Kottmann wrote:
>> The code which computes the spans tries to remove white space from it.
>> Removing the white space from a whitespace-only sentence is causing the
>> exception you are seeing.
>>
>> Which response would you expect from the sentence detector?
>> Should a white space only sentence be returned?
>
> I would say no.
>
>> In case a sentence is terminated by a new line, should the new line char
>> be included in the sentence span or not?
>
> I would also say no.
>
> I made a quick patch for this issue -- now it runs but scores really
> poorly compared to my model file (30 vs 75 or so). I suspect something is
> wrong with the evaluation, the spans being slightly off somehow.

The evaluation should ignore white spaces. I have now committed my fix; it would be nice if you could test it.

There might still be something wrong. In my test data I replaced all question marks with white spaces, and the result is slightly worse than with the original data.

Jörn
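As an illustration of the behavior agreed on above (no whitespace-only sentences, no trailing newline inside a span), here is a minimal sketch of trimming span boundaries. It is not the committed fix: the class name, the trimSpan helper, and the use of an int pair instead of OpenNLP's Span are assumptions made only for this example.

```java
public class SpanTrimSketch {

    /**
     * Shrinks [begin, end) so that it excludes leading and trailing
     * whitespace. Returns null when the range holds only whitespace,
     * so the caller can skip it instead of creating an empty span.
     */
    static int[] trimSpan(CharSequence text, int begin, int end) {
        // Move the start forward past leading whitespace.
        while (begin < end && Character.isWhitespace(text.charAt(begin))) {
            begin++;
        }
        // Move the end backward past trailing whitespace (including '\n').
        while (end > begin && Character.isWhitespace(text.charAt(end - 1))) {
            end--;
        }
        return begin == end ? null : new int[] {begin, end};
    }

    public static void main(String[] args) {
        String text = "Hi there.\n   \nBye.";
        int[] first = trimSpan(text, 0, 10);   // sentence plus its newline
        int[] blank = trimSpan(text, 10, 14);  // whitespace-only range
        System.out.println(first[0] + ".." + first[1]);
        System.out.println(blank == null ? "skipped" : "kept");
    }
}
```

Returning null for the whitespace-only case is just one option; the point is that the caller has to handle that case explicitly rather than trimming it into an invalid span.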