On 05/08/13 19:17, Christopher Kotfila wrote:
Just so I'm clear, true positives require correctly identifying the beginning and end of the sentence, and any model that misidentifies one sentence in fact misidentifies two, because sentences are assumed to appear in a serial fashion. The model won't have a chance to correctly identify the beginning of a new sentence until the end of the second sentence.
Well, this is an implementation issue. As I said before, these metrics are more general than, say, sentence detection. When you report a precision of 75%, it means that three quarters of the sentences your model predicted match individual sentences in your test set. In other words, the predictions were 75% correct. (Finding three quarters of all the sentences in the test set would be a recall of 75%.) Your question is more on the practical side. If you implement a sentence-detection model using regular expressions (and you certainly could), there is no reason to identify start and end separately: all you care about is the split point, and the end of one sentence and the start of the next are assumed to lie immediately before and after it. Similarly, with an ML model your features might have nothing to do with the end of a sentence; you might instead put the emphasis on the start of the sentence, where the features are usually richer.
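To make the split-point idea concrete, here is a minimal sketch in Python. The pattern and the function name `sentence_spans` are assumptions made up for illustration; this is not how OpenNLP does it, and real text (abbreviations, quotations, decimals) would need far more care.

```python
import re

# Hypothetical rule: a split point is a run of whitespace that follows
# '.', '!' or '?' and is followed by an uppercase letter.
SPLIT_POINT = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def sentence_spans(text):
    """Return (start, end) character offsets derived only from split points."""
    spans = []
    start = 0
    for match in SPLIT_POINT.finditer(text):
        # The sentence ends where the split begins...
        spans.append((start, match.start()))
        # ...and the next sentence starts where the split ends.
        start = match.end()
    if start < len(text):
        spans.append((start, len(text)))
    return spans


text = "It rained. We stayed in. Nobody minded."
for start, end in sentence_spans(text):
    print((start, end), text[start:end])
```

Only the split points are detected; every start and end offset falls out of them, which is the sense in which start and end never have to be identified separately.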
If this is the case, then sentences must be compared in an unordered way. The second sentence in my model cannot be compared to the second sentence in the ground truth, because the first sentence of my model may have subsumed the second (or more) sentences. This is different (or is it?) from assuming there is a sentence token and trying to correctly classify the set of sentence tokens as they appear in the stream of tokens that make up the text.
Again, that depends on your implementation.
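One possible implementation, sketched in Python under the assumption that each sentence is represented by its (start, end) character offsets and that a prediction counts only on an exact match. The function name and the offsets below are made up for illustration; this is not a description of any particular library's evaluator.

```python
def precision_recall(gold_spans, predicted_spans):
    """Compare sentences as unordered sets of (start, end) offsets."""
    gold = set(gold_spans)
    predicted = set(predicted_spans)
    # A true positive is a predicted span whose start AND end both match.
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical offsets: the model misses the first boundary and merges
# the first two gold sentences into one span.
gold = [(0, 10), (11, 24), (25, 39), (40, 52)]
predicted = [(0, 24), (25, 39), (40, 52)]

precision, recall = precision_recall(gold, predicted)
print(precision, recall)
```

Because spans are matched by offset rather than by position in the list, the third and fourth sentences still count as correct even though the merge shifted their index. The single missed boundary costs two gold sentences in recall (2 of 4 found) and one wrong prediction in precision (2 of 3 correct).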
But just so I'm clear, the OpenNLP sentence detection module tests on exact matches (start and end of sentences), yes?
If you mean the maxent model, then no! It doesn't operate on exact matches the way, say, a regex does. The maxent sentence-detection module uses a pre-trained probabilistic (maximum-entropy) classifier, which makes its predictions by examining certain features in the text.
Jim
