The exceptions are a list of abbreviations that end in a full stop but are customarily followed by a capitalized word that does not start a new sentence.
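To make the exception-list idea concrete, here is a minimal sketch of a rule-based splitter. The abbreviation list, the function name `split_sentences`, and the boundary rule (terminator, whitespace, capital letter) are all assumptions for illustration, not the code under discussion.

```python
import re

# Hypothetical exception list: abbreviations that end in a full stop and are
# usually followed by a capitalized word without ending the sentence.
EXCEPTIONS = {"Dr.", "Mr.", "Mrs.", "Ms.", "Prof.", "St."}


def split_sentences(text, exceptions=EXCEPTIONS):
    """Split after '.', '!' or '?' followed by whitespace and a capital
    letter, unless the token carrying the terminator is a known exception."""
    sentences = []
    start = 0
    # Candidate boundary: terminator, whitespace, then an uppercase letter.
    for match in re.finditer(r"[.!?]\s+(?=[A-Z])", text):
        # Text of the current sentence up to and including the terminator.
        preceding = text[start:match.start() + 1]
        last_token = preceding.split()[-1]
        if last_token in exceptions:
            continue  # the full stop belongs to an abbreviation
        sentences.append(preceding.strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


if __name__ == "__main__":
    print(split_sentences("Dr. Smith went home. He saw Mr. Jones there."))
```

The weakness is visible in the sketch itself: any abbreviation missing from the list produces a false boundary, which is where identifiers and unusual abbreviations tend to cause trouble.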
It looks to me like machine learning has come a long way in this regard. The best paper on the subject that I found in a quick search is "Unsupervised Multilingual Sentence Boundary Detection" by Kiss and Strunk:

<http://www.linguistics.ruhr-uni-bochum.de/%7Estrunk/ks2005FINAL.pdf>

It doesn't require any lexical resources and can improve its performance on the fly by adapting to the language it is working on.

The fundamental insight is that an abbreviation is a stem that is very commonly followed by a full stop, and a sentence starter is a word that is very commonly preceded by a full stop or other sentence-final marker. From these and a few other intuitions, they build a system that is pretty darned accurate. A major component of that accuracy is the ability to adapt on the fly to the corpus in use.

A deficiency in our use case would be the requirement for training text, but that could be solved with a few moderately sized resources produced by training on reference texts for different languages.

On Fri, Jan 15, 2010 at 2:49 PM, Drew Farris <[email protected]> wrote:

> I've found abbrevs, various identifiers etc are sort of a typical case
> where these things fall flat. I'll see how it performs viz writing
> something from scratch and see what I can come up with.
>
> > Right, although just slightly ironic that we are using a rule-based
> > system for a machine learning project.
>
> Heh, indeed, but it seems entirely appropriate in this case. Of
> course, now I need to go read about statistical approaches to sentence
> boundary detection.
>

-- 
Ted Dunning, CTO
DeepDyve
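As a rough illustration of the abbreviation insight described above, the sketch below flags stems that almost always appear with a trailing full stop. This is a deliberately simplified frequency ratio, not the log-likelihood test from the Kiss and Strunk paper; the function name and the `min_count` and `min_ratio` thresholds are assumptions chosen for the example.

```python
from collections import Counter


def abbreviation_candidates(text, min_count=2, min_ratio=0.9):
    """Return lowercase stems that are nearly always followed by a full stop.

    Simplified stand-in for a statistical abbreviation detector: a plain
    ratio of occurrences with a trailing '.' to all occurrences.
    """
    with_stop = Counter()
    total = Counter()
    for token in text.split():
        stem = token.rstrip(".").lower()
        # Skip empty stems and tokens with digits or other punctuation.
        if not stem or not stem.isalpha():
            continue
        total[stem] += 1
        if token.endswith("."):
            with_stop[stem] += 1
    return {
        stem
        for stem, n in total.items()
        if n >= min_count and with_stop[stem] / n >= min_ratio
    }


if __name__ == "__main__":
    sample = ("Dr. Smith met Dr. Jones. Jones saw the dog. "
              "The dog ran. Dr. Smith left.")
    print(abbreviation_candidates(sample))
```

Ordinary words such as "dog" appear both with and without a full stop, so their ratio stays low, while "dr" is always followed by one. The candidates come from the corpus itself, which is what makes adapting to the text in use possible without a hand-built list.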
