One of their points was that the set of abbreviations is productive and using a static set is sub-optimal.
Whether we need those last few percent is a matter of discussion, I would say.

On Fri, Jan 15, 2010 at 6:41 PM, Benson Margulies <[email protected]> wrote:

> It's not very hard to collect the abbreviations. It may be less work
> than coding what's in the paper.
>
> On Fri, Jan 15, 2010 at 9:37 PM, Ted Dunning <[email protected]> wrote:
> > The exceptions are a list of abbreviations that have a terminal full stop,
> > but which are customarily terminated by a capitalized word which is not the
> > start of a new sentence.
> >
> > It looks to me like machine learning has come a long way in this regard.
> > This is the best paper on the subject that I have seen in a quick search.
> >
> > Unsupervised Multilingual Sentence Boundary Detection
> > <http://www.linguistics.ruhr-uni-bochum.de/%7Estrunk/ks2005FINAL.pdf> by
> > Kiss and Strunk.
> >
> > It doesn't require any lexical resources and can improve performance on the
> > fly by adapting to the language that it is working against.
> >
> > The fundamental insight is that abbreviations are situations where a stem is
> > very commonly followed by a full stop and a sentence start marker is
> > something that is very commonly preceded by a full stop or other sentence
> > marker. From these and a few other intuitions, they build a system that is
> > pretty darned accurate. One major component of its accuracy is due to the
> > ability to adapt on the fly to the corpus in use.
> >
> > A deficiency in our use case would be the requirement for training text, but
> > that could be solved with a few moderate sized resources that are the result
> > of training on reference texts for different languages.
> >
> > On Fri, Jan 15, 2010 at 2:49 PM, Drew Farris <[email protected]> wrote:
> >
> >> I've found abbrevs, various identifiers etc are sort of a typical case
> >> where these things fall flat. I'll see how it performs viz writing
> >> something from scratch and see what I can come up with.
> >>
> >> > Right, although just slightly ironic that we are using a rule-based
> >> > system for a machine learning project.
> >>
> >> Heh, indeed, but it seems entirely appropriate in this case. Of
> >> course, now I need to go read about statistical approaches to sentence
> >> boundary detection.
> >>
> >
> > --
> > Ted Dunning, CTO
> > DeepDyve
>

--
Ted Dunning, CTO
DeepDyve
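
P.S. For reference, the intuition quoted above (an abbreviation is a stem that is very commonly followed by a full stop) can be sketched in a few lines of Python. This is only a rough illustration, not the scoring used in the Kiss and Strunk paper: it uses a plain frequency ratio, and the function name, the regular expression, and the min_count and min_ratio thresholds are all assumptions made up for the example.

```python
import re
from collections import Counter


def abbreviation_candidates(text, min_count=2, min_ratio=0.9):
    """Return {token: ratio} for tokens almost always followed by a full stop.

    Hypothetical helper: a plain frequency ratio stands in for the
    statistical test a real system would use.
    """
    with_stop = Counter()  # occurrences immediately followed by "."
    total = Counter()      # all occurrences of the token

    # A token is a run of ASCII letters, optionally followed by a period.
    for match in re.finditer(r"([A-Za-z]+)(\.?)", text):
        word = match.group(1).lower()
        total[word] += 1
        if match.group(2):
            with_stop[word] += 1

    return {
        word: with_stop[word] / total[word]
        for word in total
        if total[word] >= min_count
        and with_stop[word] / total[word] >= min_ratio
    }


sample = (
    "Dr. Smith met Dr. Jones. Smith said hello. "
    "Jones said hello to Smith again."
)
print(abbreviation_candidates(sample))
```

On this sample only "dr" passes both thresholds: "jones" and "hello" end a sentence once but also occur mid-sentence, so their ratios stay low. A rare word that only ever appears at the end of a sentence would still fool the ratio, which is one reason a real system needs more evidence than this.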
