I haven't read the paper, but I bet you a nickel that the unsupervised thing can consume an initial set.
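To make that concrete, here's a rough sketch of what I have in mind, using NLTK's Punkt tokenizer (an implementation of Kiss & Strunk). The abbreviation list, the corpus path, and the sample sentences below are just placeholders, not anything we've settled on:

# Sketch: seed Punkt with a hand-collected abbreviation set, and/or let it
# train unsupervised on raw text.  Abbreviations, file name, and example
# sentences are made up for illustration.
from nltk.tokenize.punkt import (
    PunktParameters,
    PunktSentenceTokenizer,
    PunktTrainer,
)

# 1) Consume an initial set: abbreviations are stored lower-cased,
#    without the trailing period.
params = PunktParameters()
params.abbrev_types = {"dr", "mr", "mrs", "vs", "etc", "e.g", "i.e"}

tokenizer = PunktSentenceTokenizer(params)
# Should keep "Dr." and "Mr." from being treated as sentence ends.
print(tokenizer.tokenize("Dr. Smith visited Mr. Jones. They talked."))

# 2) Or train unsupervised on reference text and let it discover
#    abbreviations on the fly.
raw_text = open("reference_corpus.txt").read()  # placeholder path
trainer = PunktTrainer()
trainer.train(raw_text, finalize=False)
trainer.finalize_training()
tokenizer = PunktSentenceTokenizer(trainer.get_params())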
On Fri, Jan 15, 2010 at 10:26 PM, Ted Dunning <[email protected]> wrote:

> One of their points was that the set of abbreviations is productive and
> using a static set is sub-optimal.
>
> Whether we need those last few percent is a matter of discussion, I would say.
>
> On Fri, Jan 15, 2010 at 6:41 PM, Benson Margulies <[email protected]> wrote:
>
>> It's not very hard to collect the abbreviations. It may be less work
>> than coding what's in the paper.
>>
>> On Fri, Jan 15, 2010 at 9:37 PM, Ted Dunning <[email protected]> wrote:
>>
>> > The exceptions are a list of abbreviations that have a terminal full stop,
>> > but which are customarily terminated by a capitalized word which is not
>> > the start of a new sentence.
>> >
>> > It looks to me like machine learning has come a long way in this regard.
>> > This is the best paper on the subject that I have seen in a quick search.
>> >
>> > Unsupervised Multilingual Sentence Boundary Detection
>> > <http://www.linguistics.ruhr-uni-bochum.de/%7Estrunk/ks2005FINAL.pdf>
>> > by Kiss and Strunk.
>> >
>> > It doesn't require any lexical resources and can improve performance on
>> > the fly by adapting to the language that it is working against.
>> >
>> > The fundamental insight is that abbreviations are situations where a stem
>> > is very commonly followed by a full stop and a sentence start marker is
>> > something that is very commonly preceded by a full stop or other sentence
>> > marker. From these and a few other intuitions, they build a system that
>> > is pretty darned accurate. One major component of its accuracy is due to
>> > the ability to adapt on the fly to the corpus in use.
>> >
>> > A deficiency in our use case would be the requirement for training text,
>> > but that could be solved with a few moderate sized resources that are the
>> > result of training on reference texts for different languages.
>> >
>> > On Fri, Jan 15, 2010 at 2:49 PM, Drew Farris <[email protected]> wrote:
>> >
>> >> I've found abbrevs, various identifiers etc are sort of a typical case
>> >> where these things fall flat. I'll see how it performs viz writing
>> >> something from scratch and see what I can come up with.
>> >>
>> >> > Right, although just slightly ironic that we are using a rule-based
>> >> > system for a machine learning project.
>> >>
>> >> Heh, indeed, but it seems entirely appropriate in this case. Of
>> >> course, now I need to go read about statistical approaches to sentence
>> >> boundary detection.
>> >>
>> >
>> > --
>> > Ted Dunning, CTO
>> > DeepDyve
>> >
>
> --
> Ted Dunning, CTO
> DeepDyve
>
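P.S. For anyone curious what "very commonly followed by a full stop" cashes out to: Kiss & Strunk score candidate abbreviations with a Dunning-style log-likelihood ratio over (word, period) co-occurrence counts. A simplified sketch follows; their real score adds factors for word length and internal periods, and the counts below are invented:

import math

def log_likelihood(count_w, count_w_period, count_period, n):
    """Dunning-style log-likelihood ratio asking whether word w and a
    following period co-occur more often than independence predicts.
    Kiss & Strunk build their abbreviation score on this statistic
    (extra factors omitted here)."""
    p = count_period / n                                   # P(.) under independence
    p1 = count_w_period / count_w                          # P(. | w)
    p2 = (count_period - count_w_period) / (n - count_w)   # P(. | not w)

    def ll(k, n_, x):
        # binomial log likelihood, guarded against log(0)
        x = min(max(x, 1e-12), 1 - 1e-12)
        return k * math.log(x) + (n_ - k) * math.log(1 - x)

    return 2 * (ll(count_w_period, count_w, p1)
                + ll(count_period - count_w_period, n - count_w, p2)
                - ll(count_w_period, count_w, p)
                - ll(count_period - count_w_period, n - count_w, p))

# Invented counts: "etc" occurs 50 times, 48 of them followed by a period,
# in a corpus of 100,000 tokens containing 8,000 periods.
print(log_likelihood(50, 48, 8000, 100000))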
