Re: Collocation clarification

Ted Dunning Fri, 15 Jan 2010 19:56:05 -0800

The exceptions are a list of abbreviations that end in a full stop but are
customarily followed by a capitalized word that does not start a new
sentence.

It looks to me like machine learning has come a long way in this regard.
This is the best paper on the subject that I have seen in a quick search.

Unsupervised Multilingual Sentence Boundary Detection
<http://www.linguistics.ruhr-uni-bochum.de/%7Estrunk/ks2005FINAL.pdf> by
Kiss and Strunk.

It doesn't require any lexical resources and can improve performance on the
fly by adapting to the language that it is working against.

The fundamental insight is that an abbreviation is a stem that is very
commonly followed by a full stop, and a sentence starter is a word that is
very commonly preceded by a full stop or other sentence-final marker.  From
these and a few other intuitions, they build a system that is pretty darned
accurate.  A major part of that accuracy comes from its ability to adapt on
the fly to the corpus in use.
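
As a rough illustration of the first intuition only, here is a minimal Python
sketch.  It is not the method from the paper, which as I understand it scores
candidates with log-likelihood ratios plus several further heuristics; this
version just uses a plain frequency ratio.  The function names and the
min_count and min_ratio thresholds are made up for the example.

```python
from collections import Counter

def find_abbreviation_candidates(text, min_count=2, min_ratio=0.9):
    """Return lower-cased stems that almost always carry a trailing full stop.

    Simplified stand-in for the paper's statistical test: a plain ratio,
    with hypothetical thresholds chosen only for illustration.
    """
    with_stop = Counter()
    total = Counter()
    for token in text.split():
        # Drop surrounding punctuation, but keep any trailing full stop.
        token = token.strip("\"'()[],;:!?")
        stem = token.rstrip(".").lower()
        if not stem or not any(c.isalpha() for c in stem):
            continue
        total[stem] += 1
        if token.endswith("."):
            with_stop[stem] += 1
    return {
        stem
        for stem, n in total.items()
        if n >= min_count and with_stop[stem] / n >= min_ratio
    }

def split_sentences(text, abbreviations):
    """Split on ., ! and ? unless the full stop belongs to a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        stripped = token.rstrip("\"')]")
        if stripped.endswith(("!", "?")):
            boundary = True
        elif stripped.endswith("."):
            stem = stripped.rstrip(".").lstrip("\"'([").lower()
            boundary = stem not in abbreviations
        else:
            boundary = False
        if boundary:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

if __name__ == "__main__":
    sample = ("Dr. Smith met Dr. Jones. Mr. Brown saw Mr. Green at the park. "
              "The park was busy. Jones left the park early.")
    abbreviations = find_abbreviation_candidates(sample)
    for sentence in split_sentences(sample, abbreviations):
        print(sentence)
```

The abbreviation list is learned from the same text that is being split,
which is the "adapt on the fly" part; a word like "park" that only sometimes
ends a sentence stays well below the ratio threshold.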

A deficiency for our use case would be the requirement for training text, but
that could be solved by providing a few moderately sized resources produced
by training on reference texts for different languages.

On Fri, Jan 15, 2010 at 2:49 PM, Drew Farris <[email protected]> wrote:

> I've found abbrevs, various identifiers etc are sort of a typical case
> where these things fall flat. I'll see how it performs viz writing
> something from scratch and see what I can come up with.
>
> > Right, although just slightly ironic that we are using a rule-based
> > system for a machine learning project.
>
> Heh, indeed, but it seems entirely appropriate in this case. Of
> course, now I need to go read about statistical approaches to sentence
> boundary detection.
>

-- 
Ted Dunning, CTO
DeepDyve
