I haven't read the paper, but I bet you a nickel that the unsupervised
approach can consume an initial set.
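
To make that concrete, here is a rough Python sketch of seeding a
corpus-driven abbreviation detector with a hand-collected list. It is NOT
the Kiss and Strunk algorithm (which, again, I haven't read); it is a much
cruder ratio heuristic built only on the intuition quoted below, that an
abbreviation is a stem that is almost always followed by a full stop. The
function name find_abbreviations and the min_count / min_ratio thresholds
are invented for illustration.

```python
from collections import Counter


def find_abbreviations(text, seed=(), min_count=2, min_ratio=0.8):
    """Return the seed set plus stems that nearly always carry a full stop.

    Crude heuristic for illustration only; thresholds are arbitrary
    assumptions, not values taken from any paper.
    """
    with_stop = Counter()     # stem -> times seen with a trailing period
    without_stop = Counter()  # stem -> times seen without one

    for token in text.split():
        # Drop surrounding quotes, brackets and light punctuation.
        word = token.strip("\"'()[],;:")
        if not word:
            continue
        if word.endswith("."):
            stem = word[:-1].lower()
            # Keep alphabetic stems, allowing internal periods ("p.m").
            if stem and stem.replace(".", "").isalpha():
                with_stop[stem] += 1
        elif word.replace(".", "").isalpha():
            without_stop[word.lower()] += 1

    learned = set()
    for stem, n in with_stop.items():
        total = n + without_stop[stem]
        if n >= min_count and n / total >= min_ratio:
            learned.add(stem)

    # The hand-collected list is always trusted; the corpus only adds to it.
    normalized_seed = {s.lower().rstrip(".") for s in seed}
    return normalized_seed | learned
```

Calling find_abbreviations(corpus_text, seed=["Mr.", "St."]) returns the
seed entries plus whatever the corpus supports. A word that only ever shows
up at the end of sentences would fool this version, which is presumably the
sort of case the paper's extra statistics are there to handle.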

On Fri, Jan 15, 2010 at 10:26 PM, Ted Dunning <[email protected]> wrote:
> One of their points was that the set of abbreviations is productive and
> using a static set is sub-optimal.
>
> Whether we need those last few percent is a matter of discussion, I would
> say.
>
> On Fri, Jan 15, 2010 at 6:41 PM, Benson Margulies
> <[email protected]> wrote:
>
>> It's not very hard to collect the abbreviations. It may be less work
>> than coding what's in the paper.
>>
>> On Fri, Jan 15, 2010 at 9:37 PM, Ted Dunning <[email protected]>
>> wrote:
>> > The exceptions are a list of abbreviations that have a terminal full stop,
>> > but which are customarily followed by a capitalized word which is not the
>> > start of a new sentence.
>> >
>> > It looks to me like machine learning has come a long way in this regard.
>> > This is the best paper on the subject that I have seen in a quick search.
>> >
>> > Unsupervised Multilingual Sentence Boundary Detection
>> > <http://www.linguistics.ruhr-uni-bochum.de/%7Estrunk/ks2005FINAL.pdf> by
>> > Kiss and Strunk.
>> >
>> > It doesn't require any lexical resources and can improve performance on the
>> > fly by adapting to the language that it is working against.
>> >
>> > The fundamental insight is that abbreviations are situations where a stem is
>> > very commonly followed by a full stop and a sentence start marker is
>> > something that is very commonly preceded by a full stop or other sentence
>> > marker.  From these and a few other intuitions, they build a system that is
>> > pretty darned accurate.  One major component of its accuracy is due to the
>> > ability to adapt on the fly to the corpus in use.
>> >
>> > A deficiency in our use case would be the requirement for training text, but
>> > that could be solved with a few moderate sized resources that are the result
>> > of training on reference texts for different languages.
>> >
>> > On Fri, Jan 15, 2010 at 2:49 PM, Drew Farris <[email protected]>
>> > wrote:
>> >
>> >> I've found abbreviations, various identifiers, etc. are a typical case
>> >> where these things fall flat. I'll see how it performs versus writing
>> >> something from scratch and see what I can come up with.
>> >>
>> >> > Right, although just slightly ironic that we are using a rule-based
>> >> > system for a machine learning project.
>> >>
>> >> Heh, indeed, but it seems entirely appropriate in this case. Of
>> >> course, now I need to go read about statistical approaches to sentence
>> >> boundary detection.
>> >>
>> >
>> >
>> >
>> > --
>> > Ted Dunning, CTO
>> > DeepDyve
>> >
>>
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>
