Hi, it actually is quite nice and it can be used in production for such things as have been discussed lately in this group.
If you want to play it safe: The iterator breaks at dots after numbers (e.g. 15. March), the precision of the algorithm can be increased if you never break after a number. The implementation is fast. Regards, Karsten Mit freundlichen Grüßen aus Saarbrücken -- Dr.-Ing. Karsten Konrad Head of Artificial Intelligence Lab XtraMind Technologies GmbH Stuhlsatzenhausweg 3 D-66123 Saarbrücken Phone: +49 (681) 3025113 Fax: +49 (681) 3025109 [EMAIL PROTECTED] www.xtramind.com -----Ursprüngliche Nachricht----- Von: Philippe Laflamme [mailto:[EMAIL PROTECTED] Gesendet: Montag, 17. November 2003 15:39 An: Lucene Users List Betreff: RE: inter-term correlation [was Re: Vector Space Model in Lucene?] There is already an implementation in the Java API for sentence boundary detection. The BreakIterator in the java.text package has this to say about sentence splitting: "Sentence boundary analysis allows selection with correct interpretation of periods within numbers and abbreviations, and trailing punctuation marks such as quotation marks and parentheses." http://java.sun.com/j2se/1.4.1/docs/api/java/text/BreakIterator.html The whole i18n Java API is based on the ICU framework from IBM: http://oss.software.ibm.com/icu/index.html It supports many languages. I personally do not have any experience with the BreakIterator in Java. Has anyone used it in any production environment? I'd be very interested to learn more about it's efficiency. Regards, Phil > -----Original Message----- > From: Chong, Herb [mailto:[EMAIL PROTECTED] > Sent: November 17, 2003 08:53 > To: Lucene Users List > Subject: RE: inter-term correlation [was Re: Vector Space Model in > Lucene?] > > > i have a program written in Icon that does basic sentence splitting. > with about 5 heuristics and one small lookup table, i can get well > over 90% accuracy doing sentence boundary detection on email. for well > edited English text, like newswires, i can manage closer to 99%. this > is all that is needed for significantly improving a search engine's > performance when the query engine respects sentence boundaries. > incidentally, the GATE Information Extraction framework cites some > references that indicate that for named entity feature extraction, > their system can exceed the ability of trained humans to detect and > classify named entities if only one person does the detection. > collaborating humans are still better, but no-one has the time in > practical applications. > > you probably know, since you know about Markov chains, that within > sentence term correlation, and hence the language model, is different > than across sentences. linguists have known this for a very long time. > it isn't hard to put this capability into a search engine, but it > absolutely breaks down unless there is sentence boundary information > stored for use at query time. > > Herb.... > > -----Original Message----- > From: Andrzej Bialecki [mailto:[EMAIL PROTECTED] > Sent: Friday, November 14, 2003 5:54 PM > To: Lucene Users List > Subject: Re: inter-term correlation [was Re: Vector Space Model in > Lucene?] > > > Well ... Sure, nothing can replace a human mind. But believe it or > not, there are studies which show that even human experts can > significantly differ in their opinions on what are key-phrases for a > given text. So, the results are never clear cut with humans either... > > So, in this sense a heuristic tool for sentence splitting and > key-phrase detection can go long ways. For example, the application I > mentioned, uses quite a few heuristic rules (+ Markov chains as a > heavier ammunition :-), and it comes up with the following phrases for > your email discussion (the text quoted below): > > (lang=EN): NLP, trainable rule-based tagging, natural language > processing, apache, NLP expert > > Now, this set of key-phrases does reflect the main noun-phrases in the > text... which means I have a practical and tangible benefit from NLP. > QED ;-) > > Best regards, > Andrzej > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]