Just catching this thread, but if I understand what is being asked I can share how I do multi-word phrase matching. If that's not what's wanted, pardons!
Ok, I load an entire dictionary into a lucene index, phrases and all. When I'm scanning some text, I do lookups in this dictionary index using one word at a time with the word _at the beginning_ of the indexed field only. This returns all words/phrases beginning with the word I searched for. I then scan the rest of the input text and compare it to the longest matching phrase in my lucene results. That then becomes a meaningful token. Input text: "The President of the United States lives in the White House" Tokens: "The" "President of the United States" "lives" "in" "the" "White House" Term: "President" Result: "President of a Company" "President" "President of the United States" Take the longest match. HTH, Darren > On Tue, Aug 4, 2009 at 3:56 AM, Shai Erera<ser...@gmail.com> wrote: >> 2) Use a dictionary (real dictionary), and search it for every >> substring, >> e.g. "a", "ab", "abo" ... "about" etc. If you find a match, split it >> there. >> This needs some fine tuning, like checking if the rest is also a word >> and if >> the full string is also a word, so that you don't break up meaningful >> words. >> You'll need to get a dictionary for that. > > I do not have a solution to this, but it strikes me as very similar to > they way you traverse Japanese to break words, since that has no > spaces. Is there a Japanese tokenizer and, if so, does it handle this? > If so, you could replace the Japanese dictionary with an English > dictionary. Just a random thought had that might / might not help. > > Phil > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > > --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org