Just catching this thread, but if I understand what is being asked I can
share how I do multi-word phrase matching. If that's not what's wanted,
pardons!

Ok, I load an entire dictionary into a Lucene index, phrases and all.

When I'm scanning some text, I look up one word at a time in this
dictionary index, matching that word only _at the beginning_ of the
indexed field. This returns all words/phrases beginning with the word I
searched for.

I then compare the input text that follows against the phrases in my
Lucene results and keep the longest one that matches. That phrase then
becomes a single meaningful token.

Input text:
"The President of the United States lives in the White House"

Tokens:
"The"
"President of the United States"
"lives"
"in"
"the"
"White House"

Term: "President"
Result:
"President of a Company"
"President"
"President of the United States"

Take the longest result that matches the input text, here
"President of the United States".
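
In case it helps, here is a rough sketch of the matching loop. It is not my actual code: a plain Python dict keyed on the first word stands in for the Lucene lookup, and the names build_index and tokenize are made up for illustration. It also assumes whitespace splitting and case-insensitive comparison.

```python
from collections import defaultdict


def build_index(entries):
    """Group dictionary entries by first word (stand-in for the Lucene index)."""
    index = defaultdict(list)
    for entry in entries:
        words = entry.lower().split()
        if words:
            index[words[0]].append(words)
    return index


def tokenize(text, index):
    """Split text into tokens, preferring the longest dictionary phrase."""
    words = text.split()
    tokens = []
    i = 0
    while i < len(words):
        best = 1  # fall back to the single word when no phrase matches
        # "Lookup": every entry that begins with the current word.
        for candidate in index.get(words[i].lower(), []):
            n = len(candidate)
            following = [w.lower() for w in words[i:i + n]]
            if n > best and following == candidate:
                best = n
        tokens.append(" ".join(words[i:i + best]))
        i += best
    return tokens


dictionary = [
    "President",
    "President of a Company",
    "President of the United States",
    "White House",
]
index = build_index(dictionary)
tokens = tokenize(
    "The President of the United States lives in the White House", index
)
print(tokens)
```

With a real index, the index.get(...) call would be replaced by whatever query restricts the match to the start of the field; the rest of the loop stays the same.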

HTH,
Darren



> On Tue, Aug 4, 2009 at 3:56 AM, Shai Erera<ser...@gmail.com> wrote:
>> 2) Use a dictionary (real dictionary), and search it for every
>> substring,
>> e.g. "a", "ab", "abo" ... "about" etc. If you find a match, split it
>> there.
>> This needs some fine tuning, like checking if the rest is also a word
>> and if
>> the full string is also a word, so that you don't break up meaningful
>> words.
>> You'll need to get a dictionary for that.
>
> I do not have a solution to this, but it strikes me as very similar to
> the way you traverse Japanese to break words, since that has no
> spaces. Is there a Japanese tokenizer and, if so, does it handle this?
> If so, you could replace the Japanese dictionary with an English
> dictionary. Just a random thought that might / might not help.
>
> Phil
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>

