Ahhhh, ok. Interesting problem there as well.

I'll think on that one some too!

cheers.

> Hi Darren,
>
> The question was: given the string "aboutus" in a document, how can you
> return that document as a result for the query "about us" (note the
> space)? So we're mostly discussing how to detect and then break the word
> "aboutus" into two words.
>
> What you wrote seems interesting as well, though I think it's not related
> to Harig's original question. Maybe he'll be interested in that too,
> though.
>
> Shai
>
> On Tue, Aug 4, 2009 at 6:27 PM, <dar...@ontrenet.com> wrote:
>
>> Just catching this thread, but if I understand what is being asked, I can
>> share how I do multi-word phrase matching. If that's not what's wanted,
>> pardons!
>>
>> Ok, I load an entire dictionary into a Lucene index, phrases and all.
>>
>> When I'm scanning some text, I do lookups in this dictionary index one
>> word at a time, matching the word _at the beginning_ of the indexed field
>> only. This returns all words/phrases beginning with the word I searched
>> for.
>>
>> I then scan the rest of the input text and compare it to the longest
>> matching phrase in my Lucene results. That then becomes a meaningful
>> token.
>>
>> Input text:
>> "The President of the United States lives in the White House"
>>
>> Tokens:
>> "The"
>> "President of the United States"
>> "lives"
>> "in"
>> "the"
>> "White House"
>>
>> Term: "President"
>> Result:
>> "President of a Company"
>> "President"
>> "President of the United States"
>>
>> Take the longest match.
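>>
>> A rough sketch of that longest-match pass in plain Java. For brevity it
>> uses an in-memory Set as the dictionary instead of the Lucene prefix
>> lookup, and the class and method names are only illustrative:
>>
>>   import java.util.*;
>>
>>   public class LongestMatchTokenizer {
>>       // Stand-in for the Lucene dictionary index: phrases and single words.
>>       private final Set<String> dictionary;
>>
>>       public LongestMatchTokenizer(Set<String> dictionary) {
>>           this.dictionary = dictionary;
>>       }
>>
>>       public List<String> tokenize(String text) {
>>           String[] words = text.split("\\s+");
>>           List<String> tokens = new ArrayList<String>();
>>           int i = 0;
>>           while (i < words.length) {
>>               // Grow a candidate phrase from the current word and remember
>>               // the longest one the dictionary contains.
>>               int bestEnd = i;
>>               StringBuilder candidate = new StringBuilder();
>>               for (int j = i; j < words.length; j++) {
>>                   if (j > i) candidate.append(' ');
>>                   candidate.append(words[j]);
>>                   if (dictionary.contains(candidate.toString())) {
>>                       bestEnd = j + 1;
>>                   }
>>               }
>>               if (bestEnd > i) {
>>                   // The longest dictionary phrase becomes one token.
>>                   StringBuilder phrase = new StringBuilder();
>>                   for (int k = i; k < bestEnd; k++) {
>>                       if (k > i) phrase.append(' ');
>>                       phrase.append(words[k]);
>>                   }
>>                   tokens.add(phrase.toString());
>>                   i = bestEnd;
>>               } else {
>>                   tokens.add(words[i]); // no phrase match: keep the single word
>>                   i++;
>>               }
>>           }
>>           return tokens;
>>       }
>>
>>       public static void main(String[] args) {
>>           Set<String> dict = new HashSet<String>(Arrays.asList(
>>               "President", "President of a Company",
>>               "President of the United States", "White House"));
>>           System.out.println(new LongestMatchTokenizer(dict).tokenize(
>>               "The President of the United States lives in the White House"));
>>           // [The, President of the United States, lives, in, the, White House]
>>       }
>>   }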
>>
>> HTH,
>> Darren
>>
>>
>>
>> > On Tue, Aug 4, 2009 at 3:56 AM, Shai Erera <ser...@gmail.com> wrote:
>> >> 2) Use a dictionary (real dictionary), and search it for every
>> >> substring, e.g. "a", "ab", "abo" ... "about" etc. If you find a match,
>> >> split it there. This needs some fine tuning, like checking if the rest
>> >> is also a word and if the full string is also a word, so that you don't
>> >> break up meaningful words. You'll need to get a dictionary for that.
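>> >>
>> >> A tiny sketch of that split check in Java, assuming a plain
>> >> Set<String> word list; the trySplit name is made up for illustration:
>> >>
>> >>   // Uses java.util.List, Set, Arrays, Collections.
>> >>   // Try every split point; keep the token whole if it is already a
>> >>   // word, or if no split yields two dictionary words.
>> >>   static List<String> trySplit(String token, Set<String> dictionary) {
>> >>       if (dictionary.contains(token)) {
>> >>           return Collections.singletonList(token);
>> >>       }
>> >>       for (int i = 1; i < token.length(); i++) {
>> >>           String head = token.substring(0, i);
>> >>           String tail = token.substring(i);
>> >>           if (dictionary.contains(head) && dictionary.contains(tail)) {
>> >>               return Arrays.asList(head, tail); // "aboutus" -> [about, us]
>> >>           }
>> >>       }
>> >>       return Collections.singletonList(token); // no split found
>> >>   }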
>> >
>> > I do not have a solution to this, but it strikes me as very similar to
>> > the way you traverse Japanese to break words, since that has no
>> > spaces. Is there a Japanese tokenizer and, if so, does it handle this?
>> > If so, you could replace the Japanese dictionary with an English
>> > dictionary. Just a random thought I had that might or might not help.
>> >
>> > Phil
>> >
>>
>>
>


