Hi Darren,

The question was how, given the string "aboutus" in a document, you can return
that document as a result for the query "about us" (note the space). So we're
mostly discussing how to detect and then break the word "aboutus" into two
words.
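
For reference, here is roughly what I had in mind with the dictionary idea from
my earlier mail. It's only a sketch: a plain in-memory word set stands in for a
real dictionary (it could just as well be lookups against a small Lucene index
of dictionary terms), and it only tries a single split point:

  import java.util.Set;

  // Sketch: try to break a run-together token like "aboutus" into two
  // dictionary words.
  static String[] trySplit(String s, Set<String> dictionary) {
    if (dictionary.contains(s)) {
      return new String[] { s };            // the full string is a word - don't break it
    }
    for (int i = 1; i < s.length(); i++) {  // grow the prefix: "a", "ab", "abo", ...
      String head = s.substring(0, i);
      String tail = s.substring(i);
      if (dictionary.contains(head) && dictionary.contains(tail)) {
        return new String[] { head, tail }; // "aboutus" -> "about", "us"
      }
    }
    return new String[] { s };              // no sensible split found
  }

If both pieces are dictionary words (and the full string isn't itself a word),
you can emit them as additional tokens at the same position, so the query
"about us" will match the document.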

What you wrote seems interesting as well, though I don't think it's related to
Harig's original question. Maybe he'll be interested in that too.

Shai

On Tue, Aug 4, 2009 at 6:27 PM, <dar...@ontrenet.com> wrote:

> Just catching this thread, but if I understand what is being asked, I can
> share how I do multi-word phrase matching. If that's not what's wanted,
> pardons!
>
> OK, I load an entire dictionary into a Lucene index, phrases and all.
>
> When I'm scanning some text, I do lookups in this dictionary index one
> word at a time, matching the word _at the beginning_ of the indexed field
> only. This returns all words/phrases that begin with the word I searched
> for.
>
> I then compare the following words of the input text against the phrases
> in my Lucene results and take the longest one that matches. That match
> then becomes a single, meaningful token.
>
> Input text:
> "The President of the United States lives in the White House"
>
> Tokens:
> "The"
> "President of the United States"
> "lives"
> "in"
> "the"
> "White House"
>
> Term: "President"
> Result:
> "President of a Company"
> "President"
> "President of the United States"
>
> Take the longest match.
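>
> Very roughly, in Lucene terms -- this is only a sketch, and the field
> names ("firstWord", "phrase"), the hit limit, and the greedy matching
> below are illustrative assumptions rather than my exact code:
>
>   import java.io.IOException;
>   import java.util.ArrayList;
>   import java.util.List;
>   import org.apache.lucene.index.Term;
>   import org.apache.lucene.search.*;
>
>   // Look up all dictionary entries whose first word is 'word'.
>   List<String> candidates(IndexSearcher searcher, String word) throws IOException {
>     Query q = new TermQuery(new Term("firstWord", word.toLowerCase()));
>     TopDocs hits = searcher.search(q, 100);
>     List<String> phrases = new ArrayList<String>();
>     for (ScoreDoc sd : hits.scoreDocs) {
>       phrases.add(searcher.doc(sd.doc).get("phrase"));
>     }
>     return phrases;
>   }
>
>   // Greedy longest-match tokenization over whitespace-split input.
>   List<String> tokenize(IndexSearcher searcher, String text) throws IOException {
>     String[] words = text.split("\\s+");
>     List<String> tokens = new ArrayList<String>();
>     int i = 0;
>     while (i < words.length) {
>       String best = words[i];   // fall back to the single word
>       int bestLen = 1;
>       for (String phrase : candidates(searcher, words[i])) {
>         String[] pw = phrase.split("\\s+");
>         if (pw.length <= bestLen || i + pw.length > words.length) continue;
>         boolean match = true;   // first word already matched via the lookup
>         for (int j = 1; j < pw.length && match; j++) {
>           match = pw[j].equalsIgnoreCase(words[i + j]);
>         }
>         if (match) { best = phrase; bestLen = pw.length; }
>       }
>       tokens.add(best);
>       i += bestLen;             // skip past the matched phrase
>     }
>     return tokens;
>   }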
>
> HTH,
> Darren
>
>
>
> > On Tue, Aug 4, 2009 at 3:56 AM, Shai Erera <ser...@gmail.com> wrote:
> >> 2) Use a dictionary (real dictionary), and search it for every
> >> substring,
> >> e.g. "a", "ab", "abo" ... "about" etc. If you find a match, split it
> >> there.
> >> This needs some fine tuning, like checking if the rest is also a word
> >> and if
> >> the full string is also a word, so that you don't break up meaningful
> >> words.
> >> You'll need to get a dictionary for that.
> >
> > I do not have a solution to this, but it strikes me as very similar to
> > the way you traverse Japanese text to break it into words, since Japanese
> > has no spaces. Is there a Japanese tokenizer and, if so, does it handle
> > this? If so, you could replace the Japanese dictionary with an English
> > dictionary. Just a random thought I had that might or might not help.
> >
> > Phil
> >
