Ahhhh, ok. Interesting problem there as well. I'll think on that one some too!
cheers.

> Hi Darren,
>
> The question was, how given a string "aboutus" in a document, you can
> return that document as a result to the query "about us" (note the
> space). So we're mostly discussing how to detect and then break the word
> "aboutus" into two words.
>
> What you wrote, though, seems interesting as well, only I think it is
> not related to Harig's original question. Maybe he'll be interested in
> that too, though.
>
> Shai
>
> On Tue, Aug 4, 2009 at 6:27 PM, <dar...@ontrenet.com> wrote:
>
>> Just catching this thread, but if I understand what is being asked, I
>> can share how I do multi-word phrase matching. If that's not what's
>> wanted, pardons!
>>
>> OK, I load an entire dictionary into a Lucene index, phrases and all.
>>
>> When I'm scanning some text, I do lookups in this dictionary index
>> using one word at a time, with the word _at the beginning_ of the
>> indexed field only. This returns all words/phrases beginning with the
>> word I searched for.
>>
>> I then scan the rest of the input text and compare it to the longest
>> matching phrase in my Lucene results. That then becomes a meaningful
>> token.
>>
>> Input text:
>> "The President of the United States lives in the White House"
>>
>> Tokens:
>> "The"
>> "President of the United States"
>> "lives"
>> "in"
>> "the"
>> "White House"
>>
>> Term: "President"
>> Result:
>> "President of a Company"
>> "President"
>> "President of the United States"
>>
>> Take the longest match.
>>
>> HTH,
>> Darren
>>
>> > On Tue, Aug 4, 2009 at 3:56 AM, Shai Erera <ser...@gmail.com> wrote:
>> >> 2) Use a dictionary (a real dictionary), and search it for every
>> >> substring, e.g. "a", "ab", "abo" ... "about" etc. If you find a
>> >> match, split it there. This needs some fine-tuning, like checking
>> >> whether the rest is also a word and whether the full string is also
>> >> a word, so that you don't break up meaningful words. You'll need to
>> >> get a dictionary for that.
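For what it's worth, Darren's longest-match scheme can be sketched without Lucene at all. In this illustrative sketch (class name `PhraseTokenizer` and all method names are mine, not Lucene's), a plain in-memory `Set` stands in for the dictionary index, and a simple containment check over progressively longer candidate phrases replaces the prefix query against the start of the indexed field:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of the longest-match tokenization Darren describes. A HashSet of
// lower-cased entries stands in for his Lucene dictionary index.
public class PhraseTokenizer {

    private final Set<String> dictionary = new HashSet<>();

    public void addEntry(String phrase) {
        dictionary.add(phrase.toLowerCase());
    }

    public List<String> tokenize(String text) {
        String[] words = text.split("\\s+");
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < words.length) {
            int bestLen = 1;               // default: emit the single word
            String bestMatch = words[i];
            // Grow a candidate phrase starting at words[i], keeping the
            // longest one found in the dictionary ("take the longest match").
            StringBuilder candidate = new StringBuilder(words[i]);
            for (int j = i + 1; j < words.length; j++) {
                candidate.append(' ').append(words[j]);
                if (dictionary.contains(candidate.toString().toLowerCase())) {
                    bestLen = j - i + 1;
                    bestMatch = candidate.toString();
                }
            }
            tokens.add(bestMatch);
            i += bestLen;   // skip past the words consumed by the match
        }
        return tokens;
    }

    public static void main(String[] args) {
        PhraseTokenizer t = new PhraseTokenizer();
        t.addEntry("President of the United States");
        t.addEntry("President of a Company");
        t.addEntry("President");
        t.addEntry("White House");
        System.out.println(t.tokenize(
            "The President of the United States lives in the White House"));
        // prints [The, President of the United States, lives, in, the, White House]
    }
}
```

This reproduces the token list from Darren's example; the real version would of course get the candidate phrases from the Lucene prefix search rather than brute-force growth.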
>> >
>> > I do not have a solution to this, but it strikes me as very similar
>> > to the way you traverse Japanese to break words, since that has no
>> > spaces. Is there a Japanese tokenizer and, if so, does it handle
>> > this? If so, you could replace the Japanese dictionary with an
>> > English dictionary. Just a random thought that might / might not
>> > help.
>> >
>> > Phil
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> > For additional commands, e-mail: java-user-h...@lucene.apache.org
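Shai's suggestion (2), the one that actually answers the "aboutus" question, can also be sketched in a few lines. This is a minimal sketch under my own assumptions (class and method names are illustrative, and a plain `Set` plays the role of the real dictionary); it walks the substrings "a", "ab", "abo", ... and splits where both halves are dictionary words, with the fine-tuning Shai mentions of leaving whole dictionary words alone:

```java
import java.util.Set;

// Sketch of dictionary-based word splitting: "aboutus" -> "about us".
public class WordSplitter {

    private final Set<String> dictionary;

    public WordSplitter(Set<String> dictionary) {
        this.dictionary = dictionary;
    }

    // Returns the split form (e.g. "about us"), or null when the token is
    // already a word or no split point yields two dictionary words.
    public String split(String token) {
        if (dictionary.contains(token)) {
            return null; // already meaningful; don't break it up
        }
        for (int i = 1; i < token.length(); i++) {
            String head = token.substring(0, i);   // "a", "ab", "abo", ...
            String tail = token.substring(i);      // the rest of the string
            if (dictionary.contains(head) && dictionary.contains(tail)) {
                return head + " " + tail;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        WordSplitter s = new WordSplitter(Set.of("about", "us", "a"));
        System.out.println(s.split("aboutus")); // prints about us
    }
}
```

A production version would need more of the fine-tuning Shai alludes to (recursing for three-or-more-word run-ons, preferring the split whose parts are longest), but the core substring walk is the same. It is also essentially the dictionary-traversal idea Phil raises for Japanese segmentation.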