Interesting... I don't have access to a Japanese dictionary, so I just extract bi-grams. In this case, though, I guess that if one has access to an English dictionary (are you aware of an open-source or free one, BTW?), one can use the method you mention.
But still, doing this for every token you meet is extremely expensive (for Japanese it's all you can do, but that case is rather special), so I'd first make sure I can pinpoint the very small number of tokens that actually need to be processed like that.

Shai

On Tue, Aug 4, 2009 at 6:37 PM, Phil Whelan <phil...@gmail.com> wrote:

> On Tue, Aug 4, 2009 at 8:31 AM, Shai Erera<ser...@gmail.com> wrote:
> > Hi Darren,
> >
> > The question was, how given a string "aboutus" in a document, you can return
> > that document as a result to the query "about us" (note the space). So we're
> > mostly discussing how to detect and then break the word "aboutus" to two
> > words.
>
> When traversing Japanese text you have to use a similar algorithm to
> searching a maze (keep left and retrace your steps). It's possible to
> go a long way along a sentence before you find the tokens you've already
> picked out are invalid. Rough example...
>
> thereallibrary
> there allibrary
> there all i brary (fail)
> the reallibrary
> the real library
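
For what it's worth, the "keep left and retrace your steps" walk in the quoted rough example could be sketched roughly as below. This is only an illustration, not anything from Lucene: the WORDS set is a tiny made-up stand-in for a real English dictionary, and segment() is a hypothetical helper name. It tries the longest dictionary word at each position and backs off when the remainder cannot be split.

```python
from functools import lru_cache

# Hypothetical toy dictionary (an assumption for this sketch); a real one
# would be loaded from whatever English word list is available.
WORDS = frozenset({"the", "there", "real", "all", "i", "library", "about", "us"})


def segment(text, words=WORDS):
    """Return one split of `text` into dictionary words, or None if there is none."""
    max_len = max((len(w) for w in words), default=0)

    @lru_cache(maxsize=None)  # remember dead ends so each position is solved once
    def solve(start):
        if start == len(text):
            return ()  # consumed the whole string: success
        # Try the longest candidate first, then retrace to shorter ones.
        for end in range(min(len(text), start + max_len), start, -1):
            piece = text[start:end]
            if piece in words:
                rest = solve(end)
                if rest is not None:
                    return (piece,) + rest
        return None  # no dictionary word starts a valid split here

    result = solve(0)
    return None if result is None else list(result)


if __name__ == "__main__":
    print(segment("thereallibrary"))
    print(segment("aboutus"))
```

With this toy dictionary, "thereallibrary" first goes down the "there all i ..." path, fails on "brary", and only then falls back to "the real library". Remembering the dead ends keeps the retracing cheap, but it still costs a lot more than ordinary tokenization, which is why it makes sense to run it only on the few tokens that look like run-together words.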