Interesting. I don't have access to a Japanese dictionary, so I just
extract bi-grams. But I guess that in this case, if one has access to an
English dictionary (are you aware of an open-source or free one, BTW?),
one can use the method you mention.

But still, doing this for every token you meet is extremely expensive (for
Japanese it is all you can do, but that case is rather special), so I'd
first make sure I can pinpoint the very small number of tokens that should
be processed like that.
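
To make the backtracking idea from Phil's example below concrete, here is a
minimal sketch in Python. It assumes a small in-memory word set; the names
segment and WORDS are made up for illustration and are not part of Lucene.
It tries the longest dictionary word first and retraces its steps when the
rest of the string cannot be split, remembering dead-end offsets so the same
suffix is not searched twice.

```python
from typing import List, Optional, Set


def segment(text: str, dictionary: Set[str]) -> Optional[List[str]]:
    """Split text into dictionary words via depth-first search with backtracking.

    Returns the first split found, or None if no split exists.
    """
    dead_ends = set()  # start offsets already known to have no valid split

    def walk(start: int) -> Optional[List[str]]:
        if start == len(text):
            return []  # consumed the whole string: success
        if start in dead_ends:
            return None
        # Try the longest candidate first; shorter ones are the fallback.
        for end in range(len(text), start, -1):
            word = text[start:end]
            if word in dictionary:
                rest = walk(end)
                if rest is not None:
                    return [word] + rest
        dead_ends.add(start)  # nothing worked from here: retrace our steps
        return None

    return walk(0)


# Hypothetical toy dictionary, for illustration only.
WORDS = {"about", "us", "the", "there", "real", "all", "i", "library"}

print(segment("aboutus", WORDS))         # ['about', 'us']
print(segment("thereallibrary", WORDS))  # ['the', 'real', 'library']
```

On "thereallibrary" this first commits to "there", fails at "brary", and only
then backs up to "the", which is the wasted work I would want to avoid paying
for every token.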

Shai

On Tue, Aug 4, 2009 at 6:37 PM, Phil Whelan <phil...@gmail.com> wrote:

> On Tue, Aug 4, 2009 at 8:31 AM, Shai Erera<ser...@gmail.com> wrote:
> > Hi Darren,
> >
> > The question was, how given a string "aboutus" in a document, you can
> > return that document as a result to the query "about us" (note the
> > space). So we're mostly discussing how to detect and then break the
> > word "aboutus" to two words.
>
> When traversing Japanese text you have to use an algorithm similar to
> searching a maze (keep left and retrace your steps). It's possible to
> go a long way along a sentence before you find that the tokens you've
> already picked out are invalid. Rough example...
>
> thereallibrary
> there allibrary
> there all i brary (fail)
> the reallibrary
> the real library
>
