On Tue, Aug 4, 2009 at 3:56 AM, Shai Erera<ser...@gmail.com> wrote:
> 2) Use a dictionary (real dictionary), and search it for every substring,
> e.g. "a", "ab", "abo" ... "about" etc. If you find a match, split it there.
> This needs some fine tuning, like checking if the rest is also a word and if
> the full string is also a word, so that you don't break up meaningful words.
> You'll need to get a dictionary for that.

I do not have a solution to this, but it strikes me as very similar to
the way you traverse Japanese text to break it into words, since
Japanese has no spaces. Is there a Japanese tokenizer and, if so, does
it handle this? If so, you could replace the Japanese dictionary with
an English dictionary. Just a random thought I had that might or might
not help.
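
For what it's worth, here is a minimal sketch of the dictionary-splitting
idea quoted above. It is plain Java and does not use any Lucene API. The
class name DictionarySplitter, the method split, and the tiny word set are
all made up for illustration. Instead of cutting at the first prefix that
matches, it looks for the split that uses the fewest dictionary words,
which is one possible way to avoid breaking up a string that is already a
meaningful word. If no full split exists, the input comes back unchanged.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Lucene.
class DictionarySplitter {

    // Splits text into the fewest dictionary words that cover it entirely.
    // Returns the text as a single element if no such split exists.
    static List<String> split(String text, Set<String> dictionary) {
        int n = text.length();
        // best[i] = fewest words covering text[0..i), or -1 if impossible
        int[] best = new int[n + 1];
        // prev[i] = start index of the last word in that best split
        int[] prev = new int[n + 1];
        Arrays.fill(best, -1);
        best[0] = 0;

        for (int end = 1; end <= n; end++) {
            for (int start = 0; start < end; start++) {
                if (best[start] < 0) {
                    continue; // the prefix before start cannot be split
                }
                if (!dictionary.contains(text.substring(start, end))) {
                    continue; // text[start..end) is not a word
                }
                int candidate = best[start] + 1;
                if (best[end] < 0 || candidate < best[end]) {
                    best[end] = candidate;
                    prev[end] = start;
                }
            }
        }

        if (n == 0 || best[n] < 0) {
            return Collections.singletonList(text);
        }

        // Walk back from the end to recover the words, then reverse.
        List<String> words = new ArrayList<>();
        for (int end = n; end > 0; end = prev[end]) {
            words.add(text.substring(prev[end], end));
        }
        Collections.reverse(words);
        return words;
    }

    public static void main(String[] args) {
        // Toy dictionary, assumed for the example only.
        Set<String> dict = new HashSet<>(
                Arrays.asList("a", "ab", "about", "bout", "us"));
        System.out.println(split("aboutus", dict)); // prints [about, us]
    }
}
```

With a real dictionary you would probably want extra tuning on top of
this, for example ignoring very short words, since entries like "a" tend
to produce odd splits.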

Phil

