On Tue, Aug 4, 2009 at 3:56 AM, Shai Erera<ser...@gmail.com> wrote: > 2) Use a dictionary (real dictionary), and search it for every substring, > e.g. "a", "ab", "abo" ... "about" etc. If you find a match, split it there. > This needs some fine tuning, like checking if the rest is also a word and if > the full string is also a word, so that you don't break up meaningful words. > You'll need to get a dictionary for that.
I do not have a solution to this, but it strikes me as very similar to they way you traverse Japanese to break words, since that has no spaces. Is there a Japanese tokenizer and, if so, does it handle this? If so, you could replace the Japanese dictionary with an English dictionary. Just a random thought had that might / might not help. Phil --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org