Kevin Atkinson wrote:
> Samphan Raruenrom wrote:
> > In Thai, we don't put spaces between words at all, so
> > the same situation happens naturally.
> > Typical Thai word-segmentation algorithms (which usually
> > do spell checking as well) use a maximal-match backtracking
> > algorithm with trie word list(s).
> > My implementation is at http://www.thai.net/libinthai/
> > The IBM Classes for Unicode implementation is at
> > http://www.ibm.com/java/education/boundaries/boundaries.html
> OK, so how do you detect boundaries of unknown or misspelled words?

The IBM ICU algorithm described at the above URL is:

: If we exhausted our possibilities without finding
: a valid sequence of words, it either means there's
: an error in the text, or the text includes a word
: that isn't in the dictionary. In either case, we restore
: the set of break positions that matched the most
: characters, advance one character past where the
: mismatch occurred in that sequence, and start over
: from there. This works pretty well: usually only
: one or two boundary positions around the error
: are in the wrong place.

---
Note: This message was originally posted to [EMAIL PROTECTED], not [EMAIL PROTECTED]

_______________________________________________
aspell-user mailing list
[EMAIL PROTECTED]
http://lists.sourceforge.net/mailman/listinfo/aspell-user
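
P.S. To make the recovery step concrete, here is a minimal Python sketch of maximal-match backtracking over a trie, with the "restore the best breaks, skip one character, start over" fallback from the passage quoted above. It is only an illustration under my own assumptions, not the ICU or libinthai code: the names build_trie, match_ends, and segment are hypothetical, the word list is a toy English one with the spaces removed, and merging adjacent unknown characters into one token is my own choice.

```python
END = ""  # sentinel key marking end-of-word; cannot collide with 1-char keys


def build_trie(words):
    """Build a nested-dict trie from an iterable of words."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def match_ends(trie, text, start):
    """End offsets (ascending) of every dictionary word beginning at start."""
    node, ends = trie, []
    for i in range(start, len(text)):
        node = node.get(text[i])
        if node is None:
            break
        if END in node:
            ends.append(i + 1)
    return ends


def _search(text, trie, start):
    """Longest-first depth-first search with backtracking.

    Returns (complete, breaks): breaks is either a full segmentation of
    text[start:] or the partial one that matched the most characters.
    """
    best, visited = [], set()
    stack = [(start, [])]
    while stack:
        pos, path = stack.pop()
        if pos == len(text):
            return True, path
        if pos in visited:  # already explored from here, known dead end
            continue
        visited.add(pos)
        if path and (not best or path[-1] > best[-1]):
            best = path
        # Ascending push order means the longest match is popped first.
        for end in match_ends(trie, text, pos):
            stack.append((end, path + [end]))
    return False, best


def segment(text, trie):
    """Split text into (token, known) pairs, recovering from unknown text."""
    tokens, pos = [], 0
    while pos < len(text):
        complete, breaks = _search(text, trie, pos)
        prev = pos
        for end in breaks:  # restore the best set of break positions
            tokens.append((text[prev:end], True))
            prev = end
        if complete:
            break
        # Advance one character past the mismatch and start over.
        ch = text[prev]
        if tokens and not tokens[-1][1]:
            tokens[-1] = (tokens[-1][0] + ch, False)  # extend unknown run
        else:
            tokens.append((ch, False))
        pos = prev + 1
    return tokens


trie = build_trie(["the", "them", "theme", "me", "men", "end", "ends"])
print(segment("themend", trie))  # backtracks from "theme" to "them" + "end"
print(segment("thexmen", trie))  # "x" is reported as an unknown token
```

The unknown runs are what a spell checker would then inspect. As the quoted passage says, a boundary or two next to the error may still land in the wrong place.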
