Hi, I'm not sure if this has been brought up before: I found a bug in StandardTokenizer which misinterprets non-ASCII characters as word boundaries. It only shows up in words that also contain non-alphanumeric characters (e.g. hyphens).
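
In case it's useful, here's roughly the script I used to reproduce it. This is a minimal sketch, assuming the usual Ferret::Analysis setup where StandardTokenizer.new takes the input string and #next returns Token objects (with #text) until the stream is exhausted:

  require 'ferret'
  include Ferret::Analysis

  # Collect all token texts the tokenizer produces for a given input.
  def token_texts(text)
    stream = StandardTokenizer.new(text)
    texts  = []
    while token = stream.next
      texts << token.text
    end
    texts
  end

  p token_texts("Gerd Schröder Straße")
  p token_texts("Gerd-Schroeder-Strasse")
  p token_texts("Gerd-Schröder-Straße")   # this one splits inside the word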
Consider this example:

The text 'Gerd Schröder Straße' is properly tokenized to:

  ["Gerd", "Schröder", "Straße"]

as is 'Gerd-Schroeder-Strasse':

  ["Gerd-Schroeder-Strasse"]

but 'Gerd-Schröder-Straße' yields:

  ["Gerd-Schr", "öder-Stra", "ße"]

So apparently, multibyte and non-word characters don't mix...

Cheers,
Andy

