2009/1/11 Gerard Meijssen <gerard.meijs...@gmail.com>:
> How many characters are there according to your software in the word Mbɔ́tɛ?
> The correct answer is 5
Since I was working with the enwiki dump, I did not pay much attention to internationalisation issues. I arbitrarily defined a "word" as the Python regular expression:

    [\w\d]+

So, the answer to your question depends on how Python implements the \w word-matching regular expression atom:

"When the LOCALE and UNICODE flags are not specified, matches any alphanumeric character and the underscore; this is equivalent to the set [a-zA-Z0-9_]. With LOCALE, it will match the set [0-9_] plus whatever characters are defined as alphanumeric for the current locale. If UNICODE is set, this will match the characters [0-9_] plus whatever is classified as alphanumeric in the Unicode character properties database."

Greg Hewgill
http://hewgill.com
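
For illustration, here is a minimal sketch of how the [\w\d]+ pattern treats the word from the question. It assumes Python 3, where str patterns use Unicode matching by default and re.ASCII roughly reproduces the "no flags" behaviour quoted above; the original analysis may well have run under different settings. The names WORD, tokens and visible_characters are made up for this example, and counting non-combining code points is only a rough stand-in for proper grapheme segmentation.

```python
import re
import unicodedata

# "Mbɔ́tɛ" written out by code point. The acute accent is a separate
# combining mark (U+0301), since there is no precomposed form of ɔ́.
WORD = "Mb\u0254\u0301t\u025b"

# The "word" pattern from the message. \d is redundant here because
# \w already covers digits.
WORD_RE = r"[\w\d]+"


def tokens(text, flags=0):
    """Return every run of characters that the word pattern matches."""
    return re.findall(WORD_RE, text, flags)


def visible_characters(text):
    """Count code points that are not combining marks (an approximation)."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


ascii_tokens = tokens(WORD, re.ASCII)   # \w limited to [a-zA-Z0-9_]
unicode_tokens = tokens(WORD)           # \w uses Unicode alphanumerics
code_points = len(WORD)
visible = visible_characters(WORD)

print("ASCII \\w tokens:  ", ascii_tokens)
print("Unicode \\w tokens:", unicode_tokens)
print("code points:", code_points, "visible characters:", visible)
```

With ASCII matching only "Mb" and "t" survive as tokens. With Unicode matching the letters ɔ and ɛ are recognised, but the standard re module does not count the combining accent as alphanumeric, so the word still splits into "Mbɔ" and "tɛ". The string holds 6 code points, and skipping the combining mark gives the 5 characters mentioned in the question.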