>> Just to give another example: Uzbek in Latin script uses "o'" and "g'"
>> as opposed to "o" and "g", such as in the language designation
>> "O'zbek" where "o'" stands for the sound designated in Cyrillic script
>> by U+040E and "g'" is equivalent to U+0493.

MC> "O'zbek" would not split, because the apostrophe is not followed by "a",
MC> "e", "i", "o", "u" or "y".

"G'iyosaddin" would split, though. (Sorry for the silly word: it's the
middle name of a medieval poet, but it was the first thing that came to
mind. "g'" followed by a vowel is common enough in Uzbek that this is far
from the only case.)

You can't sensibly base a general-purpose word-splitting algorithm on the
French and Italian definition of "vowel". It is probably impossible to
split correctly without knowing the language of the encoded string.

Philipp
mailto:[EMAIL PROTECTED]
___________________
With searching comes loss / and the presence of absence / The data, not found
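
P.S. To make the objection concrete, here is a minimal sketch of the rule as I read it from the quoted text: split after an apostrophe whenever the next letter is "a", "e", "i", "o", "u" or "y". The function names, the language tags, and the idea of gating the rule on a caller-supplied language are my own assumptions for illustration, not anything proposed in this thread.

```python
# Vowel set taken from the quoted rule (French/Italian-style elision).
VOWELS = set("aeiouy")


def split_elision(word: str) -> list[str]:
    """Split after every apostrophe that is directly followed by a vowel."""
    parts = []
    start = 0
    for i, ch in enumerate(word):
        if ch == "'" and i + 1 < len(word) and word[i + 1].lower() in VOWELS:
            parts.append(word[start:i + 1])  # keep the apostrophe with the left part
            start = i + 1
    parts.append(word[start:])
    return parts


def split_word(word: str, lang: str) -> list[str]:
    """Hypothetical language gate: apply the elision rule only where it belongs."""
    if lang in {"fr", "it"}:  # assumed tags; a real list would need more care
        return split_elision(word)
    return [word]


print(split_elision("l'amico"))          # Italian: the split is wanted
print(split_elision("O'zbek"))           # left alone, but only because "z" follows
print(split_elision("G'iyosaddin"))      # wrongly split into "G'" and "iyosaddin"
print(split_word("G'iyosaddin", "uz"))   # left alone once the language is known
```

Without the language gate, the rule handles "O'zbek" correctly only by accident and breaks "G'iyosaddin". With it, the caller has to supply the language, which is exactly the information a plain encoded string does not carry.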