>> Just to give another example: Uzbek in Latin script uses "o'" and "g'"
>> as opposed to "o" and "g", such as in the language designation
>> "O'zbek" where "o'" stands for the sound designated in Cyrillic script
>> by U+040E and "g'" is equivalent to U+0493.

MC> "O'zbek" would not split, because the apostrophe is not followed by "a",
MC> "e", "i", "o", "u" or "y".

"G'iyosaddin" would, though. (Sorry for the obscure word: it is the
middle name of a medieval poet and simply the first thing that came to
mind. "g'" is common enough in Uzbek that this is far from the only
case.) You can't sensibly base a general-purpose word-splitting
algorithm on the French and Italian definition of "vowel".
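
To make the failure concrete, here is a minimal sketch of the rule as
quoted above (break after an apostrophe when the next letter is "a",
"e", "i", "o", "u" or "y"). The function name split_on_elision and the
exact break position are assumptions made for illustration, not the
actual proposal's implementation.

```python
def split_on_elision(word, vowels="aeiouy"):
    """Split after each apostrophe that is followed by a listed vowel.

    Hypothetical reconstruction of the quoted rule, for illustration only.
    """
    parts, start = [], 0
    for i, ch in enumerate(word[:-1]):
        if ch == "'" and word[i + 1].lower() in vowels:
            parts.append(word[start:i + 1])  # keep the apostrophe on the left part
            start = i + 1
    parts.append(word[start:])
    return parts


print(split_on_elision("l'amour"))      # French elision: split, as intended
print(split_on_elision("O'zbek"))       # next letter is "z": left whole
print(split_on_elision("G'iyosaddin"))  # next letter is "i": wrongly split
```

The third call breaks a single Uzbek word in two, because "g'" is one
letter there and the following "i" just happens to be on the vowel list.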

It is probably impossible to split words correctly at apostrophes
without knowing the language of your encoded string.
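
A rough sketch of what "looking at the language" could mean in practice:
apply the apostrophe rule only when the caller states that the text is
in a language that uses elision. The set ELISION_LANGUAGES, the function
name split_word and the use of a language tag such as "fr" or "uz" are
all assumptions for illustration; a real splitter would need per-language
rules rather than a single switch.

```python
# Assumption: only these languages get the vowel-after-apostrophe rule.
ELISION_LANGUAGES = {"fr", "it"}
ELISION_VOWELS = "aeiouy"


def split_word(word, lang):
    """Split at elision apostrophes only if `lang` is an elision language."""
    # Reduce a tag such as "it-IT" or "fr_CA" to its primary subtag.
    primary = lang.lower().replace("_", "-").split("-")[0]
    if primary not in ELISION_LANGUAGES:
        return [word]  # e.g. Uzbek: the apostrophe is part of the letter

    parts, start = [], 0
    for i, ch in enumerate(word[:-1]):
        if ch == "'" and word[i + 1].lower() in ELISION_VOWELS:
            parts.append(word[start:i + 1])
            start = i + 1
    parts.append(word[start:])
    return parts


print(split_word("l'amour", "fr"))      # split
print(split_word("dell'arte", "it-IT")) # split
print(split_word("G'iyosaddin", "uz"))  # left whole
```

This only works when the language is known out of band; the string alone
does not carry that information.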

  Philipp                            mailto:[EMAIL PROTECTED]
___________________
With searching comes loss / and the presence of absence / The data, not found

