On Fri, 29 May 2009 04:09:53 +0000, John Machin wrote:

> John Machin <sjmachin <at> lexicon.net> writes:
>
>> Andrew Fong <FongAndrew <at> gmail.com> writes:
>>
>> > Are there any built-in ways to do something like this already? Or do I
>> > just have to iterate over the unicode string?
>>
>> Converting each character to utf8 and checking the total number of
>> bytes so far?
>> Ooooh, sloooowwwwww!
>>
>
> Somewhat faster:

What's wrong with Peter Otten's solution?

    >>> u"äöü".encode("utf8")[:5].decode("utf8", "ignore")
    u'\xe4\xf6'

At most, you should have one error, at the very end. If you ignore it,
you get the longest prefix of the unicode string whose UTF-8 encoding is
at most 5 *bytes* long. (If you encode using a different codec, you will
likely get a different number of bytes.)

-- 
Steven
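
For reuse, the encode/slice/decode idea can be wrapped in a small helper. This is only a sketch: the name truncate_utf8 and its max_bytes parameter are made up for illustration, and it is written for Python 3, where str is already unicode (the interpreter session quoted in this message is Python 2).

```python
def truncate_utf8(text, max_bytes):
    """Return the longest prefix of text whose UTF-8 encoding fits in max_bytes.

    Hypothetical helper, not part of the standard library.
    """
    if max_bytes < 0:
        # A negative limit would make the slice count from the end instead.
        raise ValueError("max_bytes must be non-negative")
    # Encode, cut at the byte limit, then let the "ignore" error handler drop
    # the (at most one) partial character left dangling at the end.
    return text.encode("utf-8")[:max_bytes].decode("utf-8", "ignore")


if __name__ == "__main__":
    print(truncate_utf8("äöü", 5))  # äö
```

Because the input is valid UTF-8 before slicing, the only possible decoding error is a character split at the cut, so "ignore" cannot silently discard anything from the middle of the string.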