On Mon, Oct 28, 2013 at 9:48 AM, Buck Golemon <b...@yelp.com> wrote:

> On Mon, Oct 28, 2013 at 6:06 AM, "Jörg Knappen" <jknap...@web.de> wrote:
>
>> Hi Steffen,
>>
>> data aren't that easy. There are non-latin1-characters encoded in the
>> UTF8 part. I expect among others typographic apostrophes, polish
>> characters, some mediaevalist characters like ũ (u with tilde). Maybe,
>> there is also some greek inside, but I am not sure about that.
>>
>> --Jörg Knappen
>>
>> *Sent:* Monday, 28 October 2013 at 12:34
>> *From:* "Steffen \"Daode\" Nurpmeso" <sdao...@gmail.com>
>> *To:* "Jörg Knappen" <jknap...@web.de>
>> *Cc:* unicode@unicode.org
>> *Subject:* Re: Do you know a tool to decode "UTF-8 twice"
>>
>> "Jörg Knappen" <jknap...@web.de> wrote:
>> | Is there a ready made tool that decodes "UTF-8 twice" while keeping
>> | UTF-8 proper in place?
>>
>> Isn't a shell script with a truly validating iconv(1) enough?
>> This works for me if in utf8.1 there is 'ÄEIÖÜ' in UTF-8 and i run
>>
>> ?0[steffen@sherwood tmp]$ iconv -f latin1 -t utf8 < utf8.1 > utf8.2
>>
>> As in
>>
>> for i in utf8.1 utf8.2; do
>>   if iconv -f utf8 -t latin1 < ${i} |
>>       iconv -f utf8 -t utf8 >/dev/null 2>&1; then
>>     echo ${i}: bummer, going home by one
>>     iconv -f utf8 -t latin1 < ${i} > ${i}.new 2>&1
>>   else
>>     echo ${i}: valid UTF-8
>>   fi
>> done
>>
>> i'll end up as
>>
>> ?0[steffen@sherwood tmp]$ sh utf8dec.sh
>> utf8.1: valid UTF-8
>> utf8.2: bummer, going home by one
>> ?0[steffen@sherwood tmp]$
>>
>> Ciao,
>>
>> | --Jörg Knappen
>>
>> --steffen
>
> Jörg: There's no ready-made tool, but it's easy to write in python.
> I'll provide you a well-tested function in a few minutes.

Jörg:
Here is my function (also attached):
http://paste.pound-python.org/show/jAfKzb5HEyOeGvyF7O9W/

You can either make a larger python program with it, or expose it directly
to shell scripting.

These are the tests passed:

A latin1-encoded string should become utf8-encoded ... ok
An un-encoded unicode string should just become utf8-encoded ... ok
A utf8-encoded string should be unchanged ... ok
A poorly-encoded utf8+latin1 string should become utf8-encoded ... ok
A string mangled by utf8+latin1 several times should become utf8-encoded ... ok

----------------------------------------------------------------------
Ran 5 tests in 0.001s

OK
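To make the test names concrete: "utf8+latin1" refers to UTF-8 bytes that were misread as latin1 and then encoded to UTF-8 a second time. The following is only an illustrative sketch, written for Python 3 (an assumption; the attached code targets Python 2), and the variable names are made up for the example.

```python
# Illustrative only: how "UTF-8 twice" arises and how one round is undone.
original = "Łódź"
good = original.encode("utf-8")  # proper UTF-8 bytes

# The mistake: the UTF-8 bytes are misread as latin1 and encoded again.
mangled = good.decode("latin-1").encode("utf-8")

# Undoing one round: decode as UTF-8, then map each character back to a byte.
repaired = mangled.decode("utf-8").encode("latin-1")

print(good)
print(mangled)
print(repaired == good)
```

Repeating the middle step stacks further layers, which is the case the "several times" test covers.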
# -*- coding: UTF-8 -*-
# Written for Python 2 (it relies on the `unicode` type).
def recode_utf8(data):
    """
    Given a string which is either:
        * unicode
        * well-encoded utf8
        * well-encoded latin1
        * poorly-encoded utf8+latin1

    Return the equivalent utf8-encoded byte string.
    """
    if isinstance(data, unicode):
        # The input is already decoded. Just return the utf8.
        return data.encode('UTF-8')

    try:
        decoded = data.decode('UTF-8')
    except UnicodeDecodeError:
        # Indicates latin1 encoded bytes.
        decoded = data.decode('latin1')

    while True:
        # Check if the data is poorly-encoded as utf8+latin1
        try:
            encoded = decoded.encode('latin1')
        except UnicodeEncodeError:
            # Indicates non-latin1-encodable characters; it's not utf8+latin1.
            return decoded.encode('UTF-8')

        try:
            redecoded = encoded.decode('UTF-8')
        except UnicodeDecodeError:
            # Can't decode the latin1 as utf8; it's not utf8+latin1.
            return decoded.encode('UTF-8')

        if redecoded == decoded:
            # Nothing changed (e.g. pure ASCII input); stop here, otherwise
            # the loop would never terminate.
            return decoded.encode('UTF-8')
        decoded = redecoded


import unittest as T


class TestRecodeUtf8(T.TestCase):
    latin1 = u'München'  # encodable to latin1
    utf8 = u'Łódź'  # not encodable to latin1

    def test_unicode(self):
        "An un-encoded unicode string should just become utf8-encoded"
        self.assertEqual(
            recode_utf8(self.utf8),
            self.utf8.encode('UTF-8'),
        )

    def test_utf8(self):
        "A utf8-encoded string should be unchanged"
        utf8 = self.utf8.encode('UTF-8')
        self.assertEqual(
            recode_utf8(utf8),
            utf8,
        )

    def test_latin1(self):
        "A latin1-encoded string should become utf8-encoded"
        self.assertEqual(
            recode_utf8(self.latin1.encode('latin1')),
            self.latin1.encode('UTF-8'),
        )

    def test_utf8_plus_latin1(self):
        "A poorly-encoded utf8+latin1 string should become utf8-encoded"
        utf8 = self.utf8.encode('UTF-8')
        poorly_encoded = utf8.decode('latin1').encode('UTF-8')
        self.assertEqual(
            recode_utf8(poorly_encoded),
            utf8,
        )

    def test_utf8_plus_latin1_several_times(self):
        "A string mangled by utf8+latin1 several times should become utf8-encoded"
        utf8 = self.utf8.encode('UTF-8')
        poorly_encoded = utf8
        for x in range(10):
            poorly_encoded = poorly_encoded.decode('latin1').encode('UTF-8')
        self.assertEqual(
            recode_utf8(poorly_encoded),
            utf8,
        )


if __name__ == '__main__':
    T.main()
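As for exposing this to shell scripting, one possible shape is a stdin-to-stdout filter. The sketch below is a Python 3 port, not the attached code: `filter_stream` is a hypothetical helper, and processing the input line by line is an assumption made so that a file mixing proper UTF-8 lines with doubly-encoded ones is repaired per line. It also stops when a pass changes nothing, so pure-ASCII input terminates.

```python
import io
import sys


def recode_utf8(data: bytes) -> bytes:
    """Python 3 sketch of the function above; takes and returns bytes."""
    try:
        decoded = data.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8: treat the input as latin1, as the original does.
        decoded = data.decode("latin-1")
    while True:
        try:
            candidate = decoded.encode("latin-1").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            # Either step failing means there is no further layer to peel.
            break
        if candidate == decoded:
            # Fixed point (e.g. pure ASCII): stop instead of looping forever.
            break
        decoded = candidate
    return decoded.encode("utf-8")


def filter_stream(src, dst):
    """Hypothetical helper: recode each line of binary stream src into dst."""
    for line in src:
        dst.write(recode_utf8(line))


# Small in-memory demonstration: one proper line, one doubly-encoded line,
# one ASCII line.
twice = "Łódź\n".encode("utf-8").decode("latin-1").encode("utf-8")
src = io.BytesIO("Łódź\n".encode("utf-8") + twice + b"plain ascii\n")
dst = io.BytesIO()
filter_stream(src, dst)
sys.stdout.write(dst.getvalue().decode("utf-8"))
```

To use it as an actual filter, one would call `filter_stream(sys.stdin.buffer, sys.stdout.buffer)` instead of the demonstration. One limitation to keep in mind: proper UTF-8 text consisting only of latin1 characters that happen to spell a valid UTF-8 byte sequence (a literal "Ã¼", say) cannot be told apart from mangled text and would be "repaired" as well.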