Re: Re: Do you know a tool to decode "UTF-8 twice"

Buck Golemon Mon, 28 Oct 2013 10:24:06 -0700

On Mon, Oct 28, 2013 at 9:48 AM, Buck Golemon <b...@yelp.com> wrote:

>
>
>
> On Mon, Oct 28, 2013 at 6:06 AM, "Jörg Knappen" <jknap...@web.de> wrote:
>
>> Hi Steffen,
>>
>> data aren't that easy. There are non-latin1-characters encoded in the
>> UTF8 part. I expect
>> among others typographic apostrophes, polish characters, some
>> mediaevalist characters like
>> ũ (u with tilde). Maybe, there is also some greek inside, but I am not
>> sure about that.
>>
>> --Jörg Knappen
>>
>> *Sent:* Monday, 28 October 2013 at 12:34
>> *From:* "Steffen \"Daode\" Nurpmeso" <sdao...@gmail.com>
>> *To:* "Jörg Knappen" <jknap...@web.de>
>> *Cc:* unicode@unicode.org
>> *Subject:* Re: Do you know a tool to decode "UTF-8 twice"
>> "Jörg Knappen" <jknap...@web.de> wrote:
>> | Is there a ready made tool that decodes "UTF-8 twice" while keeping
>> | UTF-8 proper in place?
>>
>> Isn't a shell script with a truly validating iconv(1) enough?
>> This works for me if utf8.1 contains 'ÄEIÖÜ' in UTF-8 and I run
>>
>> ?0[steffen@sherwood tmp]$ iconv -f latin1 -t utf8 < utf8.1 > utf8.2
>>
>> As in
>>
>> for i in utf8.1 utf8.2; do
>>   if iconv -f utf8 -t latin1 < ${i} |
>>       iconv -f utf8 -t utf8 >/dev/null 2>&1; then
>>     echo ${i}: bummer, going home by one
>>     iconv -f utf8 -t latin1 < ${i} > ${i}.new 2>&1
>>   else
>>     echo ${i}: valid UTF-8
>>   fi
>> done
>>
>> I end up with
>>
>> ?0[steffen@sherwood tmp]$ sh utf8dec.sh
>> utf8.1: valid UTF-8
>> utf8.2: bummer, going home by one
>> ?0[steffen@sherwood tmp]$
>>
>> Ciao,
>>
>> | --Jörg Knappen
>>
>> --steffen
>>
>
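
For comparison, the check that the quoted shell loop performs can be sketched in Python 3. This is only a rough equivalent, not Steffen's script: the name looks_double_encoded is made up for illustration, and the extra comparison against the input (so that pure ASCII is not reported as mangled) is my own addition.

```python
def looks_double_encoded(data: bytes) -> bool:
    """Return True if data is UTF-8 that still yields valid, different
    UTF-8 after one utf8 -> latin1 conversion step."""
    try:
        # Same idea as: iconv -f utf8 -t latin1 | iconv -f utf8 -t utf8
        once = data.decode("utf-8").encode("latin-1")
        once.decode("utf-8")
    except (UnicodeDecodeError, UnicodeEncodeError):
        # Either not UTF-8 at all, or proper UTF-8 that is not utf8+latin1.
        return False
    # Pure ASCII survives the round trip unchanged; that is not mangling.
    return once != data


proper = "ÄEIÖÜ".encode("utf-8")                     # like utf8.1
mangled = proper.decode("latin-1").encode("utf-8")   # like utf8.2
print(looks_double_encoded(proper))
print(looks_double_encoded(mangled))
```

Like the shell version, this judges the input as a whole, so a file that mixes proper non-latin1 characters with double-encoded ones is reported as not double-encoded.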
> Jörg: There's no ready-made tool, but it's easy to write in python.
> I'll provide you a well-tested function in a few minutes.
>
>
>
>
Jörg:


Here is my function (also attached):
http://paste.pound-python.org/show/jAfKzb5HEyOeGvyF7O9W/
You can either build a larger Python program around it or expose it
directly to shell scripting.
These are the tests it passes:

A latin1-encoded string should become utf8-encoded ... ok
An un-encoded unicode string should just become utf8-encoded ... ok
A utf8-encoded string should be unchanged ... ok
A poorly-encoded utf8+latin1 string should become utf8-encoded ... ok
A string mangled by utf8+latin1 several times should become utf8-encoded ... ok

----------------------------------------------------------------------
Ran 5 tests in 0.001s

OK

# -*- coding: UTF-8 -*-
def recode_utf8(data):
        """
        Given a string which is either:
         * unicode
         * well-encoded utf8
         * well-encoded latin1
         * poorly-encoded utf8+latin1
        Return the equivalent utf8-encoded byte string.
        """
        if isinstance(data, unicode):
                # The input is already decoded. Just return the utf8.
                return data.encode('UTF-8')

        try:
                decoded = data.decode('UTF-8')
        except UnicodeDecodeError:
                # Indicates latin1 encoded bytes.
                decoded = data.decode('latin1')

        while True:
                # Check if the data is poorly-encoded as utf8+latin1
                try:
                        encoded = decoded.encode('latin1')
                except UnicodeEncodeError:
                        # Indicates non-latin1-encodable characters; it's not utf8+latin1.
                        return decoded.encode('UTF-8')

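                try:
                        redecoded = encoded.decode('UTF-8')
                except UnicodeDecodeError:
                        # Can't decode the latin1 as utf8; it's not utf8+latin1.
                        return decoded.encode('UTF-8')

                if redecoded == decoded:
                        # Pure-ASCII text round-trips unchanged; return here,
                        # otherwise the loop would never terminate.
                        return decoded.encode('UTF-8')
                decoded = redecoded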


import unittest as T
class TestRecodeUtf8(T.TestCase):
        latin1 = u'München' # encodable to latin1
        utf8 = u'Łódź' # not encodable to latin1

        def test_unicode(self):
                "An un-encoded unicode string should just become utf8-encoded"
                self.assertEqual(
                                recode_utf8(self.utf8),
                                self.utf8.encode('UTF-8'),
                )

        def test_utf8(self):
                "A utf8-encoded string should be unchanged"
                utf8 = self.utf8.encode('UTF-8')
                self.assertEqual(
                                recode_utf8(utf8),
                                utf8,
                )

        def test_latin1(self):
                "A latin1-encoded string should become utf8-encoded"
                self.assertEqual(
                                recode_utf8(self.latin1.encode('latin1')),
                                self.latin1.encode('UTF-8'),
                )

        def test_utf8_plus_latin1(self):
                "A poorly-encoded utf8+latin1 string should become utf8-encoded"
                utf8 = self.utf8.encode('UTF-8')
                poorly_encoded = utf8.decode('latin1').encode('UTF-8')
                self.assertEqual(
                                recode_utf8(poorly_encoded),
                                utf8,
                )

        def test_utf8_plus_latin1_several_times(self):
                "A string mangled by utf8+latin1 several times should become utf8-encoded"
                utf8 = self.utf8.encode('UTF-8')
                poorly_encoded = utf8
                for x in range(10):
                        poorly_encoded = poorly_encoded.decode('latin1').encode('UTF-8')

                self.assertEqual(
                                recode_utf8(poorly_encoded),
                                utf8,
                )



if __name__ == '__main__':
        T.main()
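
The function above targets Python 2 (it relies on the unicode type and on str.decode). What follows is a minimal sketch of the same idea for Python 3, together with one way it might be exposed to shell scripting as a filter. The names recode_utf8_py3 and repair_stream are hypothetical. Repairing line by line assumes that any mangling is uniform within a line, which is a guess about the data. The sketch also stops when a round trip changes nothing, because pure-ASCII input would otherwise loop forever.

```python
import io


def recode_utf8_py3(data):
    """Python 3 sketch of recode_utf8: accept str or bytes, return UTF-8 bytes."""
    if isinstance(data, str):
        # Already decoded text; just encode it.
        return data.encode("utf-8")

    try:
        decoded = data.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8; treat the bytes as latin1.
        decoded = data.decode("latin-1")

    while True:
        try:
            # Try to undo one utf8+latin1 layer.
            candidate = decoded.encode("latin-1").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            # Not (or no longer) utf8+latin1.
            return decoded.encode("utf-8")
        if candidate == decoded:
            # Pure ASCII round-trips unchanged; stop instead of looping.
            return decoded.encode("utf-8")
        decoded = candidate


def repair_stream(src, dst):
    """Repair each line of the binary stream src and write it to dst."""
    for line in src:
        dst.write(recode_utf8_py3(line))


# Demo on in-memory data: one proper line, one mangled line, one latin1 line.
good = "Łódź\n".encode("utf-8")
src = io.BytesIO(
    good
    + good.decode("latin-1").encode("utf-8")
    + "München\n".encode("latin-1")
)
dst = io.BytesIO()
repair_stream(src, dst)
print(dst.getvalue().decode("utf-8"))
```

To use it from the shell, one could call repair_stream(sys.stdin.buffer, sys.stdout.buffer) from a small script. Text that legitimately contains sequences such as "Ã¤" would be converted as well, so this is best run only on data known to be affected.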
