Jörg: In case you want to see the previous discussions on the subject, here they are:
* "data for cp1252"
  http://www.unicode.org/mail-arch/unicode-ml/y2012-m11/0233.html
* "cp1252 decoder implementation"
  http://www.unicode.org/mail-arch/unicode-ml/y2012-m11/0167.html
* tangential: "latin1 decoder implementation"
  http://www.unicode.org/mail-arch/unicode-ml/y2012-m11/0146.html

On Wed, Jan 29, 2014 at 10:21 AM, Buck Golemon <b...@yelp.com> wrote:

> Jörg:
>
> This is the definition of cp1252 used by the whatwg and all current
> browser implementations.
> I've appealed to the cp1252 maintainer to update the definition so that we
> don't have two competing standards, but my request was rejected.
> I've been considering naming it cp1252-whatwg.
>
>
> On Wed, Jan 29, 2014 at 6:59 AM, "Jörg Knappen" <jknap...@web.de> wrote:
>
>> A little postscriptum to this old thread:
>>
>> On PyPI, there is now a codec available that handles the peculiar
>> definition of "latin1" inside mysql.
>> The package is called mysql-latin1-codec and features an encoding
>> consisting of cp1252 plus
>> 0x81, 0x8D, 0x8F, 0x90, 0x9D (these five bytes are undefined in
>> the python codec for cp1252).
>>
>> https://pypi.python.org/pypi/mysql-latin1-codec/1.0
>>
>> --Jörg Knappen
>>
>> *Sent:* Wednesday, 30 October 2013 at 19:14
>> *From:* "Buck Golemon" <b...@yelp.com>
>> *To:* "Frédéric Grosshans" <frederic.grossh...@gmail.com>
>> *Cc:* "Jörg Knappen" <jknap...@web.de>, unicode <unicode@unicode.org>
>> *Subject:* Re: Aw: Re: Re: Re: Re: Do you know a tool to decode "UTF-8
>> twice"
>>
>>
>> On Wed, Oct 30, 2013 at 9:56 AM, Frédéric Grosshans <
>> frederic.grossh...@gmail.com> wrote:
>>>
>>> On 30/10/2013 17:32, "Jörg Knappen" wrote:
>>>
>>>>
>>>> The data did not only contain latin-1 type mangling for the
>>>> non-existent Windows characters, but also sequences with the raw
>>>> C1 control characters for all of latin-1. So I had to do them, too.
>>>> The data weren't consistent at all, not even in their errors.
>>>> --Jörg Knappen
>>>
>>> Your question helped me dust off and repair a non-working Python
>>> snippet I wrote for a similar problem. I was stuck on the mixing of
>>> windows-1252 and latin1 controls (linked with Chinese characters). I
>>> include it below for reference.
>>>
>>> The Python snippet below does not need sed; it defines a function,
>>> unscramble(S), which works on strings. The extension to files should be
>>> easy.
>>>
>>> Frédéric Grosshans
>>>
>>>
>>> def Step1Filter(S):
>>>     for c in S:
>>>         # Works character by character because of the cp1252/latin1 ambiguity
>>>         try:
>>>             yield c.encode('cp1252')
>>>         except UnicodeEncodeError:
>>>             # Useful where cp1252 is undefined (81, 8D, 8F, 90, 9D)
>>>             yield c.encode('latin1')
>>>
>>> def unscramble(S):
>>>     return b''.join(Step1Filter(S)).decode('utf8')
>>>
>>> PS: If anyone is interested in a licence, I consider this simple enough
>>> to be in the public domain and uncopyrightable.
>>>
>>
>> This encoding you've implemented above is known as windows-1252 by the
>> whatwg and all browsers [1][2].
>> The implementation of cp1252 in python is instead a direct consequence of
>> the unicode.org definition [3].
>>
>> [1] http://encoding.spec.whatwg.org/index-windows-1252.txt
>> [2] http://bukzor.github.io/encodings/cp1252.html
>> [3] http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
>>
>
>
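
For anyone who wants to try the difference discussed in the quoted thread, here is a minimal Python 3 sketch. It assumes that the five bytes 0x81, 0x8D, 0x8F, 0x90 and 0x9D map to the C1 controls with the same code points, as in the WHATWG windows-1252 index cited in the thread. The names decode_whatwg_1252 and encode_whatwg_1252 are made up for illustration; they are not the API of mysql-latin1-codec or of any existing package.

```python
# Bytes that Python's built-in cp1252 codec leaves undefined.
UNDEFINED_IN_CP1252 = frozenset({0x81, 0x8D, 0x8F, 0x90, 0x9D})


def decode_whatwg_1252(data: bytes) -> str:
    """Decode as cp1252, but map the five undefined bytes to C1 controls.

    Hypothetical helper: assumes byte 0x81 -> U+0081, 0x8D -> U+008D, etc.
    """
    chars = []
    for byte in data:
        if byte in UNDEFINED_IN_CP1252:
            chars.append(chr(byte))  # same code point as in latin1
        else:
            chars.append(bytes([byte]).decode("cp1252"))
    return "".join(chars)


def encode_whatwg_1252(text: str) -> bytes:
    """Inverse of decode_whatwg_1252 (hypothetical helper)."""
    out = bytearray()
    for ch in text:
        if ord(ch) in UNDEFINED_IN_CP1252:
            out.append(ord(ch))  # C1 control goes back to the same byte value
        else:
            out += ch.encode("cp1252")  # raises UnicodeEncodeError if unmappable
    return bytes(out)


# "Á" is b"\xc3\x81" in UTF-8; the second byte is one of the undefined five.
mangled = decode_whatwg_1252("Á".encode("utf-8"))
repaired = encode_whatwg_1252(mangled).decode("utf-8")
```

Here repaired is "Á" again, whereas mangled.encode("cp1252") with the stock Python codec raises UnicodeEncodeError on the U+0081 character. Unlike the Step1Filter function in the quoted message, this sketch falls back only for those five code points, so it will not repair text that also contains other raw C1 controls.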
_______________________________________________
Unicode mailing list
Unicode@unicode.org
http://unicode.org/mailman/listinfo/unicode