[Python-Dev] New codecs checked in
I've checked in a whole bunch of newly generated codecs which now make
use of the faster charmap decoding variant added by Walter a short
while ago.

Please let me know if you find any problems.

Some codecs (esp. the Mac OS X ones) have minor changes. These
originate from updated mapping files on ftp.unicode.org.

I also added an alias iso8859_1 -> latin_1, so that applications using
the iso8859_1 encoding name can benefit from the faster native
implementation of the latin_1 codec.

Regards,
--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source (#1, Oct 22 2005)
>>> Python/Zope Consulting and Support ... http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ... http://python.egenix.com/

::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free !

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
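For anyone who wants to check the alias from an interpreter, here is a minimal sketch. It is written for Python 3 (an assumption; the thread predates it) and uses only the standard `codecs` module. It does not show which implementation is faster, only that both names resolve to codecs that decode every byte the same way:

```python
import codecs

# All 256 possible byte values, in order.
all_bytes = bytes(range(256))

# Look up both spellings of the encoding name.
via_alias = codecs.lookup("iso8859_1")
native = codecs.lookup("latin_1")

decoded_alias = all_bytes.decode("iso8859_1")
decoded_native = all_bytes.decode("latin_1")

# Latin-1 maps byte N to code point N, so both results are the same string.
same_result = decoded_alias == decoded_native
print(via_alias.name, native.name, same_result)
```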
Re: [Python-Dev] New codecs checked in
M.-A. Lemburg wrote:
> I've checked in a whole bunch of newly generated codecs
> which now make use of the faster charmap decoding variant added
> by Walter a short while ago.
>
> Please let me know if you find any problems.

I think we should work on eliminating the decoding_map variables.
There are some codecs which rely on them being present in other codecs
(e.g. koi8_u.py is based on koi8_r.py); however, this could be updated
to use, say

decoding_table = codecs.update_decoding_map(koi8_r.decoding_table, {
    0x00a4: 0x0454, # CYRILLIC SMALL LETTER UKRAINIAN IE
    0x00a6: 0x0456, # CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
    0x00a7: 0x0457, # CYRILLIC SMALL LETTER YI (UKRAINIAN)
    0x00ad: 0x0491, # CYRILLIC SMALL LETTER UKRAINIAN GHE WITH UPTURN
    0x00b4: 0x0404, # CYRILLIC CAPITAL LETTER UKRAINIAN IE
    0x00b6: 0x0406, # CYRILLIC CAPITAL LETTER BYELORUSSIAN-UKRAINIAN I
    0x00b7: 0x0407, # CYRILLIC CAPITAL LETTER YI (UKRAINIAN)
    0x00bd: 0x0490, # CYRILLIC CAPITAL LETTER UKRAINIAN GHE WITH UPTURN
})

With all these cross-references gone, the decoding_maps could also go.

Regards,
Martin
Re: [Python-Dev] New codecs checked in
Martin v. Löwis wrote:
> M.-A. Lemburg wrote:
>
>> I've checked in a whole bunch of newly generated codecs
>> which now make use of the faster charmap decoding variant added
>> by Walter a short while ago.
>>
>> Please let me know if you find any problems.
>
> I think we should work on eliminating the decoding_map variables.
> There are some codecs which rely on them being present in other codecs
> (e.g. koi8_u.py is based on koi8_r.py); however, this could be updated
> to use, say
>
> decoding_table = codecs.update_decoding_map(koi8_r.decoding_table, {
>     0x00a4: 0x0454, # CYRILLIC SMALL LETTER UKRAINIAN IE
>     0x00a6: 0x0456, # CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
>     0x00a7: 0x0457, # CYRILLIC SMALL LETTER YI (UKRAINIAN)
>     0x00ad: 0x0491, # CYRILLIC SMALL LETTER UKRAINIAN GHE WITH UPTURN
>     0x00b4: 0x0404, # CYRILLIC CAPITAL LETTER UKRAINIAN IE
>     0x00b6: 0x0406, # CYRILLIC CAPITAL LETTER BYELORUSSIAN-UKRAINIAN I
>     0x00b7: 0x0407, # CYRILLIC CAPITAL LETTER YI (UKRAINIAN)
>     0x00bd: 0x0490, # CYRILLIC CAPITAL LETTER UKRAINIAN GHE WITH UPTURN
> })
>
> With all these cross-references gone, the decoding_maps could also go.

Why should koi8_u.py be defined in terms of koi8_r.py anyway? Why not put
a complete decoding_table into koi8_u.py?

I'd like to suggest a small cosmetic change: gencodec.py should output
byte values with two hex digits instead of four. This makes it easier to
see what is a byte value and what is a codepoint. And it would make
grepping for stuff simpler.

I.e. change:

decoding_map.update({
    0x0080: 0x0402, # CYRILLIC CAPITAL LETTER DJE

to

decoding_map.update({
    0x80: 0x0402, # CYRILLIC CAPITAL LETTER DJE

and

decoding_table = (
    u'\x00' # 0x0000 -> NULL

to

decoding_table = (
    u'\x00' # 0x00 -> U+0000 NULL

and

encoding_map = {
    0x0000: 0x0000, # NULL

to

encoding_map = {
    0x0000: 0x00, # NULL
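The proposed convention (two hex digits for a byte, four for a code point, `U+` in comments) can be sketched as a pair of formatting helpers. This is only an illustration in Python 3, not gencodec.py's actual code; the function names and the `<unnamed>` fallback are made up for the example:

```python
import unicodedata


def format_decoding_entry(byte, codepoint):
    """Format one mapping line: two hex digits for the byte, four for the code point."""
    # Control characters have no Unicode name, so fall back to a placeholder.
    name = unicodedata.name(chr(codepoint), "<unnamed>")
    return "0x%02x: 0x%04x, # %s" % (byte, codepoint, name)


def format_table_comment(byte, codepoint):
    """Format the comment for a decoding_table entry, using a U+ prefix."""
    name = unicodedata.name(chr(codepoint), "<unnamed>")
    return "0x%02x -> U+%04X %s" % (byte, codepoint, name)


print(format_decoding_entry(0x80, 0x0402))
print(format_table_comment(0x80, 0x0402))
```

With widths fixed this way, a search for a two-digit value such as `0x80` can only match a byte, since every code point is written with four digits.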
Re: [Python-Dev] New codecs checked in
Walter Dörwald wrote:
> Martin v. Löwis wrote:
>
>> M.-A. Lemburg wrote:
>>
>>> I've checked in a whole bunch of newly generated codecs
>>> which now make use of the faster charmap decoding variant added
>>> by Walter a short while ago.
>>>
>>> Please let me know if you find any problems.
>>
>> I think we should work on eliminating the decoding_map variables.
>> There are some codecs which rely on them being present in other codecs
>> (e.g. koi8_u.py is based on koi8_r.py); however, this could be updated
>> to use, say
>>
>> decoding_table = codecs.update_decoding_map(koi8_r.decoding_table, {
>>     0x00a4: 0x0454, # CYRILLIC SMALL LETTER UKRAINIAN IE
>>     0x00a6: 0x0456, # CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
>>     0x00a7: 0x0457, # CYRILLIC SMALL LETTER YI (UKRAINIAN)
>>     0x00ad: 0x0491, # CYRILLIC SMALL LETTER UKRAINIAN GHE WITH UPTURN
>>     0x00b4: 0x0404, # CYRILLIC CAPITAL LETTER UKRAINIAN IE
>>     0x00b6: 0x0406, # CYRILLIC CAPITAL LETTER BYELORUSSIAN-UKRAINIAN I
>>     0x00b7: 0x0407, # CYRILLIC CAPITAL LETTER YI (UKRAINIAN)
>>     0x00bd: 0x0490, # CYRILLIC CAPITAL LETTER UKRAINIAN GHE WITH UPTURN
>> })
>>
>> With all these cross-references gone, the decoding_maps could also go.

I just left them in because I thought they wouldn't do any harm
and might be useful in some applications.

Removing them where not directly needed by the codec would not
be a problem.

> Why should koi8_u.py be defined in terms of koi8_r.py anyway? Why not put
> a complete decoding_table into koi8_u.py?

KOI8-U is not available as a mapping on ftp.unicode.org and
I only recreated codecs from the mapping files available
there.

> I'd like to suggest a small cosmetic change: gencodec.py should output
> byte values with two hex digits instead of four. This makes it easier to
> see what is a byte value and what is a codepoint. And it would make
> grepping for stuff simpler.

True.

I'll rerun the creation with the above changes sometime this
week.

--
Marc-Andre Lemburg
eGenix.com (Oct 24 2005)
Re: [Python-Dev] New codecs checked in
M.-A. Lemburg wrote:
> Walter Dörwald wrote:
>
>> Martin v. Löwis wrote:
>>
>>> M.-A. Lemburg wrote:
>>>
>>>> I've checked in a whole bunch of newly generated codecs
>>>> which now make use of the faster charmap decoding variant added
>>>> by Walter a short while ago.
>>>>
>>>> Please let me know if you find any problems.
>>>
>>> I think we should work on eliminating the decoding_map variables.
>>> There are some codecs which rely on them being present in other codecs
>>> (e.g. koi8_u.py is based on koi8_r.py); however, this could be updated
>>> to use, say
>>>
>>> decoding_table = codecs.update_decoding_map(koi8_r.decoding_table, {
>>>     0x00a4: 0x0454, # CYRILLIC SMALL LETTER UKRAINIAN IE
>>>     0x00a6: 0x0456, # CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
>>>     0x00a7: 0x0457, # CYRILLIC SMALL LETTER YI (UKRAINIAN)
>>>     0x00ad: 0x0491, # CYRILLIC SMALL LETTER UKRAINIAN GHE WITH UPTURN
>>>     0x00b4: 0x0404, # CYRILLIC CAPITAL LETTER UKRAINIAN IE
>>>     0x00b6: 0x0406, # CYRILLIC CAPITAL LETTER BYELORUSSIAN-UKRAINIAN I
>>>     0x00b7: 0x0407, # CYRILLIC CAPITAL LETTER YI (UKRAINIAN)
>>>     0x00bd: 0x0490, # CYRILLIC CAPITAL LETTER UKRAINIAN GHE WITH UPTURN
>>> })
>>>
>>> With all these cross-references gone, the decoding_maps could also go.
>
> I just left them in because I thought they wouldn't do any harm
> and might be useful in some applications.
>
> Removing them where not directly needed by the codec would not
> be a problem.

Recreating them is quite simple via dict(enumerate(decoding_table))
so I think we should remove them.

>> Why should koi8_u.py be defined in terms of koi8_r.py anyway? Why not put
>> a complete decoding_table into koi8_u.py?
>
> KOI8-U is not available as a mapping on ftp.unicode.org and
> I only recreated codecs from the mapping files available
> there.

OK, so we'd need something that creates a new decoding table from an old
one + changes, i.e. something like:

def update_decoding_table(table, new):
    # Copy the table into a list so single entries can be replaced.
    table = list(table)
    for (key, value) in new.iteritems():
        table[key] = unichr(value)
    return u"".join(table)

>> I'd like to suggest a small cosmetic change: gencodec.py should output
>> byte values with two hex digits instead of four. This makes it easier to
>> see what is a byte value and what is a codepoint. And it would make
>> grepping for stuff simpler.
>
> True.
>
> I'll rerun the creation with the above changes sometime this
> week.

Great, thanks!

Bye,
Walter Dörwald
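For readers on a current interpreter, the helper above can be tried out roughly as follows. This is a sketch ported to Python 3 (`items()` and `chr()` instead of `iteritems()` and `unichr()`), and it assumes that `encodings.koi8_r` exposes a 256-character `decoding_table`, as the generated codecs do. The `KOI8_U_CHANGES` name is made up here; its contents are the eight differences quoted in the thread:

```python
import encodings.koi8_r


def update_decoding_table(table, changes):
    """Return a copy of `table` with the byte -> code point `changes` applied."""
    chars = list(table)
    for byte, codepoint in changes.items():
        chars[byte] = chr(codepoint)
    return "".join(chars)


# The KOI8-U differences from KOI8-R quoted earlier in the thread.
KOI8_U_CHANGES = {
    0xA4: 0x0454,  # CYRILLIC SMALL LETTER UKRAINIAN IE
    0xA6: 0x0456,  # CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
    0xA7: 0x0457,  # CYRILLIC SMALL LETTER YI (UKRAINIAN)
    0xAD: 0x0491,  # CYRILLIC SMALL LETTER UKRAINIAN GHE WITH UPTURN
    0xB4: 0x0404,  # CYRILLIC CAPITAL LETTER UKRAINIAN IE
    0xB6: 0x0406,  # CYRILLIC CAPITAL LETTER BYELORUSSIAN-UKRAINIAN I
    0xB7: 0x0407,  # CYRILLIC CAPITAL LETTER YI (UKRAINIAN)
    0xBD: 0x0490,  # CYRILLIC CAPITAL LETTER UKRAINIAN GHE WITH UPTURN
}

base_table = encodings.koi8_r.decoding_table
decoding_table = update_decoding_table(base_table, KOI8_U_CHANGES)

# List the byte positions where the derived table differs from KOI8-R.
changed = [i for i in range(256) if decoding_table[i] != base_table[i]]
print(len(decoding_table), [hex(i) for i in changed])
```

Because the helper returns a new string, the KOI8-R table it starts from is left untouched.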
Re: [Python-Dev] New codecs checked in
Walter Dörwald wrote:
>>> I'd like to suggest a small cosmetic change: gencodec.py should output
>>> byte values with two hex digits instead of four. This makes it easier to
>>> see what is a byte value and what is a codepoint. And it would make
>>> grepping for stuff simpler.
>>
>> True.
>>
>> I'll rerun the creation with the above changes sometime this
>> week.
>
> Great, thanks!

Done. I had to create three custom mapping files for cp1140, koi8-u
and tis-620.

If you want more non-standard charmap codecs converted, please send me
the mapping files in the Unicode standard format for these files.

Thanks,
--
Marc-Andre Lemburg
eGenix.com (Oct 24 2005)
Re: [Python-Dev] New codecs checked in
Walter Dörwald wrote:
> Why should koi8_u.py be defined in terms of koi8_r.py anyway? Why not put
> a complete decoding_table into koi8_u.py?

Not sure. Unfortunately, the tables being used as source are not part
of the Python source, so nobody except MAL can faithfully regenerate them.

If they were part of the Python source, explicitly adding one for KOI8-U
would certainly be feasible.

> I.e. change:
>
> decoding_map.update({
>     0x0080: 0x0402, # CYRILLIC CAPITAL LETTER DJE

Hmm. I was suggesting to remove decoding_map completely, in which case
neither the current form nor your suggested cosmetic change would
survive.

> to
>
> decoding_table = (
>     u'\x00' # 0x00 -> U+0000 NULL

Using U+ in comments to denote the codepoints is a good idea, anyway.

Regards,
Martin
Re: [Python-Dev] New codecs checked in
M.-A. Lemburg wrote:
> I just left them in because I thought they wouldn't do any harm
> and might be useful in some applications.
>
> Removing them where not directly needed by the codec would not
> be a problem.

I think the memory usage they cause is measurable (I estimated 4KiB per
dictionary). More importantly, people apparently currently change
the dictionaries we provide and expect the codecs to automatically
pick up the modified mappings. It would be better if the breakage
is explicit (i.e. they get an AttributeError on the variable) instead
of implicit (their changes to the mapping simply have no effect
anymore).

> KOI8-U is not available as a mapping on ftp.unicode.org and
> I only recreated codecs from the mapping files available
> there.

I think we should come up with mapping tables for the additional
codecs as well, and maintain them in the CVS. This also applies
to things like rot13.

> I'll rerun the creation with the above changes sometime this
> week.

I hope I can finish my encoding routine shortly, which will again result
in changes to the codecs (replacing the encoding dictionaries
with other lookup tables).

Regards,
Martin
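The "implicit breakage" concern can be illustrated with a small sketch. It is written for Python 3 and assumes `encodings.koi8_r` exposes a `decoding_table` string, as the generated codecs do. The dictionary is rebuilt with the `dict(enumerate(...))` idiom mentioned earlier in the thread; the sample bytes are arbitrary, and the sizes printed depend on the Python build, so no figures are claimed here:

```python
import codecs
import sys
import encodings.koi8_r

table = encodings.koi8_r.decoding_table

# Rebuild a dictionary-style map from the table (byte value -> character).
decoding_map = dict(enumerate(table))

# Size of the container objects themselves; the numbers vary by Python build.
print(sys.getsizeof(decoding_map), sys.getsizeof(table))

data = b"\xf0\xd2\xc9\xd7\xc5\xd4"
via_map, _ = codecs.charmap_decode(data, "strict", decoding_map)
via_table, _ = codecs.charmap_decode(data, "strict", table)

# Editing the rebuilt dict only affects callers that pass it explicitly;
# the registered koi8_r codec keeps using its own table.
decoding_map[0xF0] = "?"
patched, _ = codecs.charmap_decode(data, "strict", decoding_map)
unaffected = data.decode("koi8_r")
print(patched != unaffected)
```

A caller who patches such a dictionary and then decodes through the registered codec sees no change at all, which is the silent failure an AttributeError would make visible.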
Re: [Python-Dev] New codecs checked in
M.-A. Lemburg wrote:
> I had to create three custom mapping files for cp1140, koi8-u
> and tis-620.

Can you please publish the files you have used somewhere? They
best go into the Python CVS.

Regards,
Martin
Re: [Python-Dev] New codecs checked in
Martin v. Löwis wrote:
> M.-A. Lemburg wrote:
>
>> I had to create three custom mapping files for cp1140, koi8-u
>> and tis-620.
>
> Can you please publish the files you have used somewhere? They
> best go into the Python CVS.

Sure; I'll check in the whole build machinery I'm using for this.

--
Marc-Andre Lemburg
eGenix.com (Oct 25 2005)
Re: [Python-Dev] New codecs checked in
Martin v. Löwis wrote:
> M.-A. Lemburg wrote:
>
>> I just left them in because I thought they wouldn't do any harm
>> and might be useful in some applications.
>>
>> Removing them where not directly needed by the codec would not
>> be a problem.
>
> I think the memory usage they cause is measurable (I estimated 4KiB per
> dictionary). More importantly, people apparently currently change
> the dictionaries we provide and expect the codecs to automatically
> pick up the modified mappings. It would be better if the breakage
> is explicit (i.e. they get an AttributeError on the variable) instead
> of implicit (their changes to the mapping simply have no effect
> anymore).

Agreed. I've already checked in the changes, BTW.

>> KOI8-U is not available as a mapping on ftp.unicode.org and
>> I only recreated codecs from the mapping files available
>> there.
>
> I think we should come up with mapping tables for the additional
> codecs as well, and maintain them in the CVS. This also applies
> to things like rot13.

Agreed.

>> I'll rerun the creation with the above changes sometime this
>> week.
>
> I hope I can finish my encoding routine shortly, which will again result
> in changes to the codecs (replacing the encoding dictionaries
> with other lookup tables).

Having seen the decode tables written as a long Unicode string, I think
that this may indeed also be a good solution for encoding. The major
improvement here is that the parser and compiler will do the work of
creating the table. At module load time, the .pyc file will only contain
a long string, which is very fast to create and load (unlike
dictionaries, which are set up dynamically at load time).

In general, it's better to do all the work up-front when creating the
codecs, rather than having run-time code repeat these tasks over and
over again.

--
Marc-Andre Lemburg
eGenix.com (Oct 25 2005)
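As a rough illustration of driving both directions from one string table, here is a Python 3 sketch. The toy table (identity mapping with byte 0x80 remapped to the euro sign) is hypothetical and is built at run time only to keep the example short; in a generated codec the table would be a literal in the module source. `codecs.charmap_build` is what current CPython offers for deriving an encoder lookup from such a string, and is not necessarily the routine discussed in this thread:

```python
import codecs

# Toy 256-entry decoding table: identity mapping, except byte 0x80 -> EURO SIGN.
# (Hypothetical codec, for illustration only.)
decoding_table = "".join(chr(i) for i in range(256)).replace("\x80", "\u20ac")

# Derive the encoder's lookup structure from the same string table.
encoding_table = codecs.charmap_build(decoding_table)

# Decode with the string table, then encode the result back to bytes.
text, consumed = codecs.charmap_decode(b"\x80 10", "strict", decoding_table)
raw, _ = codecs.charmap_encode(text, "strict", encoding_table)
print(repr(text), consumed, raw)
```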
Re: [Python-Dev] New codecs checked in
M.-A. Lemburg wrote:
> Martin v. Löwis wrote:
>
>> M.-A. Lemburg wrote:
>>
>>> I had to create three custom mapping files for cp1140, koi8-u
>>> and tis-620.
>>
>> Can you please publish the files you have used somewhere? They
>> best go into the Python CVS.
>
> Sure; I'll check in the whole build machinery I'm using for this.

Done.

In order to rebuild the codecs, cd Tools/unicode; make
then check the codecs in the created build/ subdir (e.g.
using comparecodecs.py) and copy them over to the
Lib/encodings/ directory.

--
Marc-Andre Lemburg
eGenix.com (Oct 25 2005)
Re: [Python-Dev] New codecs checked in
M.-A. Lemburg wrote:
> Done.
>
> In order to rebuild the codecs, cd Tools/unicode; make
> then check the codecs in the created build/ subdir (e.g.
> using comparecodecs.py) and copy them over to the
> Lib/encodings/ directory.

Thanks!

Martin