[Python-Dev] New codecs checked in

2005-10-21 Thread M.-A. Lemburg
I've checked in a whole bunch of newly generated codecs
which now make use of the faster charmap decoding variant added
by Walter a short while ago.

Please let me know if you find any problems.

Some codecs (esp. the Mac OS X ones) have minor changes.
These originate from updated mapping files on ftp.unicode.org.

I also added an alias iso8859_1 -> latin_1, so that applications
using the iso8859_1 encoding name can benefit from the faster
native implementation of the latin_1 codec.
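
As a minimal sketch of what the alias means in practice, the following checks that both names resolve to the same codec. It assumes a current Python 3 interpreter (the thread itself concerns Python 2.x) and uses only codecs.lookup from the standard library; the variable names are illustrative.

```python
import codecs

# Both spellings should resolve to the same codec entry,
# because iso8859_1 is registered as an alias of latin_1.
alias_info = codecs.lookup("iso8859_1")
native_info = codecs.lookup("latin_1")
same_codec = alias_info.name == native_info.name

# Decoding through either name gives the same text.
data = bytes(range(256))
same_result = data.decode("iso8859_1") == data.decode("latin_1")
```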

Regards,
-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Oct 22 2005)
>>> Python/Zope Consulting and Support ...http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/


::: Try mxODBC.Zope.DA for Windows,Linux,Solaris,FreeBSD for free ! 
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] New codecs checked in

2005-10-23 Thread Martin v. Löwis
M.-A. Lemburg wrote:
> I've checked in a whole bunch of newly generated codecs
> which now make use of the faster charmap decoding variant added
> by Walter a short while ago.
> 
> Please let me know if you find any problems.

I think we should work on eliminating the decoding_map variables.
There are some codecs which rely on them being present in other codecs
(e.g. koi8_u.py is based on koi8_r.py); however, this could be updated
to use, say

decoding_table = codecs.update_decoding_map(koi8_r.decoding_table, {
    0x00a4: 0x0454, #   CYRILLIC SMALL LETTER UKRAINIAN IE
    0x00a6: 0x0456, #   CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
    0x00a7: 0x0457, #   CYRILLIC SMALL LETTER YI (UKRAINIAN)
    0x00ad: 0x0491, #   CYRILLIC SMALL LETTER UKRAINIAN GHE WITH UPTURN
    0x00b4: 0x0404, #   CYRILLIC CAPITAL LETTER UKRAINIAN IE
    0x00b6: 0x0406, #   CYRILLIC CAPITAL LETTER BYELORUSSIAN-UKRAINIAN I
    0x00b7: 0x0407, #   CYRILLIC CAPITAL LETTER YI (UKRAINIAN)
    0x00bd: 0x0490, #   CYRILLIC CAPITAL LETTER UKRAINIAN GHE WITH UPTURN
})

With all these cross-references gone, the decoding_maps could also go.

Regards,
Martin


Re: [Python-Dev] New codecs checked in

2005-10-24 Thread Walter Dörwald
Martin v. Löwis wrote:

> M.-A. Lemburg wrote:
> 
>>I've checked in a whole bunch of newly generated codecs
>>which now make use of the faster charmap decoding variant added
>>by Walter a short while ago.
>>
>>Please let me know if you find any problems.
> 
> I think we should work on eliminating the decoding_map variables.
> There are some codecs which rely on them being present in other codecs
> (e.g. koi8_u.py is based on koi8_r.py); however, this could be updated
> to use, say
> 
> decoding_table = codecs.update_decoding_map(koi8_r.decoding_table, {
>     0x00a4: 0x0454, #   CYRILLIC SMALL LETTER UKRAINIAN IE
>     0x00a6: 0x0456, #   CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
>     0x00a7: 0x0457, #   CYRILLIC SMALL LETTER YI (UKRAINIAN)
>     0x00ad: 0x0491, #   CYRILLIC SMALL LETTER UKRAINIAN GHE WITH UPTURN
>     0x00b4: 0x0404, #   CYRILLIC CAPITAL LETTER UKRAINIAN IE
>     0x00b6: 0x0406, #   CYRILLIC CAPITAL LETTER BYELORUSSIAN-UKRAINIAN I
>     0x00b7: 0x0407, #   CYRILLIC CAPITAL LETTER YI (UKRAINIAN)
>     0x00bd: 0x0490, #   CYRILLIC CAPITAL LETTER UKRAINIAN GHE WITH UPTURN
> })
> 
> With all these cross-references gone, the decoding_maps could also go.

Why should koi8_u.py be defined in terms of koi8_r.py anyway? Why not put 
a complete decoding_table into koi8_u.py?

I'd like to suggest a small cosmetic change: gencodec.py should output 
byte values with two hexdigits instead of four. This makes it easier to 
see what is a byte value and what is a codepoint. And it would make 
grepping for stuff simpler.

I.e. change:

decoding_map.update({
 0x0080: 0x0402, #  CYRILLIC CAPITAL LETTER DJE

to

decoding_map.update({
 0x80: 0x0402, #  CYRILLIC CAPITAL LETTER DJE

and

decoding_table = (
 u'\x00' #  0x0000 -> NULL

to

decoding_table = (
 u'\x00' # 0x00 -> U+0000 NULL

and

encoding_map = {
 0x0000: 0x0000, #  NULL

to

encoding_map = {
 0x0000: 0x00, #  NULL
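
The proposed output format can be sketched as follows. This is only an illustration, not gencodec.py's actual code: the helper names are hypothetical, it is written for Python 3, and it takes character names from unicodedata (falling back to a placeholder for unnamed characters such as controls), whereas the real generator reads names from the mapping files.

```python
import unicodedata

def format_decoding_entry(byte, codepoint):
    """Format one decoding_map line: two hex digits for the byte,
    four for the code point (hypothetical helper)."""
    name = unicodedata.name(chr(codepoint), "<unnamed>")
    return "0x%02x: 0x%04x, #  %s" % (byte, codepoint, name)

def format_table_comment(byte, codepoint):
    """Format the comment for one decoding_table line, using the
    U+XXXX notation for the code point (hypothetical helper)."""
    name = unicodedata.name(chr(codepoint), "<unnamed>")
    return "# 0x%02x -> U+%04X %s" % (byte, codepoint, name)

entry = format_decoding_entry(0x80, 0x0402)
comment = format_table_comment(0x80, 0x0402)
```

With two-digit byte values, a search for "0x80:" finds only keys that are byte values, never code points.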


Re: [Python-Dev] New codecs checked in

2005-10-24 Thread M.-A. Lemburg
Walter Dörwald wrote:
> Martin v. Löwis wrote:
> 
>> M.-A. Lemburg wrote:
>>
>>> I've checked in a whole bunch of newly generated codecs
>>> which now make use of the faster charmap decoding variant added
>>> by Walter a short while ago.
>>>
>>> Please let me know if you find any problems.
>>
>>
>> I think we should work on eliminating the decoding_map variables.
>> There are some codecs which rely on them being present in other codecs
>> (e.g. koi8_u.py is based on koi8_r.py); however, this could be updated
>> to use, say
>>
>> decoding_table = codecs.update_decoding_map(koi8_r.decoding_table, {
>>     0x00a4: 0x0454, #   CYRILLIC SMALL LETTER UKRAINIAN IE
>>     0x00a6: 0x0456, #   CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
>>     0x00a7: 0x0457, #   CYRILLIC SMALL LETTER YI (UKRAINIAN)
>>     0x00ad: 0x0491, #   CYRILLIC SMALL LETTER UKRAINIAN GHE WITH UPTURN
>>     0x00b4: 0x0404, #   CYRILLIC CAPITAL LETTER UKRAINIAN IE
>>     0x00b6: 0x0406, #   CYRILLIC CAPITAL LETTER BYELORUSSIAN-UKRAINIAN I
>>     0x00b7: 0x0407, #   CYRILLIC CAPITAL LETTER YI (UKRAINIAN)
>>     0x00bd: 0x0490, #   CYRILLIC CAPITAL LETTER UKRAINIAN GHE WITH UPTURN
>> })
>>
>> With all these cross-references gone, the decoding_maps could also go.

I just left them in because I thought they wouldn't do any harm
and might be useful in some applications.

Removing them where not directly needed by the codec would not
be a problem.

> Why should koi8_u.py be defined in terms of koi8_r.py anyway? Why not put
> a complete decoding_table into koi8_u.py?

KOI8-U is not available as mapping on ftp.unicode.org and
I only recreated codecs from the mapping files available
there.

> I'd like to suggest a small cosmetic change: gencodec.py should output
> byte values with two hexdigits instead of four. This makes it easier to
> see what is a byte value and what is a codepoint. And it would make
> grepping for stuff simpler.

True.

I'll rerun the creation with the above changes sometime this
week.

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] New codecs checked in

2005-10-24 Thread Walter Dörwald
M.-A. Lemburg wrote:

> Walter Dörwald wrote:
> 
>>Martin v. Löwis wrote:
>>
>>>M.-A. Lemburg wrote:
>>>
>>>>I've checked in a whole bunch of newly generated codecs
>>>>which now make use of the faster charmap decoding variant added
>>>>by Walter a short while ago.
>>>>
>>>>Please let me know if you find any problems.
>>>
>>>I think we should work on eliminating the decoding_map variables.
>>>There are some codecs which rely on them being present in other codecs
>>>(e.g. koi8_u.py is based on koi8_r.py); however, this could be updated
>>>to use, say
>>>
>>>decoding_table = codecs.update_decoding_map(koi8_r.decoding_table, {
>>>    0x00a4: 0x0454, #   CYRILLIC SMALL LETTER UKRAINIAN IE
>>>    0x00a6: 0x0456, #   CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
>>>    0x00a7: 0x0457, #   CYRILLIC SMALL LETTER YI (UKRAINIAN)
>>>    0x00ad: 0x0491, #   CYRILLIC SMALL LETTER UKRAINIAN GHE WITH UPTURN
>>>    0x00b4: 0x0404, #   CYRILLIC CAPITAL LETTER UKRAINIAN IE
>>>    0x00b6: 0x0406, #   CYRILLIC CAPITAL LETTER BYELORUSSIAN-UKRAINIAN I
>>>    0x00b7: 0x0407, #   CYRILLIC CAPITAL LETTER YI (UKRAINIAN)
>>>    0x00bd: 0x0490, #   CYRILLIC CAPITAL LETTER UKRAINIAN GHE WITH UPTURN
>>>})
>>>
>>>With all these cross-references gone, the decoding_maps could also go.
> 
> I just left them in because I thought they wouldn't do any harm
> and might be useful in some applications.
> 
> Removing them where not directly needed by the codec would not
> be a problem.

Recreating them is quite simple via dict(enumerate(decoding_table)) so I 
think we should remove them.
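
A minimal sketch of that recreation, assuming Python 3 and the koi8_r module from the standard library's encodings package. Note that dict(enumerate(...)) maps byte values to characters; the second dictionary converts them to code points, which is an assumption about what callers of the old decoding_map would want.

```python
from encodings import koi8_r

# decoding_table is a 256-character string indexed by byte value.
decoding_table = koi8_r.decoding_table

# Byte value -> character, as suggested above.
char_map = dict(enumerate(decoding_table))

# Byte value -> code point, closer to the shape of the old
# decoding_map dictionaries.
decoding_map = {byte: ord(char) for byte, char in char_map.items()}
```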

>>Why should koi8_u.py be defined in terms of koi8_r.py anyway? Why not put
>>a complete decoding_table into koi8_u.py?
> 
> KOI8-U is not available as mapping on ftp.unicode.org and
> I only recreated codecs from the mapping files available
> there.

OK, so we'd need something that creates a new decoding table from an old 
one + changes, i.e. something like:

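def update_decoding_table(table, new):
    # Copy the table into a mutable list of characters
    table = list(table)
    for (key, value) in new.iteritems():
        # key is the byte value, value is the new codepoint
        table[key] = unichr(value)
    return u"".join(table)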
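
For readers on Python 3, the same idea can be sketched as below (chr instead of unichr, items instead of iteritems). The constant name KOI8_U_CHANGES is illustrative; the eight entries are the ones listed earlier in the thread, and koi8_r.decoding_table is taken from the standard library.

```python
from encodings import koi8_r

def update_decoding_table(table, new):
    """Return a copy of `table` with the byte -> code point pairs in `new` applied."""
    chars = list(table)
    for byte, codepoint in new.items():
        chars[byte] = chr(codepoint)
    return "".join(chars)

# The positions where KOI8-U differs from KOI8-R, as listed in the thread.
KOI8_U_CHANGES = {
    0xA4: 0x0454,  # CYRILLIC SMALL LETTER UKRAINIAN IE
    0xA6: 0x0456,  # CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
    0xA7: 0x0457,  # CYRILLIC SMALL LETTER YI (UKRAINIAN)
    0xAD: 0x0491,  # CYRILLIC SMALL LETTER UKRAINIAN GHE WITH UPTURN
    0xB4: 0x0404,  # CYRILLIC CAPITAL LETTER UKRAINIAN IE
    0xB6: 0x0406,  # CYRILLIC CAPITAL LETTER BYELORUSSIAN-UKRAINIAN I
    0xB7: 0x0407,  # CYRILLIC CAPITAL LETTER YI (UKRAINIAN)
    0xBD: 0x0490,  # CYRILLIC CAPITAL LETTER UKRAINIAN GHE WITH UPTURN
}

koi8_u_table = update_decoding_table(koi8_r.decoding_table, KOI8_U_CHANGES)
```

The function returns a new string and leaves the KOI8-R table untouched, so the two codecs cannot affect each other.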

>>I'd like to suggest a small cosmetic change: gencodec.py should output
>>byte values with two hexdigits instead of four. This makes it easier to
>>see what is a byte value and what is a codepoint. And it would make
>>grepping for stuff simpler.
> 
> True.
> 
> I'll rerun the creation with the above changes sometime this
> week.

Great, thanks!

Bye,
Walter Dörwald


Re: [Python-Dev] New codecs checked in

2005-10-24 Thread M.-A. Lemburg
Walter Dörwald wrote:
>>>I'd like to suggest a small cosmetic change: gencodec.py should output
>>>byte values with two hexdigits instead of four. This makes it easier to
>>>see what is a byte value and what is a codepoint. And it would make
>>>grepping for stuff simpler.
>>
>>True.
>>
>>I'll rerun the creation with the above changes sometime this
>>week.
> 
> 
> Great, thanks!

Done.

I had to create three custom mapping files for cp1140, koi8-u
and tis-620.

If you want more non-standard charmap codecs converted, please
send me the mapping files in the Unicode standard format for
these files.

Thanks,
-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] New codecs checked in

2005-10-24 Thread Martin v. Löwis
Walter Dörwald wrote:
> Why should koi8_u.py be defined in terms of koi8_r.py anyway? Why not put 
> a complete decoding_table into koi8_u.py?

Not sure. Unfortunately, the tables being used as source are not part of
the Python source, so nobody except MAL can faithfully regenerate them.
If they were part of the Python source, explicitly adding one for
KOI8-U would certainly be feasible.

> I.e. change:
> 
> decoding_map.update({
> 0x0080: 0x0402, #  CYRILLIC CAPITAL LETTER DJE

Hmm. I was suggesting to remove decoding_map completely, in which
case neither the current form nor your suggested cosmetic change
would survive.

> to
> 
> decoding_table = (
> u'\x00' # 0x00 -> U+0000 NULL

Using U+ in comments to denote the codepoints is a good idea,
anyway.

Regards,
Martin


Re: [Python-Dev] New codecs checked in

2005-10-24 Thread Martin v. Löwis
M.-A. Lemburg wrote:
> I just left them in because I thought they wouldn't do any harm
> and might be useful in some applications.
> 
> Removing them where not directly needed by the codec would not
> be a problem.

I think the memory they use is measurable (I estimated 4KiB per
dictionary). More importantly, people apparently currently change
the dictionaries we provide and expect the codecs to automatically
pick up the modified mappings. It would be better if the breakage
is explicit (i.e. they get an AttributeError on the variable) instead
of implicit (their changes to the mapping simply have no effect
anymore).
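
The implicit breakage can be illustrated with a small sketch. It assumes Python 3 and uses a hypothetical identity mapping in place of a real codec module; codecs.charmap_decode accepts either a table string or a dictionary as its mapping argument.

```python
import codecs

# Hypothetical stand-in for a codec module: an identity (Latin-1-like) mapping.
decoding_map = {byte: byte for byte in range(256)}

# The table is derived from the dictionary once, e.g. when the codec is generated.
decoding_table = "".join(chr(decoding_map[byte]) for byte in range(256))

# An application later patches the dictionary, expecting the codec to follow.
decoding_map[0x41] = 0x0410  # CYRILLIC CAPITAL LETTER A

# A codec that decodes through the table does not see the patch ...
via_table, _ = codecs.charmap_decode(b"A", "strict", decoding_table)
# ... while one that still decodes through the dictionary does.
via_map, _ = codecs.charmap_decode(b"A", "strict", decoding_map)
```

Here via_table is still the Latin letter while via_map is the Cyrillic one, and nothing signals the difference. If the dictionary were removed from the module, the patching line would fail with an error instead.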

> KOI8-U is not available as mapping on ftp.unicode.org and
> I only recreated codecs from the mapping files available
> there.

I think we should come up with mapping tables for the additional
codecs as well, and maintain them in the CVS. This also applies
to things like rot13.

> I'll rerun the creation with the above changes sometime this
> week.

I hope I can finish my encoding routine shortly, which again
results in changes to the codecs (replacing the encoding dictionaries
with other lookup tables).

Regards,
Martin


Re: [Python-Dev] New codecs checked in

2005-10-24 Thread Martin v. Löwis
M.-A. Lemburg wrote:

> I had to create three custom mapping files for cp1140, koi8-u
> and tis-620.

Can you please publish the files you have used somewhere? They
would best go into the Python CVS.

Regards,
Martin



Re: [Python-Dev] New codecs checked in

2005-10-25 Thread M.-A. Lemburg
Martin v. Löwis wrote:
> M.-A. Lemburg wrote:
> 
> 
>>I had to create three custom mapping files for cp1140, koi8-u
>>and tis-620.
> 
> 
> Can you please publish the files you have used somewhere? They
> best go into the Python CVS.

Sure; I'll check in the whole build machinery I'm using for this.

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] New codecs checked in

2005-10-25 Thread M.-A. Lemburg
Martin v. Löwis wrote:
> M.-A. Lemburg wrote:
> 
>>I just left them in because I thought they wouldn't do any harm
>>and might be useful in some applications.
>>
>>Removing them where not directly needed by the codec would not
>>be a problem.
> 
> 
> I think memory usage caused is measurable (I estimated 4KiB per
> dictionary). More importantly, people apparently currently change
> the dictionaries we provide and expect the codecs to automatically
> pick up the modified mappings. It would be better if the breakage
> is explicit (i.e. they get an AttributeError on the variable) instead
> of implicit (their changes to the mapping simply have no effect
> anymore).

Agreed. I've already checked in the changes, BTW.

>>KOI8-U is not available as mapping on ftp.unicode.org and
>>I only recreated codecs from the mapping files available
>>there.
> 
> 
> I think we should come up with mapping tables for the additional
> codecs as well, and maintain them in the CVS. This also applies
> to things like rot13.

Agreed.

>>I'll rerun the creation with the above changes sometime this
>>week.
> 
> 
> I hope I can finish my encoding routine shortly, which again
> results in changes to the codecs (replacing the encoding dictionaries
> with other lookup tables).

Having seen the decode tables written as a long Unicode string,
I think that this may indeed also be a good solution for
encoding - the major improvement here is that the parser
and compiler will do the work of creating the table. At
module load time, the .pyc file will only contain a long
string which is very fast to create and load (unlike dictionaries
which are set up dynamically at load time).

In general, it's better to do all the work up-front when
creating the codecs, rather than having run-time code
repeat these tasks over and over again.
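
For comparison, later CPython releases derive the encoding lookup table from the decoding string with codecs.charmap_build. Whether that is exactly the routine referred to in this thread is an assumption; the sketch below, written for Python 3, only shows how such a table is built once and then used for encoding.

```python
import codecs
from encodings import koi8_r

# Build the encoding lookup structure once from the 256-character decoding string.
encoding_table = codecs.charmap_build(koi8_r.decoding_table)

text = "Привет"

# Encode through the prebuilt table, then decode through the string table.
encoded, consumed = codecs.charmap_encode(text, "strict", encoding_table)
decoded, _ = codecs.charmap_decode(encoded, "strict", koi8_r.decoding_table)
```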

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] New codecs checked in

2005-10-25 Thread M.-A. Lemburg
M.-A. Lemburg wrote:
> Martin v. Löwis wrote:
> 
>>M.-A. Lemburg wrote:
>>
>>
>>
>>>I had to create three custom mapping files for cp1140, koi8-u
>>>and tis-620.
>>
>>
>>Can you please publish the files you have used somewhere? They
>>best go into the Python CVS.
> 
> 
> Sure; I'll check in the whole build machinery I'm using for this.

Done.

To rebuild the codecs, run "cd Tools/unicode; make", then check
the codecs in the created build/ subdir (e.g. using
comparecodecs.py) and copy them over to the Lib/encodings/
directory.

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] New codecs checked in

2005-10-25 Thread Martin v. Löwis
M.-A. Lemburg wrote:

> Done.
> 
> In order to rebuild the codecs, cd Tools/unicode; make
> then check the codecs in the created build/ subdir (e.g.
> using comparecodecs.py) and copy them over to the
> Lib/encodings/ directory.

Thanks!

Martin