Re: [Python-Dev] Unicode charmap decoders slow

2005-10-17 Thread Tony Nelson
At 11:56 AM +0200 10/16/05, Martin v. Löwis wrote: >Tony Nelson wrote: >> BTW, Martin, if you care to, would you explain to me how a Trie would be >> used for charmap encoding? I know a couple of approaches, but I don't know >> how to do it fast. (I've never actually had the occasion to use a Tri

Re: [Python-Dev] Unicode charmap decoders slow

2005-10-16 Thread Martin v. Löwis
Tony Nelson wrote: > Umm, 0 (NUL) is a valid output character in most of the 8-bit character > sets. It could be handled by having a separate "exceptions" string of the > unicode code points that actually map to the exception char. Yes. But only U+0000 should normally map to 0. It could be specia

Re: [Python-Dev] Unicode charmap decoders slow

2005-10-16 Thread Martin v. Löwis
Tony Nelson wrote: > BTW, Martin, if you care to, would you explain to me how a Trie would be > used for charmap encoding? I know a couple of approaches, but I don't know > how to do it fast. (I've never actually had the occasion to use a Trie.) I currently envision a three-level trie, with 5, 4
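
The preview is cut off before the bit split, so Martin's exact layout isn't shown. Purely as a rough pure-Python sketch of the general idea (the 7/5/4-bit split and the helper names below are illustrative assumptions, not Martin's design), a three-level trie maps a BMP code point to its encoded byte while storing only the populated second- and third-level blocks:

    def build_trie(encoding_map):
        # encoding_map: {unicode ordinal: encoded byte value}
        level1 = [None] * 128               # indexed by the top 7 bits
        level2 = []                         # blocks of 32 slots (middle 5 bits)
        level3 = []                         # blocks of 16 slots (low 4 bits)
        for cp, byte in encoding_map.items():
            i1, i2, i3 = cp >> 9, (cp >> 4) & 0x1F, cp & 0x0F
            if level1[i1] is None:
                level1[i1] = len(level2)
                level2.extend([None] * 32)
            slot = level1[i1] + i2
            if level2[slot] is None:
                level2[slot] = len(level3)
                level3.extend([None] * 16)
            level3[level2[slot] + i3] = byte
        return level1, level2, level3

    def trie_lookup(trie, cp):
        level1, level2, level3 = trie
        block2 = level1[cp >> 9]
        if block2 is None:
            return None                     # nothing encodable in this range
        block3 = level2[block2 + ((cp >> 4) & 0x1F)]
        if block3 is None:
            return None
        return level3[block3 + (cp & 0x0F)] # byte value, or None if unmapped

The attraction is that a lookup is just three indexed loads per character, with no hashing and no Python-level dictionary probe.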

Re: [Python-Dev] Unicode charmap decoders slow

2005-10-15 Thread Tony Nelson
I have put up a new, packaged version of my fast charmap module at . Hopefully it is packaged properly and works properly (it works on my FC3 Python 2.3.4 system). This version is over 5 times faster than the base codec according to Hye-Shik Chang's benchmar

Re: [Python-Dev] Unicode charmap decoders slow

2005-10-14 Thread Martin v. Löwis
Walter Dörwald wrote: > Of course we can't accept Pyrex code in the Python core, so it would be > great to rewrite the encoder as a patch to PyUnicode_EncodeCharmap(). > This version must be able to cope with encoding tables that are random > strings without crashing. I don't think this will be

Re: [Python-Dev] Unicode charmap decoders slow

2005-10-14 Thread M.-A. Lemburg
Walter Dörwald wrote: > We've already taken care of decoding. What we still need is a new > gencodec.py and regenerated codecs. I'll take care of that; just haven't gotten around to it yet. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Oct 14 2005

Re: [Python-Dev] Unicode charmap decoders slow

2005-10-14 Thread Walter Dörwald
Martin v. Löwis wrote: > Tony Nelson wrote: > >> I have written my fastcharmap decoder and encoder. It's not meant to be >> better than the patch and other changes to come in a future version of >> Python, but it does work now with the current codecs. > > It's an interesting solution. I like t

Re: [Python-Dev] Unicode charmap decoders slow

2005-10-14 Thread Martin v. Löwis
Tony Nelson wrote: > I have written my fastcharmap decoder and encoder. It's not meant to be > better than the patch and other changes to come in a future version of > Python, but it does work now with the current codecs. It's an interesting solution. > To use, hook each codec to be sped up: >
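
The usage snippet is cut off in the preview, and fastcharmap's real interface isn't reproduced here. Purely as a generic illustration of how a replacement codec can be interposed through the standard codecs machinery (the codec name 'mac_roman_fast' and the helpers are hypothetical, not fastcharmap's API):

    import codecs

    # Build the 256-entry lookup string once, by decoding every byte value
    # with the stock codec (mac-roman defines all 256 bytes, so this is safe).
    _table = ''.join(map(chr, range(256))).decode('mac_roman')

    def _fast_decode(input, errors='strict'):
        # Table-driven decode; a real replacement would also honour 'errors'.
        return u''.join([_table[ord(c)] for c in input]), len(input)

    def _search(name):
        if name == 'mac_roman_fast':        # hypothetical codec name
            enc, dec, reader, writer = codecs.lookup('mac_roman')
            return (enc, _fast_decode, reader, writer)
        return None

    codecs.register(_search)
    print repr('caf\x8e'.decode('mac_roman_fast'))   # u'caf\xe9'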

Re: [Python-Dev] Unicode charmap decoders slow

2005-10-13 Thread Tony Nelson
I have written my fastcharmap decoder and encoder. It's not meant to be better than the patch and other changes to come in a future version of Python, but it does work now with the current codecs. Using Hye-Shik Chang's benchmark, decoding is about 4.3x faster than the base, and encoding is about

Re: [Python-Dev] Unicode charmap decoders slow

2005-10-06 Thread Tony Nelson
At 8:36 AM +0200 10/5/05, Martin v. Löwis wrote: >Tony Nelson wrote: ... >> Encoding can be made fast using a simple hash table with external chaining. >> There are max 256 codepoints to encode, and they will normally be well >> distributed in their lower 8 bits. Hash on the low 8 bits (just mask
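
A minimal sketch of the scheme described above, assuming the usual {unicode ordinal: byte value} encoding map; with at most 256 entries spread over 256 buckets the chains stay very short:

    def build_buckets(encoding_map):
        buckets = [[] for i in range(256)]
        for cp, byte in encoding_map.items():
            buckets[cp & 0xFF].append((cp, byte))   # hash = low 8 bits (a mask)
        return buckets

    def encode_char(buckets, cp):
        for candidate, byte in buckets[cp & 0xFF]:  # walk the (short) chain
            if candidate == cp:
                return byte
        return None   # unmappable; a real codec would invoke the error handler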

Re: [Python-Dev] Unicode charmap decoders slow

2005-10-06 Thread Walter Dörwald
M.-A. Lemburg wrote: > [...] >>Or we could have a function that recreates the dictionary from the string. > > Actually, I'd prefer that these operations be done by the > codec generator script, so that we don't have additional > startup time. The dictionaries should then no longer be > generated
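
A sketch of what such a recreation helper could look like, assuming the new-style table is a 256-character unicode string with u"\ufffd" standing for "maps to undefined" (as discussed elsewhere in the thread) and that the old-style decoding_map marked undefined bytes with None:

    def make_decoding_map(decoding_table):
        decoding_map = {}
        for byte in range(len(decoding_table)):
            cp = ord(decoding_table[byte])
            if cp == 0xFFFD:
                decoding_map[byte] = None   # undefined, old dict convention
            else:
                decoding_map[byte] = cp
        return decoding_map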

Re: [Python-Dev] Unicode charmap decoders slow

2005-10-06 Thread Hye-Shik Chang
On 10/6/05, M.-A. Lemburg <[EMAIL PROTECTED]> wrote: > Hye-Shik Chang wrote: > > (encoding, fastmap codec) > > > > % ./python Lib/timeit.py -s "s='a'*53*1024; e='iso8859_10_fc'; > > u=unicode(s, e)" "u.encode(e)" > > 1000 loops, best of 3: 536 usec per loop > > > > (encoding, utf-8 codec) > > > > %

Re: [Python-Dev] Unicode charmap decoders slow

2005-10-06 Thread M.-A. Lemburg
Hye-Shik Chang wrote: > On 10/6/05, M.-A. Lemburg <[EMAIL PROTECTED]> wrote: > >>Hye-Shik, could you please provide some timeit figures for >>the fastmap encoding ? >> Thanks for the timings. > (before applying Walter's patch, charmap decoder) > > % ./python Lib/timeit.py -s "s='a'*53*1024; e='

Re: [Python-Dev] Unicode charmap decoders slow

2005-10-06 Thread M.-A. Lemburg
Walter Dörwald wrote: > Martin v. Löwis wrote: > >> Hye-Shik Chang wrote: >> >>> If the encoding optimization can be easily done in Walter's approach, >>> the fastmap codec would be too expensive a way for the objective because >>> we must maintain not only fastmap but also charmap for backward >>>

Re: [Python-Dev] Unicode charmap decoders slow

2005-10-06 Thread Walter Dörwald
Martin v. Löwis wrote: > Hye-Shik Chang wrote: > >> If the encoding optimization can be easily done in Walter's approach, >> the fastmap codec would be too expensive a way for the objective because >> we must maintain not only fastmap but also charmap for backward >> compatibility. > > IMO, whethe

Re: [Python-Dev] Unicode charmap decoders slow

2005-10-06 Thread Walter Dörwald
Martin v. Löwis wrote: > Walter Dörwald wrote: > >> OK, here's a patch that implements this enhancement to >> PyUnicode_DecodeCharmap(): http://www.python.org/sf/1313939 > > Looks nice! > >> Creating the decoding_map as a string should probably be done by >> gencodec.py directly. This way the

Re: [Python-Dev] Unicode charmap decoders slow

2005-10-06 Thread Martin v. Löwis
Hye-Shik Chang wrote: > If the encoding optimization can be easily done in Walter's approach, > the fastmap codec would be too expensive a way for the objective because > we must maintain not only fastmap but also charmap for backward > compatibility. IMO, whether a new function is added or whether

Re: [Python-Dev] Unicode charmap decoders slow

2005-10-05 Thread Hye-Shik Chang
On 10/6/05, M.-A. Lemburg <[EMAIL PROTECTED]> wrote: > Hye-Shik, could you please provide some timeit figures for > the fastmap encoding ? > (before applying Walter's patch, charmap decoder) % ./python Lib/timeit.py -s "s='a'*53*1024; e='iso8859_10'; u=unicode(s, e)" "s.decode(e)" 100 loops, best

Re: [Python-Dev] Unicode charmap decoders slow

2005-10-05 Thread Martin v. Löwis
Trent Mick wrote: > [Martin v. Loewis wrote] > >>Maybe it is possible to hack up a project file to invoke distutils >>as the build process, but no such project file is currently available, >>nor is it known whether it is possible to create one. > > > This is essentially what the "_ssl" project

Re: [Python-Dev] Unicode charmap decoders slow

2005-10-05 Thread Trent Mick
[Martin v. Loewis wrote] > Maybe it is possible to hack up a project file to invoke distutils > as the build process, but no such project file is currently available, > nor is it known whether it is possible to create one. This is essentially what the "_ssl" project does, no? It defers to "build_

Re: [Python-Dev] Unicode charmap decoders slow

2005-10-05 Thread M.-A. Lemburg
Martin v. Löwis wrote: > M.-A. Lemburg wrote: > >>> It wouldn't. >> >> >> >> Could you elaborate why not ? Using distutils on Windows is really >> easy... > > > The current build process for Windows simply doesn't provide it. > You expect to select "Build/All" from the menu (or some such), > and

Re: [Python-Dev] Unicode charmap decoders slow

2005-10-05 Thread Martin v. Löwis
M.-A. Lemburg wrote: >>It wouldn't. > > > Could you elaborate why not ? Using distutils on Windows is really > easy... The current build process for Windows simply doesn't provide it. You expect to select "Build/All" from the menu (or some such), and expect all code to be compiled. The VC build

Re: [Python-Dev] Unicode charmap decoders slow

2005-10-05 Thread M.-A. Lemburg
Martin v. Löwis wrote: > Walter Dörwald wrote: > >>OK, here's a patch that implements this enhancement to >>PyUnicode_DecodeCharmap(): http://www.python.org/sf/1313939 > > Looks nice! Indeed (except for the choice of the "map this character to undefined" code point). Hye-Shik, could you please

Re: [Python-Dev] Unicode charmap decoders slow

2005-10-05 Thread M.-A. Lemburg
Martin v. Löwis wrote: > M.-A. Lemburg wrote: > >>> I would try to avoid generating C code at all costs. Maintaining the >>> build processes will just be a nightmare. >> >> >> >> We could automate this using distutils; however I'm not sure >> whether this would then also work on Windows. > > > I

Re: [Python-Dev] Unicode charmap decoders slow

2005-10-05 Thread Martin v. Löwis
Walter Dörwald wrote: > OK, here's a patch that implements this enhancement to > PyUnicode_DecodeCharmap(): http://www.python.org/sf/1313939 Looks nice! > Creating the decoding_map as a string should probably be done by > gencodec.py directly. This way the first import of the codec would be >

Re: [Python-Dev] Unicode charmap decoders slow

2005-10-05 Thread Martin v. Löwis
M.-A. Lemburg wrote: >>I would try to avoid generating C code at all costs. Maintaining the >>build processes will just be a nightmare. > > > We could automate this using distutils; however I'm not sure > whether this would then also work on Windows. It wouldn't. Regards, Martin

Re: [Python-Dev] Unicode charmap decoders slow

2005-10-05 Thread M.-A. Lemburg
Hye-Shik Chang wrote: > On 10/5/05, M.-A. Lemburg <[EMAIL PROTECTED]> wrote: > >>Of course, a C version could use the same approach as >>the unicodedatabase module: that of compressed lookup >>tables... >> >>http://aggregate.org/TechPub/lcpc2002.pdf >> >>genccodec.py anyone ? >> > > > I

Re: [Python-Dev] Unicode charmap decoders slow

2005-10-05 Thread Walter Dörwald
Martin v. Löwis wrote: > Tony Nelson wrote: > >>> For decoding it should be sufficient to use a unicode string of >>> length 256. u"\ufffd" could be used for "maps to undefined". Or the >>> string might be shorter and byte values greater than the length of >>> the string are treated as "maps to u

Re: [Python-Dev] Unicode charmap decoders slow

2005-10-05 Thread Hye-Shik Chang
On 10/5/05, M.-A. Lemburg <[EMAIL PROTECTED]> wrote: > Of course, a C version could use the same approach as > the unicodedatabase module: that of compressed lookup > tables... > > http://aggregate.org/TechPub/lcpc2002.pdf > > genccodec.py anyone ? > I had written a test codec for single b

Re: [Python-Dev] Unicode charmap decoders slow

2005-10-05 Thread jepler
The function in the module below, xlate.xlate, doesn't quite do what "".decode does. (mostly that characters that don't exist are mapped to u+fffd always, instead of having the various behaviors available to "".decode) It builds the fast decoding structure once per call, but when decoding 53kb of dat

Re: [Python-Dev] Unicode charmap decoders slow

2005-10-05 Thread M.-A. Lemburg
Martin v. Löwis wrote: >>Another option would be to generate a big switch statement in C >>and let the compiler decide about the best data structure. > > I would try to avoid generating C code at all costs. Maintaining the > build processes will just be a nightmare. We could automate this usi

Re: [Python-Dev] Unicode charmap decoders slow

2005-10-05 Thread Walter Dörwald
On 05.10.2005 at 00:08, Martin v. Löwis wrote: > Walter Dörwald wrote: > >>> This array would have to be sparse, of course. >>> >> For encoding yes, for decoding no. >> > [...] > >> For decoding it should be sufficient to use a unicode string of >> length 256. u"\ufffd" could be used for "maps

Re: [Python-Dev] Unicode charmap decoders slow

2005-10-04 Thread Martin v. Löwis
Tony Nelson wrote: > But is there really no way to say this fast in pure Python? The way a > one-to-one byte mapping can be done with "".translate()? Well, .translate isn't exactly pure Python. One-to-one between bytes and Unicode code points simply can't work that way. Just try all alternatives yoursel
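
To make the constraint concrete (not something Martin endorses as a fast path): str.translate maps bytes to bytes, so it cannot produce unicode directly. The nearest pure-Python detour decodes via latin-1 (where byte value equals code point) and then applies unicode.translate with an int-to-int table, at the cost of different error semantics; mac-roman is used here only as an example of a fully defined 8-bit charset:

    # Build an {ordinal: ordinal} table from the stock codec.
    full = ''.join(map(chr, range(256))).decode('mac_roman')
    table = dict([(i, ord(full[i])) for i in range(256)])

    u = 'caf\x8e'.decode('latin-1').translate(table)    # u'caf\xe9'
    # Caveat: mapping a byte to None in such a table silently *deletes* it,
    # unlike a charmap codec's strict error handling, which raises.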

Re: [Python-Dev] Unicode charmap decoders slow

2005-10-04 Thread Martin v. Löwis
Tony Nelson wrote: >>For decoding it should be sufficient to use a unicode string of >>length 256. u"\ufffd" could be used for "maps to undefined". Or the >>string might be shorter and byte values greater than the length of >>the string are treated as "maps to undefined" too. > > > With Unicode u

Re: [Python-Dev] Unicode charmap decoders slow

2005-10-04 Thread Tony Nelson
[Recipient list not trimmed, as my replies must be vetted by a moderator, which seems to delay them. :] At 11:48 PM +0200 10/4/05, Walter Dörwald wrote: >On 04.10.2005 at 21:50, Martin v. Löwis wrote: > >> Walter Dörwald wrote: >> >>> For charmap decoding we might be able to use an array (e.g. a

Re: [Python-Dev] Unicode charmap decoders slow

2005-10-04 Thread Tony Nelson
At 9:37 AM +0200 10/4/05, Walter Dörwald wrote: >On 04.10.2005 at 04:25, [EMAIL PROTECTED] wrote: > >>As the OP suggests, decoding with a codec like mac-roman or iso8859-1 is >>very slow compared to encoding or decoding with utf-8. Here I'm working >>with 53k of data instead of 53 megs. (Note: thi

Re: [Python-Dev] Unicode charmap decoders slow

2005-10-04 Thread Martin v. Löwis
Walter Dörwald wrote: >> This array would have to be sparse, of course. > > > For encoding yes, for decoding no. [...] > For decoding it should be sufficient to use a unicode string of length > 256. u"\ufffd" could be used for "maps to undefined". Or the string > might be shorter and byte val
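
A pure-Python sketch of the idea being quoted: turn the old dict-style decoding_map into a 256-character unicode string and decode by direct indexing, with u"\ufffd" as the "maps to undefined" marker (reusing a real code point for this is the debatable choice noted elsewhere in the thread):

    def make_decoding_table(decoding_map):
        chars = []
        for byte in range(256):
            cp = decoding_map.get(byte)
            if cp is None:
                chars.append(u'\ufffd')     # "maps to undefined"
            else:
                chars.append(unichr(cp))
        return u''.join(chars)

    def charmap_decode(data, table, errors='strict'):
        result = []
        for pos in range(len(data)):
            u = table[ord(data[pos])]
            if u == u'\ufffd' and errors == 'strict':
                raise UnicodeDecodeError('charmap', data, pos, pos + 1,
                                         'character maps to <undefined>')
            result.append(u)
        return u''.join(result)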

Re: [Python-Dev] Unicode charmap decoders slow

2005-10-04 Thread Walter Dörwald
On 04.10.2005 at 21:50, Martin v. Löwis wrote: > Walter Dörwald wrote: > >> For charmap decoding we might be able to use an array (e.g. a >> tuple (or an array.array?) of codepoints) instead of a dictionary. >> > > This array would have to be sparse, of course. For encoding yes, for decoding no.

Re: [Python-Dev] Unicode charmap decoders slow

2005-10-04 Thread M.-A. Lemburg
Walter Dörwald wrote: > On 04.10.2005 at 04:25, [EMAIL PROTECTED] wrote: > > >>As the OP suggests, decoding with a codec like mac-roman or >>iso8859-1 is very >>slow compared to encoding or decoding with utf-8. Here I'm working >>with 53k of >>data instead of 53 megs. (Note: this is a lapt

Re: [Python-Dev] Unicode charmap decoders slow

2005-10-04 Thread Martin v. Löwis
Walter Dörwald wrote: > For charmap decoding we might be able to use an array (e.g. a tuple > (or an array.array?) of codepoints) instead of a dictionary. This array would have to be sparse, of course. Using an array.array would be more efficient, I guess - but we would need a C API for arrays (to

Re: [Python-Dev] Unicode charmap decoders slow

2005-10-04 Thread Walter Dörwald
On 04.10.2005 at 04:25, [EMAIL PROTECTED] wrote: > As the OP suggests, decoding with a codec like mac-roman or > iso8859-1 is very > slow compared to encoding or decoding with utf-8. Here I'm working > with 53k of > data instead of 53 megs. (Note: this is a laptop, so it's possible > that

Re: [Python-Dev] Unicode charmap decoders slow

2005-10-03 Thread jepler
As the OP suggests, decoding with a codec like mac-roman or iso8859-1 is very slow compared to encoding or decoding with utf-8. Here I'm working with 53k of data instead of 53 megs. (Note: this is a laptop, so it's possible that thermal or battery management features affected these numbers a bit,
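
The figures are cut off in the preview; a minimal way to reproduce this kind of comparison, mirroring the 53k-of-'a' setup used in the timeit one-liners quoted elsewhere in the thread (absolute numbers are machine-dependent):

    import timeit

    setup = "s = 'a' * 53 * 1024"
    for codec in ('mac_roman', 'iso8859_1', 'utf_8'):
        t = timeit.Timer("s.decode(%r)" % codec, setup)
        per_loop = min(t.repeat(3, 100)) / 100
        print "%-10s %.1f usec per loop" % (codec, per_loop * 1e6)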

[Python-Dev] Unicode charmap decoders slow

2005-10-03 Thread Tony Nelson
Is there a faster way to transcode from 8-bit chars (charmaps) to utf-8 than going through unicode()? I'm writing a small card-file program. As a test, I use a 53 MB MBox file, in mac-roman encoding. My program reads and parses the file into messages in about 3 to 5 seconds (Wow! Go Python!), but
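
The transcoding path in question, spelled out (the file name is only a placeholder); this decode-then-encode round trip is the step the rest of the thread tries to speed up:

    data = open('cards.mbox', 'rb').read()              # hypothetical mac-roman mbox
    utf8 = unicode(data, 'mac_roman').encode('utf-8')
    # equivalently: data.decode('mac_roman').encode('utf-8')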