[Recipient list not trimmed, as my replies must be vetted by a moderator, which seems to delay them. :]
At 11:48 PM +0200 10/4/05, Walter Dörwald wrote:
>On 04.10.2005 at 21:50, Martin v. Löwis wrote:
>
>> Walter Dörwald wrote:
>>
>>> For charmap decoding we might be able to use an array (e.g. a
>>> tuple (or an array.array?) of codepoints instead of dictionary.
>>>
>>
>> This array would have to be sparse, of course.
>
>For encoding yes, for decoding no.
>
>> Using an array.array would be more efficient, I guess - but we would
>> need a C API for arrays (to validate the type code, and to get ob_item).
>
>For decoding it should be sufficient to use a unicode string of
>length 256. u"\ufffd" could be used for "maps to undefined". Or the
>string might be shorter and byte values greater than the length of
>the string are treated as "maps to undefined" too.

With Unicode now using more than 64K codepoints, it might be more
forward-looking to use a table of 256 32-bit values, with no need for
tricky values. There is no need to add any C code to the codecs; just
add some more code to the existing C function (which, if I have it
right, is PyUnicode_DecodeCharmap() in unicodeobject.c).

...

>> For encoding, having a C trie might give considerable speedup. _codecs
>> could offer an API to convert the current dictionaries into
>> lookup-efficient structures, and the conversion would be done when
>> importing the codec.
>>
>> For the trie, two levels (higher and lower byte) would probably be
>> sufficient: I believe most encodings only use 2 "rows" (256 code
>> point blocks), very few more than three.
>
>This might work, although nobody has complained about charmap
>encoding yet. Another option would be to generate a big switch
>statement in C and let the compiler decide about the best data
>structure.

I'm willing to complain. :)  Faster encoding might allow saving of my
(53 MB) MBox file.  (Not that editing received mail makes as much sense
as searching it.)

Encoding can be made fast using a simple hash table with external
chaining. There are at most 256 codepoints to encode, and they will
normally be well distributed in their lower 8 bits. Hash on the low 8
bits (just mask), and chain to an area with 256 entries. Modest storage
and normally short chains, therefore fast encoding.

At 12:08 AM +0200 10/5/05, Martin v. Löwis wrote:
>I would try to avoid generating C code at all costs. Maintaining the
>build processes will just be a nightmare.

I agree; also, I don't think the generated codecs need to be changed at
all. All the changes can be made to the existing C functions, by adding
caching per a reply of mine that hasn't made it to the list yet. Well,
OK, something needs to hook writes to the codec's mapping dictionary,
but I /think/ that only needs Python code.

I said:
>...I suggest instead just /caching/ the translation in C arrays stored
>with the codec object. The cache would be invalidated on any write to the
>codec's mapping dictionary, and rebuilt the next time anything was
>translated. This would maintain the present semantics, work with current
>codecs, and still provide the desired speed improvement.

Note that this caching is done by new code added to the existing C
functions (which, if I have it right, are in unicodeobject.c).
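To make the idea concrete, here is a rough standalone sketch of the two
per-codec caches I have in mind. All names are made up for illustration
(this is not the actual unicodeobject.c code), and the chain links are
widened to 16 bits here for readability; the fully packed one-byte
version is sketched after the size estimate below.

    #include <stdint.h>
    #include <string.h>

    #define UNMAPPED 0xFFFFFFFFu        /* out-of-band "maps to undefined" */

    /* Decode cache: byte value -> 32-bit code point, 256 entries (1 KB);
       UNMAPPED marks bytes with no mapping. */
    typedef struct {
        uint32_t to_unicode[256];
    } decode_cache;

    /* Encode cache: hash table with external chaining.  The "hash" is
       just the low 8 bits of the code point; at most 256 entries can
       ever exist, so the chain area is a fixed 256 entries too. */
    typedef struct {
        uint32_t codepoint;             /* key                           */
        uint8_t  byte;                  /* output byte value             */
        int16_t  next;                  /* chain link, -1 = end of chain */
    } encode_entry;

    typedef struct {
        int16_t      head[256];         /* bucket -> first entry, -1 = empty     */
        encode_entry chain[256];
        int          count;             /* chain entries used so far             */
        int          valid;             /* cleared when the mapping dict changes */
    } encode_cache;

    static void encode_cache_clear(encode_cache *c)
    {
        memset(c->head, 0xFF, sizeof(c->head));   /* all buckets empty (-1) */
        c->count = 0;
        c->valid = 0;
    }

    /* Rebuild step: add one (code point -> byte) pair taken from the
       codec's mapping dictionary.  Never called more than 256 times
       between clears, so the chain area cannot overflow. */
    static void encode_cache_add(encode_cache *c, uint32_t cp, uint8_t byte)
    {
        int slot = (int)(cp & 0xFF);              /* the hash: just mask     */
        encode_entry *e = &c->chain[c->count];
        e->codepoint = cp;
        e->byte = byte;
        e->next = c->head[slot];                  /* link in front of bucket */
        c->head[slot] = (int16_t)c->count++;
    }

    /* Lookup: 0 on success, -1 if the code point has no mapping. */
    static int encode_cache_lookup(const encode_cache *c, uint32_t cp,
                                   uint8_t *out)
    {
        int16_t i;
        for (i = c->head[cp & 0xFF]; i != -1; i = c->chain[i].next) {
            if (c->chain[i].codepoint == cp) {
                *out = c->chain[i].byte;
                return 0;
            }
        }
        return -1;
    }

PyUnicode_DecodeCharmap() would read to_unicode[] directly, and the
charmap encoder (PyUnicode_EncodeCharmap(), if I have that name right)
would try encode_cache_lookup() first, rebuilding the cache from the
mapping dictionary whenever valid is 0 and falling back to the present
dictionary lookup for anything it cannot answer.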
No architectural changes are made; no existing codecs need to be
changed; everything will just work, and usually work faster, with very
modest memory requirements: one 256-entry array of 32-bit Unicode
values (1024 bytes) for decoding, plus an encoding hash table with 256
one-byte slots and 256 chain entries, each chain entry holding a 4-byte
Unicode value, a one-byte output value, a one-byte chain index, and
probably 2 bytes of filler, for a hash table size of 256 + 256 * 8 =
2304 bytes per codec.
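To double-check that arithmetic, the packed table could look like this
(again with invented names; the C11 static assertion is only there to
verify the size, and assumes the compiler adds no unexpected padding):

    #include <stdint.h>

    typedef struct {
        uint32_t codepoint;      /* 4-byte Unicode value          */
        uint8_t  byte;           /* output byte value             */
        uint8_t  next;           /* one-byte chain index          */
        uint8_t  pad[2];         /* filler, keeps entries 8 bytes */
    } packed_entry;

    typedef struct {
        uint8_t      head[256];  /* 256 one-byte hash slots        */
        packed_entry chain[256]; /* 256 * 8 = 2048 bytes of chains */
    } packed_encode_cache;

    _Static_assert(sizeof(packed_encode_cache) == 2304,
                   "expected 2304 bytes per codec");

One wrinkle with one-byte links is that all 256 index values are legal,
so "end of chain" needs a convention such as an entry linking to
itself; the wider links in the sketch above sidestep that.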