Is there a faster way to transcode from 8-bit chars (charmaps) to utf-8 than going through unicode()?
I'm writing a small card-file program. As a test, I use a 53 MB MBox file in mac-roman encoding. My program reads and parses the file into messages in about 3 to 5 seconds (Wow! Go Python!), but takes about 14 seconds to iterate over the cards and convert them to utf-8:

    for i in xrange(len(cards)):
        u = unicode(cards[i], encoding)
        cards[i] = u.encode('utf-8')

The time is nearly all in the unicode() call. It's not so much how much time it takes, but that it takes 4 times as long as the real work, just to do table lookups.

Looking at the source (which, if I have it right, is PyUnicode_DecodeCharmap() in unicodeobject.c), I think it is doing a dictionary lookup for each character. I would have thought that it would make and cache a LUT the size of the charmap (and hook the relevant dictionary stuff to delete the cached LUT if the dictionary is changed). (You may consider this a request for enhancement. ;)

I thought of using u"".translate(), but the unicode version is defined to be slow, and anyway I can't find any way to just shove my 8-bit data into a unicode string without translation. Is there some similar approach? I'm almost (but not quite) ready to try it in Pyrex.

I'm new to Python. I didn't find anything relevant by googling python.org or the groups. I posted this in comp.lang.python yesterday and got a couple of responses, but I think this may be too sophisticated a question for that group.

I'm not a member of this list, so please copy me on replies so I don't have to hunt them down in the archive.

____________________________________________________________________
TonyN.:' <mailto:[EMAIL PROTECTED]>
      ' <http://www.georgeanelson.com/>
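
To illustrate the LUT idea described above, here is a minimal pure-Python sketch, not the interpreter's actual implementation. It decodes each of the 256 possible byte values once, stores the resulting UTF-8 bytes in a list, and then transcodes by table lookup. The names build_utf8_table and transcode are hypothetical, the use of bytearray assumes Python 2.6 or later (it also runs on Python 3), and mapping undefined bytes to U+FFFD via errors='replace' is an assumption. Whether this beats unicode() would have to be measured.

```python
def build_utf8_table(encoding):
    """Return a 256-entry list: table[b] is the UTF-8 form of byte value b."""
    table = []
    for b in range(256):
        # bytes(bytearray([b])) gives a one-byte string on Python 2.6+ and 3.
        one_byte = bytes(bytearray([b]))
        # Assumption: bytes the charmap leaves undefined become U+FFFD.
        ch = one_byte.decode(encoding, 'replace')
        table.append(ch.encode('utf-8'))
    return table


def transcode(data, table):
    """Convert an 8-bit byte string to UTF-8 using the precomputed table."""
    # bytearray(data) yields integers on both Python 2.6+ and 3.
    return b''.join([table[b] for b in bytearray(data)])
```

The table would be built once per encoding, for example table = build_utf8_table('mac_roman'), and then reused for every card via cards[i] = transcode(cards[i], table).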