Victor Stinner wrote:
> On Wednesday, 25 May 2011 at 11:38 +0200, M.-A. Lemburg wrote:
>> You are missing the point: we have StreamReader and StreamWriter APIs
>> on codecs to allow each codec to implement more efficient ways of
>> encoding and decoding streams.
>>
>> Examples of such optimizations are reading the stream in
>> chunks that can be decoded in one piece, or writing to the stream
>> in a way that doesn't generate encoding state problems on the
>> receiving end by ending transmission half-way through a
>> shift block.
>>
>> ...
>>
>> We don't have many such specialized implementations in the stdlib,
>> but this doesn't mean that there's no use for them. It
>> just means that developers and users are simply unaware of the
>> possibilities opened by these stateful stream APIs.
>
> Does at least one codec implement such an optimization in its
> StreamReader or StreamWriter class? And can't we implement such
> optimizations in incremental encoders and decoders (or in TextIOWrapper)?

I don't see how, since you need control over the file API methods
in order to implement such optimizations.

OTOH, adding lots of special cases to TextIOWrapper isn't a good
idea either, since these optimizations would then only trigger for
a small number of codecs and completely leave out 3rd party codecs.

> I checked all multibyte codecs (UTF and CJK codecs) and I don't see any
> such optimization. UTF codecs handle the BOM, but don't have anything
> looking like an optimization. CJK codecs use multibytecodec,
> MultibyteStreamReader and MultibyteStreamWriter, which don't look to be
> optimized. But maybe I missed something?

No, you haven't missed such per-codec optimizations. The base classes
implement general purpose support for reading from streams in
chunks, but the support isn't optimized per codec.

For UTF-16 it would e.g. make sense to always read data in blocks
with even sizes, removing the trial-and-error decoding and extra
buffering currently done by the base classes. For UTF-32, the
blocks should have size % 4 == 0.

For UTF-8 (and other variable length encodings) it would make
sense to look at the end of the (bytes) data read from the
stream to see whether a complete code point was read or not,
rather than simply running the decoder on the complete data
set, only to find that a few bytes at the end are missing.

For single-byte encodings, it would make sense to prefetch
data in big chunks and skip all the trial-and-error decoding
implemented by the base classes to address the above problem
with variable length encodings.

Finally, all this could be implemented in C, reducing the
Python call overhead dramatically.

> TextIOWrapper has an advanced buffer algorithm to prefetch (readahead)
> some bytes at each read to speed up small reads. It is difficult to
> implement such an algorithm, but it's done and it works.
>
> --
>
> Ok, let's stop speaking about theoretical optimizations, and let's do a
> benchmark to compare the codecs and io modules on reading files!

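To make the chunking ideas above a bit more concrete, here is a rough
sketch of what a UTF-8-aware reader could do: inspect the tail of each
block and only hand complete sequences to the decoder. This is not code
from the stdlib; the names utf8_complete_prefix_len, aligned_chunk_size
and read_utf8_chunks are made up for illustration, and the tail check
assumes the data is otherwise valid UTF-8.

```python
import io


def utf8_complete_prefix_len(data: bytes) -> int:
    """Length of the longest prefix that does not end inside a UTF-8
    sequence. Hypothetical helper; assumes the rest of data is valid."""
    n = len(data)
    # A sequence is at most 4 bytes, so only the last 4 bytes matter.
    for back in range(1, min(4, n) + 1):
        byte = data[n - back]
        if byte & 0xC0 == 0x80:
            continue  # continuation byte, keep looking for the lead byte
        if byte < 0x80:
            need = 1
        elif byte >> 5 == 0b110:
            need = 2
        elif byte >> 4 == 0b1110:
            need = 3
        elif byte >> 3 == 0b11110:
            need = 4
        else:
            return n  # invalid lead byte: let the decoder report it
        # Keep the tail if the last sequence is complete, else cut before it.
        return n if back >= need else n - back
    return n  # empty input or only continuation bytes


def aligned_chunk_size(requested: int, unit: int) -> int:
    """Round requested up to a multiple of unit (2 for UTF-16, 4 for
    UTF-32). For UTF-16 a surrogate pair can still straddle two blocks."""
    return -(-requested // unit) * unit


def read_utf8_chunks(stream, chunk_size=4096):
    """Yield decoded text, never running the decoder on a partial tail."""
    pending = b""
    while True:
        block = stream.read(chunk_size)
        if not block:
            break
        data = pending + block
        cut = utf8_complete_prefix_len(data)
        pending = data[cut:]
        if cut:
            yield data[:cut].decode("utf-8")
    if pending:
        # Truncated input: decode strictly so the error surfaces.
        yield pending.decode("utf-8")
```

Usage would look like "".join(read_utf8_chunks(io.BytesIO(raw))). A
StreamReader subclass could apply the same check inside its read()
method, which is the kind of control over the file API that an
incremental decoder alone does not have.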
That's somewhat unfair: TextIOWrapper is implemented in C,
whereas the StreamReader/Writer subclasses used by the
codecs are written in Python.

A fair comparison would use the Python implementation of
TextIOWrapper.

-- 
Marc-Andre Lemburg
eGenix.com
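
A minimal sketch of such a Python-vs-Python comparison, assuming the
pure-Python io implementation that CPython ships as the _pyio module.
The function names and the choice to iterate line by line are
assumptions made for illustration, not a benchmark anyone in this
thread has run; no timings are claimed here.

```python
import _pyio
import codecs
import io
import timeit


def lines_with_streamreader(raw: bytes, encoding: str = "utf-8") -> list:
    """Read all lines through the codec's Python-level StreamReader."""
    reader = codecs.getreader(encoding)(io.BytesIO(raw))
    return list(reader)


def lines_with_pyio(raw: bytes, encoding: str = "utf-8") -> list:
    """Read all lines through the pure-Python TextIOWrapper."""
    # newline="" disables newline translation, matching StreamReader.
    wrapper = _pyio.TextIOWrapper(io.BytesIO(raw), encoding=encoding,
                                  newline="")
    return list(wrapper)


def benchmark(raw: bytes, repeat: int = 5) -> dict:
    """Return the best-of-repeat wall time in seconds for each reader."""
    return {
        "codecs.StreamReader": min(timeit.repeat(
            lambda: lines_with_streamreader(raw), number=1, repeat=repeat)),
        "_pyio.TextIOWrapper": min(timeit.repeat(
            lambda: lines_with_pyio(raw), number=1, repeat=repeat)),
    }


if __name__ == "__main__":
    # Sample data is an arbitrary choice; real files would be better.
    sample = ("h\u00e9llo w\u00f6rld \u20ac\n" * 20000).encode("utf-8")
    for name, seconds in benchmark(sample).items():
        print(f"{name}: {seconds:.4f}s")
```

Both readers sit on the same io.BytesIO object type, so the only
variable is the text layer. The sample uses "\n" line endings only,
since the two readers do not agree on every other line boundary.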