On Wednesday, 25 May 2011 at 15:43 +0200, M.-A. Lemburg wrote:
> For UTF-16 it would e.g. make sense to always read data in blocks
> with even sizes, removing the trial-and-error decoding and extra
> buffering currently done by the base classes. For UTF-32, the
> blocks should have size % 4 == 0.
>
> For UTF-8 (and other variable length encodings) it would make
> sense looking at the end of the (bytes) data read from the
> stream to see whether a complete code point was read or not,
> rather than simply running the decoder on the complete data
> set, only to find that a few bytes at the end are missing.
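The UTF-8 tail check Lemburg describes can be sketched roughly as follows. This is only an illustration of the idea, not CPython's actual code, and the helper name is hypothetical:

```python
def utf8_incomplete_tail(data: bytes) -> int:
    """Return how many bytes at the end of *data* belong to an
    incomplete UTF-8 sequence (0 if data ends on a code point boundary)."""
    i = len(data)
    n = 0
    # Skip backwards over trailing continuation bytes (0b10xxxxxx).
    while i > 0 and n < 4 and data[i - 1] & 0xC0 == 0x80:
        i -= 1
        n += 1
    if i == 0:
        return 0          # no lead byte in sight; let the decoder handle it
    lead = data[i - 1]
    if lead < 0x80:
        return 0          # ASCII lead byte: sequence length 1, always complete
    elif lead & 0xE0 == 0xC0:
        need = 1          # 2-byte sequence: one continuation byte expected
    elif lead & 0xF0 == 0xE0:
        need = 2          # 3-byte sequence
    elif lead & 0xF8 == 0xF0:
        need = 3          # 4-byte sequence
    else:
        return 0          # invalid lead byte; let the decoder report it
    return 1 + n if n < need else 0
```

A reader of a UTF-8 stream could hold back the last `utf8_incomplete_tail(chunk)` bytes of each chunk and prepend them to the next read, instead of decoding by trial and error.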
I think that the readahead algorithm is much faster than trying to avoid partial input, and partial input is not a problem if you use an incremental decoder.

> For single character encodings, it would make sense to prefetch
> data in big chunks and skip all the trial and error decoding
> implemented by the base classes to address the above problem
> with variable length encodings.

TextIOWrapper implements this optimization using its readahead algorithm.

> That's somewhat unfair: TextIOWrapper is implemented in C,
> whereas the StreamReader/Writer subclasses used by the
> codecs are written in Python.
>
> A fair comparison would use the Python implementation of
> TextIOWrapper.

Do you mean that you would like to reimplement codecs in C? It is not relevant to compare codecs and _pyio, because codecs reuses BufferedReader (of the io module, not of the _pyio module), and io is the main I/O module of Python 3.

But well, as you want, here is a benchmark comparing:

    _pyio.TextIOWrapper(io.open(filename, 'rb'), encoding)

and:

    codecs.open(filename, encoding)

The only change from my previous bench.py script is the test_io() function:

    def test_io(test_func, chunk_size):
        with open(FILENAME, 'rb') as buffered:
            f = _pyio.TextIOWrapper(buffered, ENCODING)
            test_file(f, test_func, chunk_size)
            f.close()

(1) Decode Objects/unicodeobject.c (317336 characters) from utf-8

test_io.readline(): 1193.4 ms
test_codecs.readline(): 1267.9 ms
-> codecs 6% slower than io

test_io.read(1): 21696.4 ms
test_codecs.read(1): 36027.2 ms
-> codecs 66% slower than io

test_io.read(100): 3080.7 ms
test_codecs.read(100): 3901.7 ms
-> codecs 27% slower than io

test_io.read(): 3991.0 ms
test_codecs.read(): 1736.9 ms
-> codecs 130% FASTER than io

(2) Decode README (6613 characters) from ascii

test_io.readline(): 678.1 ms
test_codecs.readline(): 760.5 ms
-> codecs 12% slower than io

test_io.read(1): 13533.2 ms
test_codecs.read(1): 21900.0 ms
-> codecs 62% slower than io

test_io.read(100): 2663.1 ms
test_codecs.read(100): 3270.1 ms
-> codecs 23% slower than io

test_io.read(): 6769.1 ms
test_codecs.read(): 3919.6 ms
-> codecs 73% FASTER than io

(3) Decode Lib/test/cjkencodings/gb18030.txt (501 characters) from gb18030

test_io.readline(): 38.9 ms
test_codecs.readline(): 15.1 ms
-> codecs 157% FASTER than io

test_io.read(1): 369.8 ms
test_codecs.read(1): 302.2 ms
-> codecs 22% FASTER than io

test_io.read(100): 258.2 ms
test_codecs.read(100): 155.1 ms
-> codecs 67% FASTER than io

test_io.read(): 1803.2 ms
test_codecs.read(): 1002.9 ms
-> codecs 80% FASTER than io

_pyio.TextIOWrapper is faster than codecs.StreamReader for readline(), read(1) and read(100) with ASCII and UTF-8; it is slower for gb18030. As in the io vs codecs benchmark, codecs.StreamReader is always faster than _pyio.TextIOWrapper for read().

Victor
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
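Victor's point that partial input is harmless when an incremental decoder is used can be demonstrated with a small sketch using only the standard library (the variable names are mine, not from the thread):

```python
import codecs

# Feed UTF-8 bytes one at a time; the incremental decoder buffers any
# trailing bytes of an unfinished code point until the rest arrives.
decoder = codecs.getincrementaldecoder('utf-8')()
data = 'héllo €'.encode('utf-8')

chunks = [decoder.decode(bytes([b])) for b in data]
chunks.append(decoder.decode(b'', final=True))

# The multi-byte 'é' and '€' come out intact despite the 1-byte chunks.
print(''.join(chunks))
```

A byte fed mid-sequence simply yields an empty string until the code point completes, which is why a readahead reader need not align its reads to code point boundaries.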