Michael Fox added the comment:
I thought about it some more and the only bug here is mine, failing to
explicitly set mode='rt'.
Maybe back when someone invented text and binary modes they should
have been clear which was to be the default for all things. Maybe when
someone made the
Michael Fox added the comment:
I thought of an even more hazardous case:
if compression == 'gz':
    import gzip
    open = gzip.open
elif compression == 'xz':
    import lzma
    open = lzma.open
else:
    pass
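To make the hazard concrete: the built-in open() defaults to text mode, while gzip.open() and lzma.open() default to binary mode, so rebinding open as above silently changes what a read returns. The following is only a minimal sketch; the file names and the first_line helper are hypothetical and exist just for the demonstration.

```python
import gzip
import lzma
import os
import tempfile


def first_line(opener, path, **kwargs):
    """Hypothetical helper: return the first line read through `opener`."""
    with opener(path, **kwargs) as f:
        return f.readline()


# Write the same single line to a plain, a gzip and an xz file.
tmpdir = tempfile.mkdtemp()
plain_path = os.path.join(tmpdir, "sample.txt")
gz_path = os.path.join(tmpdir, "sample.txt.gz")
xz_path = os.path.join(tmpdir, "sample.txt.xz")

with open(plain_path, "w", encoding="utf-8", newline="\n") as f:
    f.write("hello\n")
with gzip.open(gz_path, "wt", encoding="utf-8", newline="\n") as f:
    f.write("hello\n")
with lzma.open(xz_path, "wt", encoding="utf-8", newline="\n") as f:
    f.write("hello\n")

# Default modes: text for the built-in open(), binary for the others.
default_plain = first_line(open, plain_path)
default_gz = first_line(gzip.open, gz_path)
default_xz = first_line(lzma.open, xz_path)

# Asking for text mode explicitly makes all three behave the same way.
explicit_gz = first_line(gzip.open, gz_path, mode="rt", encoding="utf-8")
explicit_xz = first_line(lzma.open, xz_path, mode="rt", encoding="utf-8")

print(type(default_plain).__name__, type(default_gz).__name__,
      type(default_xz).__name__, type(explicit_gz).__name__)
```

Passing the mode explicitly at every call site, rather than relying on each opener's default, avoids the surprise.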
On Mon, May 20, 2013 at 9:41 AM, Michael Fox wrote:
>
Michael Fox added the comment:
You're right. In fact, what doesn't make sense is to be doing
line-oriented reads on a binary file. Why was I doing that?
I do have another quibble though. The open() function is like this:
open(file, mode='r', buffering=-1, encoding=None,
Michael Fox added the comment:
I was thinking about this line:
end = self._buffer.find(b"\n", self._buffer_offset) + 1
Might this be a bug? For example, is there a Unicode encoding where one
of a character's several bytes is '\n'? In that case it would split the
line in the middle of a character, right?
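One way to probe that question is to check the encodings directly. The sketch below assumes the concern is about UTF-8 and UTF-16 specifically; the function and variable names are made up for illustration. In UTF-8 every byte of a multi-byte sequence is 0x80 or higher, so 0x0A should never appear inside one, whereas UTF-16 offers no such guarantee.

```python
NEWLINE = 0x0A


def utf8_multibyte_contains_newline():
    """Return True if any non-ASCII code point has 0x0A in its UTF-8 bytes."""
    for cp in range(0x80, 0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # lone surrogates cannot be encoded as UTF-8
        if NEWLINE in chr(cp).encode("utf-8"):
            return True
    return False


utf8_is_affected = utf8_multibyte_contains_newline()

# UTF-16 counterexample: U+010A is encoded as the bytes 0A 01 (little-endian),
# so a byte-level search for b"\n" hits the middle character of this string.
utf16_data = "a\u010ab".encode("utf-16-le")
utf16_hit = utf16_data.find(b"\n")

print(utf8_is_affected, utf16_hit)
```

So a byte-level search for b"\n" looks safe for UTF-8 data, but it can cut a character in half in UTF-16; decoding through a text-mode wrapper sidesteps the question.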
Michael Fox added the comment:
io.BufferedReader works well for me. Thanks for the good suggestion.
Now python 3.3 and 3.4 have similar performance to each other and they
are only 2x slower than pyliblzma.
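For reference, the wrapping described here can be sketched roughly as follows. The count_lines helper and the sample file are hypothetical, and later Python versions may already buffer inside LZMAFile, in which case the extra wrapper would bring little benefit.

```python
import io
import lzma
import os
import tempfile


def count_lines(path, buffered=True):
    """Count lines in an .xz file, optionally through io.BufferedReader."""
    raw = lzma.LZMAFile(path, "r")
    f = io.BufferedReader(raw) if buffered else raw
    with f:  # closing the BufferedReader also closes the wrapped LZMAFile
        return sum(1 for _ in f)


# Build a small sample file so the sketch is self-contained.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.xz")
with lzma.open(sample_path, "wb") as out:
    for i in range(1000):
        out.write(b"line %d\n" % i)

buffered_count = count_lines(sample_path, buffered=True)
unbuffered_count = count_lines(sample_path, buffered=False)
print(buffered_count, unbuffered_count)
```

Both paths should report the same line count; only the time taken per line is expected to differ.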
From my perspective default wrapping with io.BufferedReader is a great
idea. I ca
Michael Fox added the comment:
I looked into it a little and it looks like pyliblzma is a pure C
extension whereas the new lzma library wraps liblzma but the rest is
Python. In particular this happens for every line:
if size < 0:
    end = self._buffer.find(b"\
Michael Fox added the comment:
3.4 is much better but still 4x slower than 2.7
m@air:~/q/topaz/parse_datalog$ time python2.7 lzmaperf.py
102368
real    0m0.053s
user    0m0.052s
sys     0m0.000s
m@air:~/q/topaz/parse_datalog$ time
~/tmp/cpython-23836f17e4a2/bin/python3.4 lzmaperf.py
102368
New submission from Michael Fox:
import lzma
count = 0
f = lzma.LZMAFile('bigfile.xz', 'r')
for line in f:
    count += 1
print(count)
Comparing python2 with pyliblzma to python3.3.1 with the new lzma:
m@air:~/q/topaz/parse_datalog$ time python lzmaperf.py
102368
r