Re: Python vs. Java gzip performance

2006-03-22 Thread Felipe Almeida Lessa
On Wed, 2006-03-22 at 00:47 +0100, Martin v. Löwis wrote:
 Caleb Hattingh wrote:
  What does .readlines() do differently that makes it so much slower
  than .read().splitlines(True)?  To me, the one obvious way to do it
  is .readlines().
[snip]
 Anyway, decompressing the entire file at once lets zlib operate at the
 highest efficiency.

Then there should be a fast-path on readlines like this:

def readlines(self, sizehint=None):
    if sizehint is None:
        return self.read().splitlines(True)
    # ...

Is it okay? Or is there any embedded problem I couldn't see?
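
One way to try this idea without editing gzip.py is a small subclass. What follows is a minimal sketch for a current Python 3, not the 2006 module; the class name FastGzipFile is made up for illustration, and the fast path still loads the whole decompressed file into memory. One known difference: splitlines(True) also breaks on a bare "\r", while readlines() on a binary file only breaks on "\n", so the two can disagree on files with old Mac-style line endings.

```python
import gzip


class FastGzipFile(gzip.GzipFile):
    """GzipFile with a read-everything fast path for readlines().

    The class name is hypothetical; it is not part of the gzip module.
    """

    def readlines(self, sizehint=-1):
        # No size hint: decompress everything in one go and split afterwards.
        if sizehint is None or sizehint <= 0:
            return self.read().splitlines(True)
        # A positive hint falls back to the inherited implementation.
        return super().readlines(sizehint)
```

Used as a drop-in replacement, FastGzipFile('tmp.txt.gz').readlines() should return the same list as the stock class for files that use "\n" line endings.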

-- 
Felipe.

-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Python vs. Java gzip performance

2006-03-22 Thread Martin v. Löwis
Felipe Almeida Lessa wrote:
 def readlines(self, sizehint=None):
     if sizehint is None:
         return self.read().splitlines(True)
     # ...
 
 Is it okay? Or is there any embedded problem I couldn't see?

It's dangerous, if the file is really large - it might exhaust
your memory. Such a setting shouldn't be the default.

Somebody should research what block size works best for zipfiles,
and then compare its performance to reading it all at once.

It would be good if the rationale for using at most 100 bytes at
a time could be discovered.
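
Such a comparison could be set up roughly as follows. This is only a sketch for a current Python 3: the helper names (make_sample, read_in_blocks, read_all), the sample data, and the list of block sizes are all assumptions chosen for illustration, and no timings are claimed here. Note that read_in_blocks still collects everything in memory; it only varies how much is requested per read() call.

```python
import gzip
import io
import timeit


def make_sample(n_lines=20000):
    """Build gzip-compressed sample data in memory (contents are arbitrary)."""
    raw = b"".join(b"%d This is a test\n" % n for n in range(n_lines))
    return gzip.compress(raw)


def read_in_blocks(data, block_size):
    """Decompress by asking for block_size bytes at a time, then split lines."""
    parts = []
    with gzip.GzipFile(fileobj=io.BytesIO(data)) as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            parts.append(block)
    return b"".join(parts).splitlines(True)


def read_all(data):
    """Decompress everything with a single read(), then split lines."""
    with gzip.GzipFile(fileobj=io.BytesIO(data)) as f:
        return f.read().splitlines(True)


if __name__ == "__main__":
    data = make_sample()
    for size in (100, 1024, 8192, 65536):
        t = timeit.timeit(lambda: read_in_blocks(data, size), number=20)
        print("block size %6d: %.4f s" % (size, t))
    t = timeit.timeit(lambda: read_all(data), number=20)
    print("read all at once:  %.4f s" % t)
```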

Regards,
Martin



Re: Python vs. Java gzip performance

2006-03-21 Thread Caleb Hattingh
Hi Peter

Clearly I misunderstood what Martin was saying :) I was comparing
operations on lines via the file generator against first loading the
file's lines into memory, and then performing the concatenation.

What does .readlines() do differently that makes it so much slower
than .read().splitlines(True)?  To me, the one obvious way to do it
is .readlines().

Caleb



Re: Python vs. Java gzip performance

2006-03-21 Thread Martin v. Löwis
Caleb Hattingh wrote:
 What does .readlines() do differently that makes it so much slower
 than .read().splitlines(True)?  To me, the one obvious way to do it
 is .readlines().

readlines reads 100 bytes (at most) at a time. I'm not sure why it
does that (probably in order to not read further ahead than necessary
to get a line (*)), but for gzip, that is terribly inefficient. I
believe the gzip algorithms use a window size much larger than that -
not sure how the gzip library deals with small reads.

One interpretation would be that gzip decompresses the current block
over and over again if the caller only requests 100 bytes each time.
This is a pure guess - you would need to read the zlib source code
to find out.

Anyway, decompressing the entire file at once lets zlib operate at the
highest efficiency.
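
For anyone who wants to experiment with how zlib copes with small pieces, Python's zlib module exposes a streaming decompressor that keeps its state between calls. The sketch below assumes a current Python 3; the function name decompress_in_pieces is made up, and it feeds small slices of the *compressed* data, which is not the same thing as readlines asking for 100 bytes of *decompressed* output. It only shows that piecewise and one-shot decompression give the same result, so the two can then be timed against each other.

```python
import gzip
import zlib


def decompress_in_pieces(data, piece_size):
    """Decompress gzip-format data by feeding piece_size bytes at a time."""
    # MAX_WBITS | 16 tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(zlib.MAX_WBITS | 16)
    out = []
    for start in range(0, len(data), piece_size):
        out.append(decompressor.decompress(data[start:start + piece_size]))
    # Collect anything still buffered inside the decompressor.
    out.append(decompressor.flush())
    return b"".join(out)
```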

Regards,
Martin

(*) Guessing further, it might be that reading a lot at once fails to
work well on a socket, as you would have to wait for the complete data
before even returning the first line.

P.S. Contributions to improve this are welcome.


Re: Python vs. Java gzip performance

2006-03-17 Thread Martin v. Löwis
Bill wrote:
 The Java version of this code is roughly 2x-3x faster than the Python
 version.  I can get around this problem by replacing the Python
 GzipFile object with an os.popen call to gzcat, but then I sacrifice
 portability.  Is there something that can be improved in the Python
 version?

Don't use readline/readlines. Instead, read in larger chunks, and break
it into lines yourself. For example, if you think the entire file should
fit into memory, read it at once.

If that helps, try editing gzip.py to incorporate that approach.
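
A middle ground between 100-byte reads and reading the whole file could look like the sketch below: read the decompressed stream in large chunks and split each chunk into lines, carrying an incomplete last line over to the next chunk. It assumes a current Python 3; the function name iter_lines and the 1 MiB default chunk size are arbitrary choices for illustration, not anything taken from gzip.py.

```python
import gzip


def iter_lines(path, chunk_size=1 << 20):
    """Yield lines (bytes, line ending kept) from a gzip file, chunk by chunk."""
    with gzip.open(path, "rb") as f:
        tail = b""
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            lines = (tail + chunk).splitlines(True)
            # The last piece may be an incomplete line; keep it for later.
            if lines and not lines[-1].endswith(b"\n"):
                tail = lines.pop()
            else:
                tail = b""
            yield from lines
        # Whatever is left is the final line without a trailing newline.
        if tail:
            yield tail
```

Memory use is bounded by the chunk size plus the longest line, so this also works for files that do not fit into memory.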

Regards,
Martin


Re: Python vs. Java gzip performance

2006-03-17 Thread Caleb Hattingh
I tried this:

from timeit import Timer

# Try readlines
print Timer('import gzip; lines = gzip.GzipFile("gztest.txt.gz").readlines(); '
            '[i + "1" for i in lines]').timeit(200)

# Try file object - uses buffering?
print Timer('import gzip; '
            '[i + "1" for i in gzip.GzipFile("gztest.txt.gz")]').timeit(200)

Produces:

3.90938591957
3.98982691765

Doesn't seem much difference, probably because the test file easily
fits into memory, and so disk buffering has no effect.   The file
gztest.txt.gz is a gzipped file with 1000 lines, each being "This is
a test file."



Re: Python vs. Java gzip performance

2006-03-17 Thread Peter Otten
Caleb Hattingh wrote:

 I tried this:
 
 from timeit import Timer
 
 # Try readlines
 print Timer('import gzip; lines = gzip.GzipFile("gztest.txt.gz").readlines(); '
             '[i + "1" for i in lines]').timeit(200)
 
 # Try file object - uses buffering?
 print Timer('import gzip; '
             '[i + "1" for i in gzip.GzipFile("gztest.txt.gz")]').timeit(200)
 
 Produces:
 
 3.90938591957
 3.98982691765
 
 Doesn't seem much difference, probably because the test file easily
 fits into memory, and so disk buffering has no effect.   The file
 gztest.txt.gz is a gzipped file with 1000 lines, each being "This is
 a test file."

$ python -c "file('tmp.txt', 'w').writelines('%d This is a test\n' % n for n in range(1000))"
$ gzip tmp.txt

Now, if you follow Martin's advice:

$ python -m timeit -s "from gzip import GzipFile" "GzipFile('tmp.txt.gz').readlines()"
10 loops, best of 3: 20.4 msec per loop

$ python -m timeit -s "from gzip import GzipFile" "GzipFile('tmp.txt.gz').read().splitlines(True)"
1000 loops, best of 3: 534 usec per loop

Factor 38. Not bad, I'd say :-)

Peter


Re: Python vs. Java gzip performance

2006-03-17 Thread Andrew MacIntyre
Bill wrote:
 I've written a small program that, in part, reads in a file and parses
 it.  Sometimes, the file is gzipped.  The code that I use to get the
 file object is like so:
 
 if filename.endswith(".gz"):
     file = GzipFile(filename)
 else:
     file = open(filename)
 
 Then I parse the contents of the file in the usual way (for line in
 file:...)
 
 The equivalent Java code goes like this:
 
 if (isZipped(aFile)) {
     input = new BufferedReader(new InputStreamReader(
         new GZIPInputStream(new FileInputStream(aFile))));
 } else {
     input = new BufferedReader(new FileReader(aFile));
 }
 
 Then I parse the contents similarly to the Python version (while
 nextLine = input.readLine...)
 
 The Java version of this code is roughly 2x-3x faster than the Python
 version.  I can get around this problem by replacing the Python
 GzipFile object with an os.popen call to gzcat, but then I sacrifice
 portability.  Is there something that can be improved in the Python
 version?

The gzip module is implemented in Python on top of the zlib module.  If
you peruse its source (particularly the readline() method of the GzipFile
class) you might get an idea of what's going on.

popen()ing a gzcat source achieves better performance by shifting the
decompression to an asynchronous execution stream (separate process)
while allowing the standard Python file object's optimised readline()
implementation (in C) to do the line splitting (which is done in Python
code in GzipFile).

I suspect that the Java version probably implements a similar approach
under the covers using threads.

Short of rewriting the gzip module in C, you may get some better
throughput by using a slightly lower level approach to parsing the file:

while 1:
    line = z.readline(size=4096)
    if not line:
        break
    ...  # process line here

This is probably only likely to be of use for files (such as log files)
with lines longer than the 100 character default in the readline()
method.  More intricate approaches using z.readlines(sizehint=size)
might also work.

If you can afford the memory, approaches that read large chunks from the
gzipped stream then line split in one low level operation (so that the
line splitting is mostly done in C code) are the only way to lift
performance.

To me, if the performance matters, using popen() (or better: the
subprocess module) isn't so bad; it is actually quite portable
except for the dependency on gzip (probably better to use gzip -dc
rather than gzcat to maximise portability though).  gzip is available
for most systems, and the approach is easily modified to use bzip2 as
well (though Python's bz2 module is implemented totally in C, and so
probably doesn't have the performance issues that gzip has).
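
A rough sketch of the subprocess variant, for a current Python 3, might look like this. The function name iter_lines_via_gzip is made up for illustration, it assumes a gzip executable is on the PATH, and error handling (such as checking gzip's exit status) is left out to keep it short.

```python
import subprocess


def iter_lines_via_gzip(path):
    """Yield lines (bytes) of a gzip file decompressed by an external gzip."""
    # "gzip -dc" writes the decompressed data to stdout; the pipe is then
    # read line by line through Python's ordinary buffered file object.
    proc = subprocess.Popen(["gzip", "-dc", path], stdout=subprocess.PIPE)
    try:
        for line in proc.stdout:
            yield line
    finally:
        # Close our end of the pipe and reap the child process.
        proc.stdout.close()
        proc.wait()
```

Decompression then runs in a separate process while the Python side only does the line splitting, which is the division of labour described above.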

-
Andrew I MacIntyre               These thoughts are mine alone...
E-mail: [EMAIL PROTECTED] (pref) | Snail: PO Box 370
        [EMAIL PROTECTED] (alt)  |        Belconnen ACT 2616
Web:    http://www.andymac.org/  |        Australia


Re: Python vs. Java gzip performance

2006-03-17 Thread Serge Orlov
Bill wrote:
 Is there something that can be improved in the Python version?

Seems like GzipFile.readlines is not optimized, file.readline works
better:

C:\py>python -c "file('tmp.txt', 'w').writelines('%d This is a test\n' % n for n in range(1))"

C:\py>python -m timeit "open('tmp.txt').readlines()"
100 loops, best of 3: 2.72 msec per loop

C:\py>python -m timeit "open('tmp.txt').readlines(100)"
100 loops, best of 3: 2.74 msec per loop

C:\py>python -m timeit "open('tmp.txt').read().splitlines(True)"
100 loops, best of 3: 2.79 msec per loop

Workaround has been posted already.

  -- Serge.
