Re: Python vs. Java gzip performance

2006-03-22 Thread Martin v. Löwis
Felipe Almeida Lessa wrote:
> def readlines(self, sizehint=None):
>     if sizehint is None:
>         return self.read().splitlines(True)
>     # ...
> 
> Is it okay? Or is there any embedded problem I couldn't see?

It's dangerous if the file is really large: it might exhaust
your memory. Such a setting shouldn't be the default.

Somebody should research what blocking size works best for gzip files,
and then compare that in performance to "read it all at once".

It would be good if the rationale for using at most 100 bytes at
a time could be discovered.
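[A rough way to run that comparison, sketched here in today's Python 3; the file name, line count, and candidate block sizes are arbitrary choices, not anything from the thread:]

```python
import gzip
import os
import tempfile
import timeit

# Build a small gzipped test file so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), "tmp.txt.gz")
with gzip.open(path, "wt") as f:
    f.writelines("%d This is a test\n" % n for n in range(10000))

def read_chunked(size):
    # Decompress the whole stream in fixed-size reads;
    # size=-1 means "read it all at once".
    with gzip.open(path, "rb") as f:
        while f.read(size):
            pass

timings = {}
for size in (100, 4096, 65536, -1):
    timings[size] = timeit.timeit(lambda: read_chunked(size), number=20)
    print("block size %6d: %.4f s" % (size, timings[size]))
```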

Regards,
Martin

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Python vs. Java gzip performance

2006-03-22 Thread Felipe Almeida Lessa
Em Qua, 2006-03-22 às 00:47 +0100, "Martin v. Löwis" escreveu:
> Caleb Hattingh wrote:
> > What does ".readlines()" do differently that makes it so much slower
> > than ".read().splitlines(True)"?  To me, the "one obvious way to do it"
> > is ".readlines()".
[snip]
> Anyway, decompressing the entire file at once lets zlib operate at the
> highest efficiency.

Then there should be a fast-path on readlines like this:

def readlines(self, sizehint=None):
    if sizehint is None:
        return self.read().splitlines(True)
    # ...

Is it okay? Or is there any embedded problem I couldn't see?
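[One way to try that fast path without editing gzip.py is a subclass, sketched in today's Python 3; FastGzipFile is a made-up name, and this is a quick correctness check rather than what the stdlib does:]

```python
import gzip
import os
import tempfile

class FastGzipFile(gzip.GzipFile):
    # Sketch of the proposed fast path: with no sizehint, decompress
    # everything in one pass and let splitlines (C code) do the split.
    def readlines(self, sizehint=None):
        if sizehint is None:
            return self.read().splitlines(True)
        return gzip.GzipFile.readlines(self, sizehint)

# Arbitrary test file so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), "tmp.txt.gz")
with gzip.open(path, "wt") as f:
    f.writelines("%d This is a test\n" % n for n in range(1000))

with FastGzipFile(path) as f:
    fast = f.readlines()
with gzip.GzipFile(path) as f:
    stock = f.readlines()
assert fast == stock  # same lines, fewer small reads
```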

-- 
Felipe.


Re: Python vs. Java gzip performance

2006-03-21 Thread Martin v. Löwis
Caleb Hattingh wrote:
> What does ".readlines()" do differently that makes it so much slower
> than ".read().splitlines(True)"?  To me, the "one obvious way to do it"
> is ".readlines()".

readlines reads 100 bytes (at most) at a time. I'm not sure why it
does that (probably in order to not read further ahead than necessary
to get a line (*)), but for gzip, that is terribly inefficient. I
believe the gzip algorithms use a window size much larger than that -
not sure how the gzip library deals with small reads.

One interpretation would be that gzip decompresses the current block
over and over again if the caller only requests 100 bytes each time.
This is a pure guess - you would need to read the zlib source code
to find out.

Anyway, decompressing the entire file at once lets zlib operate at the
highest efficiency.

Regards,
Martin

(*) Guessing further, it might be that "read a lot" fails to work well 
on a socket, as you would have to wait for the complete data before
even returning the first line.

P.S. Contributions to improve this are welcome.


Re: Python vs. Java gzip performance

2006-03-21 Thread Caleb Hattingh
Hi Peter

Clearly I misunderstood what Martin was saying :)  I was comparing
operations on lines via the file generator against first loading the
file's lines into memory, and then performing the concatenation.

What does ".readlines()" do differently that makes it so much slower
than ".read().splitlines(True)"?  To me, the "one obvious way to do it"
is ".readlines()".

Caleb



Re: Python vs. Java gzip performance

2006-03-17 Thread Serge Orlov
Bill wrote:
> Is there something that can be improved in the Python version?

Seems like GzipFile.readlines is not optimized; the built-in file
object's readlines works better:

C:\py>python -c "file('tmp.txt', 'w').writelines('%d This is a test\n' % n for n in range(1))"

C:\py>python -m timeit "open('tmp.txt').readlines()"
100 loops, best of 3: 2.72 msec per loop

C:\py>python -m timeit "open('tmp.txt').readlines(100)"
100 loops, best of 3: 2.74 msec per loop

C:\py>python -m timeit "open('tmp.txt').read().splitlines(True)"
100 loops, best of 3: 2.79 msec per loop

Workaround has been posted already.

  -- Serge.



Re: Python vs. Java gzip performance

2006-03-17 Thread Andrew MacIntyre
Bill wrote:
> I've written a small program that, in part, reads in a file and parses
> it.  Sometimes, the file is gzipped.  The code that I use to get the
> file object is like so:
> 
> if filename.endswith(".gz"):
> file = GzipFile(filename)
> else:
> file = open(filename)
> 
> Then I parse the contents of the file in the usual way (for line in
> file:...)
> 
> The equivalent Java code goes like this:
> 
> if (isZipped(aFile)) {
> input = new BufferedReader(new InputStreamReader(new
> GZIPInputStream(new FileInputStream(aFile)));
> } else {
> input = new BufferedReader(new FileReader(aFile));
> }
> 
> Then I parse the contents similarly to the Python version (while
> nextLine = input.readLine...)
> 
> The Java version of this code is roughly 2x-3x faster than the Python
> version.  I can get around this problem by replacing the Python
> GzipFile object with a os.popen call to gzcat, but then I sacrifice
> portability.  Is there something that can be improved in the Python
> version?

The gzip module is implemented in Python on top of the zlib module.  If
you peruse its source (particularly the readline() method of the GzipFile
class) you might get an idea of what's going on.

popen()ing a gzcat source achieves better performance by shifting the
decompression to an asynchronous execution stream (separate process)
while allowing the standard Python file object's optimised readline()
implementation (in C) to do the line splitting (which is done in Python
code in GzipFile).

I suspect the Java version probably does something similar under the
covers using threads.

Short of rewriting the gzip module in C, you may get some better
throughput by using a slightly lower level approach to parsing the file:

while 1:
    line = z.readline(size=4096)
    if not line:
        break
    ...  # process line here

This is probably only of use for files (such as log files)
with lines longer than the 100 character default in the readline()
method.  More intricate approaches using z.readlines(sizehint=)
might also work.

If you can afford the memory, approaches that read large chunks from the
gzipped stream then line split in one low level operation (so that the
line splitting is mostly done in C code) are the only way to lift
performance.
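[A sketch of that large-chunk approach in today's Python 3; the helper name and chunk size are arbitrary, and it assumes \n-terminated lines. Big blocks are split with splitlines (C code), and an incomplete trailing line is carried over into the next block:]

```python
import gzip
import os
import tempfile

def iter_lines(path, chunk=64 * 1024):
    # Read large blocks from the decompressed stream so the line
    # splitting happens mostly in C; a partial trailing line is
    # carried into the next block.
    with gzip.open(path, "rb") as f:
        tail = b""
        while True:
            block = f.read(chunk)
            if not block:
                if tail:
                    yield tail  # file didn't end with a newline
                return
            lines = (tail + block).splitlines(True)
            if lines[-1].endswith(b"\n"):
                tail = b""
            else:
                tail = lines.pop()  # incomplete last line
            for line in lines:
                yield line

# Arbitrary test file so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), "tmp.txt.gz")
with gzip.open(path, "wt") as f:
    f.writelines("%d This is a test\n" % n for n in range(1000))

lines = list(iter_lines(path))
```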

To me, if the performance matters, using popen() (or better: the
subprocess module) isn't so bad; it is actually quite portable
except for the dependency on gzip (probably better to use "gzip -dc"
rather than "gzcat" to maximise portability though).  gzip is available
for most systems, and the approach is easily modified to use bzip2 as
well (though Python's bz2 module is implemented totally in C, and so
probably doesn't have the performance issues that gzip has).
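[A portable sketch of that using the subprocess module in today's Python 3; it assumes a gzip executable on PATH, hence the guard, and the file name is arbitrary:]

```python
import gzip
import os
import shutil
import subprocess
import tempfile

# Arbitrary test file so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), "tmp.txt.gz")
with gzip.open(path, "wt") as f:
    f.writelines("%d This is a test\n" % n for n in range(1000))

n_lines = None
if shutil.which("gzip"):  # skip quietly if no gzip executable is installed
    # Decompression runs asynchronously in a separate process; the
    # pipe's line iteration (C code) does the line splitting.
    proc = subprocess.Popen(["gzip", "-dc", path], stdout=subprocess.PIPE)
    n_lines = sum(1 for _ in proc.stdout)
    proc.wait()
    print("lines read:", n_lines)
```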

-
Andrew I MacIntyre "These thoughts are mine alone..."
E-mail: [EMAIL PROTECTED]  (pref) | Snail: PO Box 370
[EMAIL PROTECTED] (alt) |Belconnen ACT 2616
Web:http://www.andymac.org/   |Australia


Re: Python vs. Java gzip performance

2006-03-17 Thread Peter Otten
Caleb Hattingh wrote:

> I tried this:
> 
> from timeit import *
> 
> #Try readlines
> print Timer('import gzip;lines=gzip.GzipFile("gztest.txt.gz").readlines();[i+"1" for i in lines]').timeit(200) # This is one line
> 
> 
> # Try file object - uses buffering?
> print Timer('import gzip;[i+"1" for i in gzip.GzipFile("gztest.txt.gz")]').timeit(200) # This is one line
> 
> Produces:
> 
> 3.90938591957
> 3.98982691765
> 
> Doesn't seem much difference, probably because the test file easily
> gets into memory, and so disk buffering has no effect.   The file
> "gztest.txt.gz" is a gzipped file with 1000 lines, each being "This is
> a test file".

$ python -c"file('tmp.txt', 'w').writelines('%d This is a test\n' % n for n in range(1000))"
$ gzip tmp.txt

Now, if you follow Martin's advice:

$ python -m timeit -s"from gzip import GzipFile" "GzipFile('tmp.txt.gz').readlines()"
10 loops, best of 3: 20.4 msec per loop

$ python -m timeit -s"from gzip import GzipFile" "GzipFile('tmp.txt.gz').read().splitlines(True)"
1000 loops, best of 3: 534 usec per loop

Factor 38. Not bad, I'd say :-)

Peter


Re: Python vs. Java gzip performance

2006-03-17 Thread Caleb Hattingh
I tried this:

from timeit import *

#Try readlines
print Timer('import gzip;lines=gzip.GzipFile("gztest.txt.gz").readlines();[i+"1" for i in lines]').timeit(200) # This is one line


# Try file object - uses buffering?
print Timer('import gzip;[i+"1" for i in gzip.GzipFile("gztest.txt.gz")]').timeit(200) # This is one line

Produces:

3.90938591957
3.98982691765

Doesn't seem much difference, probably because the test file easily
gets into memory, and so disk buffering has no effect.   The file
"gztest.txt.gz" is a gzipped file with 1000 lines, each being "This is
a test file".



Re: Python vs. Java gzip performance

2006-03-17 Thread Martin v. Löwis
Bill wrote:
> The Java version of this code is roughly 2x-3x faster than the Python
> version.  I can get around this problem by replacing the Python
> GzipFile object with a os.popen call to gzcat, but then I sacrifice
> portability.  Is there something that can be improved in the Python
> version?

Don't use readline/readlines. Instead, read in larger chunks, and break
them into lines yourself. For example, if you think the entire file will
fit into memory, read it all at once.
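[For the read-it-all-at-once case, that amounts to the following, sketched in today's Python 3; the file name and contents are arbitrary:]

```python
import gzip
import os
import tempfile

# Arbitrary test file so the snippet is self-contained.
path = os.path.join(tempfile.mkdtemp(), "tmp.txt.gz")
with gzip.open(path, "wt") as f:
    f.writelines("%d This is a test\n" % n for n in range(1000))

# One big read (a single decompression pass), then split into lines
# ourselves; splitlines runs in C.
with gzip.open(path, "rt") as f:
    lines = f.read().splitlines(True)

for line in lines:
    pass  # parse each line here
```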

If that helps, try editing gzip.py to incorporate that approach.

Regards,
Martin