Re: Python vs. Java gzip performance
Felipe Almeida Lessa wrote:
> def readlines(self, sizehint=None):
>     if sizehint is None:
>         return self.read().splitlines(True)
>     # ...
>
> Is it okay? Or is there any embedded problem I couldn't see?

It's dangerous, if the file is really large - it might exhaust your memory. Such a setting shouldn't be the default.

Somebody should research what blocking size works best for zipfiles, and then compare that in performance to "read it all at once". It would be good if the rationale for using at most 100 bytes at a time could be discovered.

Regards,
Martin

--
http://mail.python.org/mailman/listinfo/python-list
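As a starting point for that research, here is a minimal sketch (modern Python 3; the file name "bench.txt.gz" is made up for the demo) that times readlines() against read().splitlines(True) on a small gzipped file:

```python
# Sketch: time GzipFile.readlines() against read().splitlines(True).
# Assumptions: Python 3, a throwaway file name "bench.txt.gz".
import gzip
import timeit

# Create a small gzipped test file: 1000 short lines.
with gzip.open("bench.txt.gz", "wb") as f:
    f.writelines(b"%d This is a test\n" % n for n in range(1000))

def via_readlines():
    with gzip.open("bench.txt.gz", "rb") as f:
        return f.readlines()

def via_splitlines():
    with gzip.open("bench.txt.gz", "rb") as f:
        return f.read().splitlines(True)

print("readlines():         %.4fs" % timeit.timeit(via_readlines, number=100))
print("read().splitlines(): %.4fs" % timeit.timeit(via_splitlines, number=100))
```

Varying the number and length of lines (and, on older Pythons, the internal read size) would give the blocking-size data asked for above.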
Re: Python vs. Java gzip performance
On Wed, 2006-03-22 at 00:47 +0100, "Martin v. Löwis" wrote:
> Caleb Hattingh wrote:
> > What does ".readlines()" do differently that makes it so much slower
> > than ".read().splitlines(True)"? To me, the "one obvious way to do it"
> > is ".readlines()".
[snip]
> Anyway, decompressing the entire file at once lets zlib operate at the
> highest efficiency.

Then there should be a fast-path on readlines like this:

def readlines(self, sizehint=None):
    if sizehint is None:
        return self.read().splitlines(True)
    # ...

Is it okay? Or is there any embedded problem I couldn't see?

--
Felipe.
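For experimenting without patching gzip.py, the same fast path can be sketched as a subclass (FastGzipFile is a hypothetical name, not part of the gzip module):

```python
# Sketch of the fast-path readlines() idea as a GzipFile subclass.
import gzip

class FastGzipFile(gzip.GzipFile):
    """Hypothetical subclass adding a read-everything fast path."""

    def readlines(self, sizehint=None):
        if sizehint is None:
            # Fast path: decompress everything at once; splitlines(True)
            # keeps line endings, roughly matching readlines() output.
            # Caveat: bytes.splitlines() also treats bare "\r" as a line
            # break, which plain readlines() does not.
            return self.read().splitlines(True)
        return super().readlines(sizehint)

# Demo (file name is made up):
with gzip.open("fastpath.txt.gz", "wb") as f:
    f.write(b"one\ntwo\nthree\n")

with FastGzipFile("fastpath.txt.gz") as f:
    lines = f.readlines()
```

The caveat in the comment is one candidate for the "embedded problem": the two splitting rules differ on files containing bare carriage returns.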
Re: Python vs. Java gzip performance
Caleb Hattingh wrote:
> What does ".readlines()" do differently that makes it so much slower
> than ".read().splitlines(True)"? To me, the "one obvious way to do it"
> is ".readlines()".

readlines reads 100 bytes (at most) at a time. I'm not sure why it does that (probably in order not to read further ahead than necessary to get a line (*)), but for gzip, that is terribly inefficient. I believe the gzip algorithm uses a window size much larger than that - I'm not sure how the gzip library deals with small reads. One interpretation would be that gzip decompresses the current block over and over again if the caller only requests 100 bytes each time. This is a pure guess - you would need to read the zlib source code to find out.

Anyway, decompressing the entire file at once lets zlib operate at the highest efficiency.

Regards,
Martin

(*) Guessing further, it might be that "read a lot" fails to work well on a socket, as you would have to wait for the complete data before even returning the first line.

P.S. Contributions to improve this are welcome.
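For reference, zlib's streaming interface can be fed small pieces without restarting decompression - the decompressor object keeps its window state between calls. Whether gzip.py's 100-byte reads actually exploit this is the open question above; the sketch below only demonstrates the zlib API:

```python
# Sketch: zlib's streaming decompressor fed 100-byte pieces of compressed
# input. State (including the sliding window) is kept between calls, so
# small feeds need not re-decompress earlier blocks.
import zlib

data = b"This is a test file\n" * 1000
compressed = zlib.compress(data)

d = zlib.decompressobj()
pieces = []
for i in range(0, len(compressed), 100):
    pieces.append(d.decompress(compressed[i:i + 100]))
pieces.append(d.flush())  # drain any buffered output
result = b"".join(pieces)
```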
Re: Python vs. Java gzip performance
Hi Peter

Clearly I misunderstood what Martin was saying :) I was comparing operations on lines via the file generator against first loading the file's lines into memory, and then performing the concatenation.

What does ".readlines()" do differently that makes it so much slower than ".read().splitlines(True)"? To me, the "one obvious way to do it" is ".readlines()".

Caleb
Re: Python vs. Java gzip performance
Bill wrote:
> Is there something that can be improved in the Python version?

Seems like GzipFile.readlines is not optimized; the plain file object's readlines works fine:

C:\py>python -c "file('tmp.txt', 'w').writelines('%d This is a test\n' % n for n in range(1))"

C:\py>python -m timeit "open('tmp.txt').readlines()"
100 loops, best of 3: 2.72 msec per loop

C:\py>python -m timeit "open('tmp.txt').readlines(100)"
100 loops, best of 3: 2.74 msec per loop

C:\py>python -m timeit "open('tmp.txt').read().splitlines(True)"
100 loops, best of 3: 2.79 msec per loop

Workaround has been posted already.

--
Serge.
Re: Python vs. Java gzip performance
Bill wrote:
> I've written a small program that, in part, reads in a file and parses
> it. Sometimes, the file is gzipped. The code that I use to get the
> file object is like so:
>
> if filename.endswith(".gz"):
>     file = GzipFile(filename)
> else:
>     file = open(filename)
>
> Then I parse the contents of the file in the usual way (for line in
> file: ...)
>
> The equivalent Java code goes like this:
>
> if (isZipped(aFile)) {
>     input = new BufferedReader(new InputStreamReader(new
>         GZIPInputStream(new FileInputStream(aFile))));
> } else {
>     input = new BufferedReader(new FileReader(aFile));
> }
>
> Then I parse the contents similarly to the Python version (while
> nextLine = input.readLine...)
>
> The Java version of this code is roughly 2x-3x faster than the Python
> version. I can get around this problem by replacing the Python
> GzipFile object with an os.popen call to gzcat, but then I sacrifice
> portability. Is there something that can be improved in the Python
> version?

The gzip module is implemented in Python on top of the zlib module. If you peruse its source (particularly the readline() method of the GzipFile class) you might get an idea of what's going on.

popen()ing a gzcat source achieves better performance by shifting the decompression to an asynchronous execution stream (a separate process) while allowing the standard Python file object's optimised readline() implementation (in C) to do the line splitting (which is done in Python code in GzipFile). I suspect the Java approach probably implements something similar under the covers using threads.

Short of rewriting the gzip module in C, you may get some better throughput by using a slightly lower level approach to parsing the file:

while 1:
    line = z.readline(size=4096)
    if not line:
        break
    ...  # process line here

This is probably only likely to be of use for files (such as log files) with lines longer than the 100 character default in the readline() method.
More intricate approaches using z.readlines(sizehint=) might also work. If you can afford the memory, approaches that read large chunks from the gzipped stream and then line split in one low level operation (so that the line splitting is mostly done in C code) are the only way to lift performance.

To me, if the performance matters, using popen() (or better: the subprocess module) isn't so bad; it is actually quite portable except for the dependency on gzip (probably better to use "gzip -dc" rather than "gzcat" to maximise portability though). gzip is available for most systems, and the approach is easily modified to use bzip2 as well (though Python's bz2 module is implemented totally in C, and so probably doesn't have the performance issues that gzip has).

--
Andrew I MacIntyre               "These thoughts are mine alone..."
E-mail: [EMAIL PROTECTED] (pref) | Snail: PO Box 370
        [EMAIL PROTECTED] (alt)  |        Belconnen ACT 2616
Web:    http://www.andymac.org/  |        Australia
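The subprocess variant mentioned above can be sketched like this (assumptions: a "gzip" binary on PATH, and a made-up demo file name "subproc.txt.gz"):

```python
# Sketch: decompress via an external "gzip -dc" process, letting the
# pipe's buffered reader do the line splitting in C.
import gzip
import subprocess

# Create a small gzipped file to read back (name is made up).
with gzip.open("subproc.txt.gz", "wb") as f:
    f.write(b"line one\nline two\n")

def gzip_lines(filename):
    """Yield decompressed lines from an external 'gzip -dc' process."""
    proc = subprocess.Popen(["gzip", "-dc", filename],
                            stdout=subprocess.PIPE)
    try:
        for line in proc.stdout:
            yield line
    finally:
        proc.stdout.close()
        proc.wait()

lines = list(gzip_lines("subproc.txt.gz"))
print(lines)
```

The same shape works for bzip2 by swapping in ["bzip2", "-dc", filename].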
Re: Python vs. Java gzip performance
Caleb Hattingh wrote:
> I tried this:
>
> from timeit import *
>
> # Try readlines
> print Timer('import gzip;lines=gzip.GzipFile("gztest.txt.gz").readlines();[i+"1" for i in lines]').timeit(200)  # This is one line
>
> # Try file object - uses buffering?
> print Timer('import gzip;[i+"1" for i in gzip.GzipFile("gztest.txt.gz")]').timeit(200)  # This is one line
>
> Produces:
>
> 3.90938591957
> 3.98982691765
>
> Doesn't seem much difference, probably because the test file easily
> gets into memory, and so disk buffering has no effect. The file
> "gztest.txt.gz" is a gzipped file with 1000 lines, each being "This is
> a test file".

$ python -c "file('tmp.txt', 'w').writelines('%d This is a test\n' % n for n in range(1000))"
$ gzip tmp.txt

Now, if you follow Martin's advice:

$ python -m timeit -s "from gzip import GzipFile" "GzipFile('tmp.txt.gz').readlines()"
10 loops, best of 3: 20.4 msec per loop

$ python -m timeit -s "from gzip import GzipFile" "GzipFile('tmp.txt.gz').read().splitlines(True)"
1000 loops, best of 3: 534 usec per loop

Factor 38. Not bad, I'd say :-)

Peter
Re: Python vs. Java gzip performance
I tried this:

from timeit import *

# Try readlines
print Timer('import gzip;lines=gzip.GzipFile("gztest.txt.gz").readlines();[i+"1" for i in lines]').timeit(200)  # This is one line

# Try file object - uses buffering?
print Timer('import gzip;[i+"1" for i in gzip.GzipFile("gztest.txt.gz")]').timeit(200)  # This is one line

Produces:

3.90938591957
3.98982691765

Doesn't seem much difference, probably because the test file easily gets into memory, and so disk buffering has no effect. The file "gztest.txt.gz" is a gzipped file with 1000 lines, each being "This is a test file".
Re: Python vs. Java gzip performance
Bill wrote:
> The Java version of this code is roughly 2x-3x faster than the Python
> version. I can get around this problem by replacing the Python
> GzipFile object with an os.popen call to gzcat, but then I sacrifice
> portability. Is there something that can be improved in the Python
> version?

Don't use readline/readlines. Instead, read in larger chunks, and break them into lines yourself. For example, if you think the entire file should fit into memory, read it all at once. If that helps, try editing gzip.py to incorporate that approach.

Regards,
Martin
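The chunked approach Martin describes can be sketched as follows (the 64 KiB chunk size and the demo file name "chunked.txt.gz" are arbitrary choices, not tuned values):

```python
# Sketch: read the gzipped stream in large chunks and split lines
# ourselves, so zlib decompresses long runs at full efficiency.
import gzip

def iter_lines(filename, chunk_size=64 * 1024):
    """Yield lines (with trailing newlines) from a gzipped file."""
    buf = b""
    with gzip.open(filename, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            buf += chunk
            parts = buf.split(b"\n")
            buf = parts.pop()          # keep any partial trailing line
            for part in parts:
                yield part + b"\n"
    if buf:
        yield buf                      # last line had no newline

# Demo file (name is made up):
with gzip.open("chunked.txt.gz", "wb") as f:
    f.write(b"alpha\nbeta\ngamma")

lines = list(iter_lines("chunked.txt.gz"))
```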