Bill wrote:
> I've written a small program that, in part, reads in a file and parses
> it. Sometimes, the file is gzipped. The code that I use to get the
> file object is like so:
>
>     if filename.endswith(".gz"):
>         file = GzipFile(filename)
>     else:
>         file = open(filename)
>
> Then I parse the contents of the file in the usual way (for line in
> file: ...)
>
> The equivalent Java code goes like this:
>
>     if (isZipped(aFile)) {
>         input = new BufferedReader(new InputStreamReader(
>                 new GZIPInputStream(new FileInputStream(aFile))));
>     } else {
>         input = new BufferedReader(new FileReader(aFile));
>     }
>
> Then I parse the contents similarly to the Python version (while
> nextLine = input.readLine() ...)
>
> The Java version of this code is roughly 2x-3x faster than the Python
> version. I can get around this problem by replacing the Python
> GzipFile object with an os.popen call to gzcat, but then I sacrifice
> portability. Is there something that can be improved in the Python
> version?
The gzip module is implemented in Python on top of the zlib module. If
you peruse its source (particularly the readline() method of the
GzipFile class) you might get an idea of what's going on.

popen()ing a gzcat process achieves better performance by shifting the
decompression to a separate process (an asynchronous execution stream),
while letting the standard Python file object's optimised readline()
implementation (in C) do the line splitting - which is done in Python
code in GzipFile. I suspect the Java version does something similar
under the covers, using threads.

Short of rewriting the gzip module in C, you may get better throughput
by using a slightly lower level approach to parsing the file:

    while 1:
        line = z.readline(size=4096)
        if not line:
            break
        ...  # process line here

This is only likely to be of use for files (such as log files) with
lines longer than the 100 character default in the readline() method.
More intricate approaches using z.readlines(sizehint=<size>) might also
work. If you can afford the memory, approaches that read large chunks
from the gzipped stream and then line split in one low level operation
(so that the line splitting is mostly done in C code) are the only way
to lift performance significantly.

To me, if the performance matters, using popen() (or better: the
subprocess module) isn't so bad; it is actually quite portable except
for the dependency on gzip (probably better to use "gzip -dc" rather
than "gzcat" to maximise portability, though). gzip is available for
most systems, and the approach is easily modified to use bzip2 as well
(though Python's bz2 module is implemented entirely in C, and so
probably doesn't have the performance issues that gzip has).

-------------------------------------------------------------------------
Andrew I MacIntyre                     "These thoughts are mine alone..."
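To make the readlines(sizehint=...) suggestion concrete, here is a
sketch of a batched line counter (the function name, the 64 KiB batch
size, and the counting task are my own illustration, not from the
original post):

```python
import gzip

def count_lines(path, batch_bytes=65536):
    """Count lines in a gzip file, pulling batches of complete lines
    with readlines() so line splitting is amortised over many lines."""
    total = 0
    z = gzip.GzipFile(path)
    try:
        while 1:
            # readlines() with a size hint returns roughly batch_bytes
            # worth of complete lines in one call
            lines = z.readlines(batch_bytes)
            if not lines:
                break
            for line in lines:
                total += 1  # process line here
    finally:
        z.close()
    return total
```

The batch size is a tunable: larger batches mean fewer Python-level
calls per line, at the cost of memory.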
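The "read large chunks, then split in one low level operation" idea
might look like the following sketch (the generator name, chunk size,
and the leftover-carrying logic are my own assumptions):

```python
import gzip

def lines_from_gzip(path, chunk_size=1 << 16):
    """Yield lines from a gzip file by reading big chunks and doing the
    line splitting with a single C-level split() per chunk."""
    z = gzip.GzipFile(path)
    try:
        leftover = b""
        while 1:
            chunk = z.read(chunk_size)
            if not chunk:
                break
            # split the whole chunk at once; the last piece may be an
            # incomplete line, so carry it over to the next chunk
            parts = (leftover + chunk).split(b"\n")
            leftover = parts.pop()
            for part in parts:
                yield part + b"\n"
        if leftover:
            yield leftover  # final line with no trailing newline
    finally:
        z.close()
```

The key point is that the per-line work in Python is reduced to
iterating an already-split list.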
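And a sketch of the subprocess variant of the popen() suggestion
(assumes a gzip executable is on the PATH; the function name is mine):

```python
import subprocess

def lines_via_gzip(path):
    """Yield lines by piping the file through 'gzip -dc' in a child
    process: decompression runs concurrently in the child, and the
    file object on the pipe does the line splitting in C."""
    proc = subprocess.Popen(["gzip", "-dc", path],
                            stdout=subprocess.PIPE)
    try:
        for line in proc.stdout:
            yield line
    finally:
        proc.stdout.close()
        proc.wait()
```

Swapping in ["bzip2", "-dc", path] gives the bzip2 variant mentioned
above with no other changes.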
E-mail: [EMAIL PROTECTED] (pref) | Snail: PO Box 370
        [EMAIL PROTECTED] (alt)  |        Belconnen ACT 2616
Web:    http://www.andymac.org/  |        Australia
--
http://mail.python.org/mailman/listinfo/python-list