Hi David,

Thanks for the solutions below: most illuminating. I implemented the
suggestions from your previous message, and the processing time (on my
datasets) is already within acceptable limits for human users. I'll
try your suggestions below as well.

Thanks again,
Ron
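P.S. In case it is useful, the way I'm judging "acceptable" is with a
crude wall-clock harness along the following lines - a minimal sketch
of my own (not from your message), where sample.log.gz is a
placeholder for one of my real archives:

# time-variants.py - run each script named on the command line against
# a fixed archive and report rough wall-clock times.  ARCHIVE is a
# placeholder path; timings include interpreter start-up.
import os
import sys
import time

ARCHIVE = 'sample.log.gz'

for script in sys.argv[1:]:
    start = time.time()
    os.system('%s %s %s' % (sys.executable, script, ARCHIVE))
    print '%-18s %5.2fs' % (script, time.time() - start)

For example: python time-variants.py last-chunk-2.py last-gzip.py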
> -----Original Message-----
> From: David Bolen [mailto:db3l....@gmail.com]
> Sent: Tuesday, May 26, 2009 03:56
> To: python-list@python.org
> Subject: Re: A fast way to read last line of gzip archive ?
>
> "Barak, Ron" <ron.ba...@lsi.com> writes:
>
> > I couldn't really go with the shell utilities approach, as I have
> > no say in my user environment, and thus cannot assume which
> > binaries are installed on the user's machine.
>
> I suppose if you knew your target you could just supply the external
> binaries to go with your application, but I agree that would probably
> be more of a pain than it's worth for the performance gain in real
> world time.
>
> > I'll try and implement your last suggestion, and see if the
> > performance is acceptable to (human) users.
>
> In terms of tuning the third option a bit, I'd play with the tracking
> of the final two chunks (as mentioned in my first response), perhaps
> shrinking the chunk size or only processing a smaller portion of it
> for lines (assuming a reasonable line size) to minimize the final
> loop.  You could also try using splitlines() on the final buffer
> rather than a StringIO wrapper; that incurs a memory hit for the
> constructed list, but processing only a small portion of the buffer
> would minimize it.
>
> I was curious what I could actually achieve, so here are three
> variants that I came up with.
>
> First, this just slightly fine-tunes the chunk tracking and then only
> processes enough final data based on an anticipated maximum line
> length (so if the final line is longer than that, you'll only get the
> final MAX_LINE bytes of that line).  I also found I got better
> performance using a smaller 1024-byte chunk size with GzipFile.read()
> than one MB - not entirely sure why, although it perhaps matches the
> internal buffer size better:
>
> # last-chunk-2.py
>
> import gzip
> import sys
>
> CHUNK_SIZE = 1024
> MAX_LINE = 255
>
> in_file = gzip.open(sys.argv[1], 'r')
>
> chunk = prior_chunk = ''
> while 1:
>     prior_chunk = chunk
>     # Note that CHUNK_SIZE here is in terms of decompressed data
>     chunk = in_file.read(CHUNK_SIZE)
>     if len(chunk) < CHUNK_SIZE:
>         break
>
> if len(chunk) < MAX_LINE:
>     chunk = prior_chunk + chunk
>
> line = chunk.splitlines(True)[-1]
> print 'Last:', line
>
> On the same test set as my last post, this reduced the last-chunk
> timing from about 2.7s to about 2.3s.
>
> Now, if you're willing to play a little looser with the gzip module,
> you can gain quite a bit more.  If you directly call the internal
> _read() method, you can bypass some of the unnecessary processing
> read() does, and go back to larger I/O chunks:
>
> # last-gzip.py
>
> import gzip
> import sys
>
> CHUNK_SIZE = 1024 * 1024
> MAX_LINE = 255
>
> in_file = gzip.open(sys.argv[1], 'r')
>
> chunk = prior_chunk = ''
> while 1:
>     try:
>         # Note that CHUNK_SIZE here is raw (compressed) data size,
>         # not decompressed
>         in_file._read(CHUNK_SIZE)
>     except EOFError:
>         if in_file.extrasize < MAX_LINE:
>             chunk = chunk + in_file.extrabuf
>         else:
>             chunk = in_file.extrabuf
>         break
>
>     chunk = in_file.extrabuf
>     in_file.extrabuf = ''
>     in_file.extrasize = 0
>
> line = chunk[-MAX_LINE:].splitlines(True)[-1]
> print 'Last:', line
>
> Note that in this case, since I was able to bump up CHUNK_SIZE, I
> take a slice to limit the work splitlines() has to do and the size of
> the resulting list.  Using the larger CHUNK_SIZE (and it being raw
> size) will use more memory, so it could be tuned down if necessary.
> Of course, the risk here is that you are dependent on the _read()
> method, and on the internal use of the extrabuf/extrasize attributes,
> which is where _read() places the decompressed data.  In looking
> back, I'm pretty sure this code is safe at least for Python 2.4
> through 3.0, but you'd have to accept some risk in the future.
>
> This approach got me down to 1.48s.
>
> Then, just for the fun of it: once you're playing a little looser
> with the gzip module, note that it is also doing work to compute the
> CRC of the original data for comparison with the decompressed data.
> If you don't mind so much about that (it depends on what you're using
> the line for), you can just do your own raw decompression with the
> zlib module, as in the following code, although I still start with a
> GzipFile() object to avoid having to rewrite the header processing:
>
> # last-decompress.py
>
> import gzip
> import sys
> import zlib
>
> CHUNK_SIZE = 1024 * 1024
> MAX_LINE = 255
>
> decompress = zlib.decompressobj(-zlib.MAX_WBITS)
>
> in_file = gzip.open(sys.argv[1], 'r')
> in_file._read_gzip_header()
>
> chunk = prior_chunk = ''
> while 1:
>     buf = in_file.fileobj.read(CHUNK_SIZE)
>     if not buf:
>         break
>     d_buf = decompress.decompress(buf)
>     # We might not have been at EOF in the read() but still have no
>     # decompressed data if the only remaining data was not original
>     # data
>     if d_buf:
>         prior_chunk = chunk
>         chunk = d_buf
>
> if len(chunk) < MAX_LINE:
>     chunk = prior_chunk + chunk
>
> line = chunk[-MAX_LINE:].splitlines(True)[-1]
> print 'Last:', line
>
> This version got me down to 1.15s.
>
> So in summary, the choices when tested on my system ended up at
> (times in seconds):
>
>     last              26
>     last-chunk         2.7
>     last-chunk-2       2.3
>     last-popen         1.7
>     last-gzip          1.48
>     last-decompress    1.12
>
> So by being willing to mix in some more direct code with the GzipFile
> object, I was able to beat the overhead of shelling out to the faster
> utilities, while remaining in pure Python.
>
> -- David
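[Ron] For completeness, since the last-popen source isn't shown above,
here is my guess at what that variant looks like - a sketch only, not
David's actual code; it assumes zcat and tail are on the PATH, which
is exactly what I can't rely on in my users' environments:

# last-popen.py - my reconstruction (hypothetical) of the
# shell-utility variant from the summary table; requires the external
# zcat and tail binaries.
import subprocess
import sys

zcat = subprocess.Popen(['zcat', sys.argv[1]], stdout=subprocess.PIPE)
tail = subprocess.Popen(['tail', '-1'], stdin=zcat.stdout,
                        stdout=subprocess.PIPE)
zcat.stdout.close()  # let zcat get SIGPIPE if tail exits first
line = tail.communicate()[0]
print 'Last:', line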