Re: A fast way to read last line of gzip archive ?
Dave,

Not the OP, but really enjoyed your analysis and solution. Excellent job!! Thank you!

Malcolm
-- http://mail.python.org/mailman/listinfo/python-list
Re: A fast way to read last line of gzip archive ?
So in summary, the choices when tested on my system ended up at:

    last             26
    last-chunk        2.7
    last-chunk-2      2.3
    last-popen        1.7
    last-gzip         1.48
    last-decompress   1.12

So by being willing to mix in some more direct code with the GzipFile object, I was able to beat the overhead of shelling out to the faster utilities, while remaining in pure Python.

-- David

Though I didn't need this practically, it was a fantastic read. Gorgeous, David! Thanks for the definitive post.
RE: A fast way to read last line of gzip archive ?
Hi David,

Thanks for the solutions below: most illuminating. I implemented your previous message's suggestions, and already the processing time (on my datasets) is acceptable for human users. I'll try your suggestion below.

Thanks again.
Ron.

> -----Original Message-----
> From: David Bolen [mailto:db3l@gmail.com]
> Sent: Tuesday, May 26, 2009 03:56
> To: python-list@python.org
> Subject: Re: A fast way to read last line of gzip archive ?
>
> [...]
Re: A fast way to read last line of gzip archive ?
"Barak, Ron" writes:

> I couldn't really go with the shell utilities approach, as I have no
> say in my user environment, and thus cannot assume which binaries are
> installed on the user's machine.

I suppose if you knew your target you could just supply the external binaries to go with your application, but I agree that would probably be more of a pain than it's worth for the performance gain in real-world time.

> I'll try and implement your last suggestion, and see if the
> performance is acceptable to (human) users.

In terms of tuning the third option a bit, I'd play with the tracking of the final two chunks (as mentioned in my first response), perhaps shrinking the chunk size or only processing a smaller piece of it for lines (assuming a reasonable line size) to minimize the final loop. You could also try using splitlines() on the final buffer rather than a StringIO wrapper, although that'll have a memory hit for the constructed list; doing only a small portion of the buffer would minimize that.

I was curious what I could actually achieve, so here are three variants that I came up with.

First, this just fine-tunes the chunk tracking slightly and then only processes enough final data based on an anticipated maximum line length (so if the final line is longer than that, you'll only get the final MAX_LINE bytes of that line). I also found I got better performance using a smaller 1024-byte chunk size with GzipFile.read() than a MB - not entirely sure why, although it perhaps matches the internal buffer size better:

    # last-chunk-2.py

    import gzip
    import sys

    CHUNK_SIZE = 1024
    MAX_LINE = 255

    in_file = gzip.open(sys.argv[1], 'r')

    chunk = prior_chunk = ''
    while 1:
        prior_chunk = chunk
        # Note that CHUNK_SIZE here is in terms of decompressed data
        chunk = in_file.read(CHUNK_SIZE)
        if len(chunk) < CHUNK_SIZE:
            break

    if len(chunk) < MAX_LINE:
        chunk = prior_chunk + chunk

    line = chunk.splitlines(True)[-1]
    print 'Last:', line

On the same test set as my last post, this reduced the last-chunk timing from about 2.7s to about 2.3s.

Now, if you're willing to play a little looser with the gzip module, you can gain quite a bit more. If you directly call the internal _read() method you can bypass some of the unnecessary processing read() does, and go back to larger I/O chunks:

    # last-gzip.py

    import gzip
    import sys

    CHUNK_SIZE = 1024*1024
    MAX_LINE = 255

    in_file = gzip.open(sys.argv[1], 'r')

    chunk = prior_chunk = ''
    while 1:
        try:
            # Note that CHUNK_SIZE here is raw data size, not decompressed
            in_file._read(CHUNK_SIZE)
        except EOFError:
            if in_file.extrasize < MAX_LINE:
                chunk = chunk + in_file.extrabuf
            else:
                chunk = in_file.extrabuf
            break

        chunk = in_file.extrabuf
        in_file.extrabuf = ''
        in_file.extrasize = 0

    line = chunk[-MAX_LINE:].splitlines(True)[-1]
    print 'Last:', line

Note that in this case, since I was able to bump up CHUNK_SIZE, I take a slice to limit the work splitlines() has to do and the size of the resulting list. Using the larger CHUNK_SIZE (and it being raw size) will use more memory, so it could be tuned down if necessary.

Of course, the risk here is that you are dependent on the _read() method, and the internal use of the extrabuf/extrasize attributes, which is where _read() places the decompressed data. In looking back I'm pretty sure this code is safe at least for Python 2.4 through 3.0, but you'd have to accept some risk in the future.

This approach got me down to 1.48s.

Then, just for the fun of it: once you're playing a little looser with the gzip module, note that it's also doing work to compute the CRC of the original data for comparison with the decompressed data. If you don't mind so much about that (depends on what you're using the line for), you can just do your own raw decompression with the zlib module, as in the following code, although I still start with a GzipFile() object to avoid having to rewrite the header processing:

    # last-decompress.py

    import gzip
    import sys
    import zlib

    CHUNK_SIZE = 1024*1024
    MAX_LINE = 255

    decompress = zlib.decompressobj(-zlib.MAX_WBITS)
    in_file = gzip.open(sys.argv[1], 'r')
    in_file._read_gzip_header()

    chunk = prior_chunk = ''
    while 1:
        buf = in_file.fileobj.read(CHUNK_SIZE)
        if not buf:
            break
        d_buf = decompress.decompress(buf)
        # We might not have been at EOF in the read() but still have no
        # decompressed data if the only remaining data was not original data
        if d_buf:
            prior_chunk = chunk
            chunk = d_buf

    if len(chunk) < MAX_LINE:
        chunk = prior_chunk + chunk

    line = chunk[-MAX_LINE:].splitlines(True)[-1]
    print 'Last:', line

This version got me down to 1.15s.
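A note for readers on current Python: the _read()/extrabuf internals used above were removed when the gzip module was rewritten, so those two variants no longer run on modern Python 3. The same keep-the-last-two-chunks idea can be sketched with only the public zlib API; the chunk size and MAX_LINE bound are arbitrary choices carried over from the originals, and like last-decompress.py this skips CRC verification entirely:

```python
import sys
import zlib

CHUNK_SIZE = 1024 * 1024  # raw (compressed) bytes per read
MAX_LINE = 255            # assumed upper bound on line length

def last_line(path):
    # wbits = MAX_WBITS | 16 tells zlib to expect gzip framing, so no
    # manual header parsing is needed. Keeping the final two decompressed
    # chunks guarantees the last line survives even if it straddles a
    # chunk boundary.
    decomp = zlib.decompressobj(zlib.MAX_WBITS | 16)
    prior = chunk = b''
    with open(path, 'rb') as f:
        while True:
            buf = f.read(CHUNK_SIZE)
            if not buf:
                break
            d_buf = decomp.decompress(buf)
            if d_buf:
                prior, chunk = chunk, d_buf
    chunk += decomp.flush()
    if len(chunk) < MAX_LINE:
        chunk = prior + chunk
    return chunk[-MAX_LINE:].splitlines()[-1]

if __name__ == '__main__' and len(sys.argv) > 1:
    print('Last:', last_line(sys.argv[1]))
```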
RE: A fast way to read last line of gzip archive ?
Thanks David: excellent suggestions!

I couldn't really go with the shell utilities approach, as I have no say in my user environment, and thus cannot assume which binaries are installed on the user's machine.

I'll try and implement your last suggestion, and see if the performance is acceptable to (human) users.

Bye,
Ron.

> -----Original Message-----
> From: David Bolen [mailto:db3l@gmail.com]
> Sent: Monday, May 25, 2009 01:58
> To: python-list@python.org
> Subject: Re: A fast way to read last line of gzip archive ?
>
> [...]
Re: A fast way to read last line of gzip archive ?
"Barak, Ron" writes:

> I thought maybe someone has a way to unzip just the end portion of
> the archive (instead of the whole archive), as only the last part is
> needed for reading the last line.

The problem is that gzip-compressed output has no reliable intermediate break points that you can jump to and just start decompressing without having worked through the prior data.

In your specific code, using readlines() is probably not ideal, as it will create the full list containing all of the decoded file contents in memory, only to let you pick the last one. So a small optimization would be to just iterate through the file (directly or by calling readline()) until you reach the last line.

However, since you don't care about the bulk of the file, but only need to work with the final line in Python, this is an activity that could be handled more efficiently with external tools, as you need not involve much interpreter time to actually decompress/discard the bulk of the file.

For example, on my system, comparing these two cases:

    # last.py

    import gzip
    import sys

    in_file = gzip.open(sys.argv[1], 'r')
    for line in in_file:
        pass
    print 'Last:', line

    # last-popen.py

    import sys
    from subprocess import Popen, PIPE

    # Implement gzip -dc | tail -1
    gzip = Popen(['gzip', '-dc', sys.argv[1]], stdout=PIPE)
    tail = Popen(['tail', '-1'], stdin=gzip.stdout, stdout=PIPE)
    line = tail.communicate()[0]
    print 'Last:', line

with an ~80MB log file compressed to about 8MB resulted in last.py taking about 26 seconds, while last-popen took about 1.7s. Both resulted in the same value in "line". As long as you have local binaries for gzip/tail (such as Cygwin or MinGW or equivalent) this works fine on Windows systems too.

If you really want to keep everything in Python, then I'd suggest working to optimize the "skip" portion of the task, trying to decompress the bulk of the file as quickly as possible. For example, one possibility would be something like:

    # last-chunk.py

    import gzip
    import sys
    from cStringIO import StringIO

    in_file = gzip.open(sys.argv[1], 'r')

    chunks = ['', '']
    while 1:
        chunk = in_file.read(1024*1024)
        if not chunk:
            break
        del chunks[0]
        chunks.append(chunk)

    data = StringIO(''.join(chunks))
    for line in data:
        pass
    print 'Last:', line

with the idea that you decode about a MB at a time, holding onto the final two chunks (in case the actual final chunk turns out to be smaller than one of your lines), and then only process those for lines. There's probably some room for tweaking the mechanism for holding onto just the last two chunks, but I'm not sure it will make a major difference in performance.

In the same environment as the earlier tests, the above took about 2.7s. So still much slower than the external utilities in percentage terms, but in absolute terms, a second or so may not be critical for you compared to pure Python.

-- David
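A small footnote to the pure-iteration variant (last.py): on current Python, collections.deque with maxlen=1 will drain the file iterator in C code and keep only the final item, avoiding both the readlines() list and the explicit loop. A Python 3 sketch; it was not timed here, so any speedup over the plain loop is an assumption:

```python
import gzip
import sys
from collections import deque

def last_line(path):
    # deque(maxlen=1) consumes the iterator, discarding all but the
    # final item - i.e. the last line of the decompressed stream.
    with gzip.open(path, 'rb') as in_file:
        return deque(in_file, maxlen=1)[0]

if __name__ == '__main__' and len(sys.argv) > 1:
    print('Last:', last_line(sys.argv[1]))
```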
RE: A fast way to read last line of gzip archive ?
> -----Original Message-----
> From: garabik-news-2005...@kassiopeia.juls.savba.sk
> [mailto:garabik-news-2005...@kassiopeia.juls.savba.sk]
> Sent: Sunday, May 24, 2009 13:37
> To: python-list@python.org
> Subject: Re: A fast way to read last line of gzip archive ?
>
> Barak, Ron wrote:
> >
> > I thought maybe someone has a way to unzip just the end portion of
> > the archive (instead of the whole archive), as only the last part is
> > needed for reading the last line.
>
> dictzip (python implementation part of my serpento package)
> you have to compress the file with dictzip, instead of gzip, though
> (but a dictzipped file is just a special way of organizing the gzip
> file, so it remains perfectly compatible with gunzip & comp.)

Unfortunately, the gzip archive isn't created by me, and I have no say in how it's created. :-(

Thanks,
Ron.
Re: A fast way to read last line of gzip archive ?
Barak, Ron wrote:
>
> I thought maybe someone has a way to unzip just the end portion of the
> archive (instead of the whole archive), as only the last part is needed
> for reading the last line.

dictzip (python implementation part of my serpento package)
you have to compress the file with dictzip, instead of gzip, though (but a dictzipped file is just a special way of organizing the gzip file, so it remains perfectly compatible with gunzip & comp.)

--
 -----------------------------------------------------------
| Radovan Garabík http://kassiopeia.juls.savba.sk/~garabik/ |
|  __..--^^^--..__    garabik @ kassiopeia.juls.savba.sk    |
 -----------------------------------------------------------
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me spread!
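For anyone curious how a dictzip-style layout buys random access: the essence is that the file is a sequence of independently compressed gzip members, plus an index recording where each member starts, so the reader can jump straight to the member it needs. dictzip proper hides its chunk index in the gzip header's extra field; the sketch below cheats with a side-car index file just to keep things short, and all names in it are made up for illustration:

```python
import gzip
import json

def write_chunked(path, lines, chunk=4):
    # Write the data as several independent gzip members, recording the
    # byte offset where each member begins in a side-car JSON index.
    offsets = []
    with open(path, 'wb') as f:
        for i in range(0, len(lines), chunk):
            offsets.append(f.tell())
            f.write(gzip.compress(b''.join(lines[i:i + chunk])))
    with open(path + '.idx', 'w') as f:
        json.dump(offsets, f)

def last_line(path):
    # Decompress only the final member - no need to read the rest.
    with open(path + '.idx') as f:
        offsets = json.load(f)
    with open(path, 'rb') as f:
        f.seek(offsets[-1])
        return gzip.decompress(f.read()).splitlines()[-1]
```

Since concatenated gzip members are themselves a valid gzip stream, plain gunzip can still decompress the whole file, which is what makes the dictzip layout "perfectly compatible" as described above.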
RE: A fast way to read last line of gzip archive ?
> -----Original Message-----
> From: MRAB [mailto:goo...@mrabarnett.plus.com]
> Sent: Thursday, May 21, 2009 19:02
> To: 'python-list@python.org'
> Subject: Re: A fast way to read last line of gzip archive ?
>
> Barak, Ron wrote:
> > [...]
>
> It takes a noticeable time to reach the end because, well, the data is
> compressed! The compression method used requires the preceding data to
> be read first.

I thought maybe someone has a way to unzip just the end portion of the archive (instead of the whole archive), as only the last part is needed for reading the last line.

Bye,
Ron.
Re: A fast way to read last line of gzip archive ?
Barak, Ron wrote:
> Hi,
>
> I need to read the end of a 20 MB gzip archive (to extract the date
> from the last line of a gzipped log file). The solution I have below
> takes noticeable time to reach the end of the gzip archive.
>
> Does anyone have a faster solution to read the last line of a gzip
> archive?
>
> Thanks,
> Ron.
>
> #!/usr/bin/env python
>
> import gzip
>
> path = "./a/20/mb/file.tgz"
>
> in_file = gzip.open(path, "r")
> first_line = in_file.readline()
> print "first_line ==", first_line
> in_file.seek(-500)
> last_line = in_file.readlines()[-1]
> print "last_line ==", last_line

It takes a noticeable time to reach the end because, well, the data is compressed! The compression method used requires the preceding data to be read first.
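Worth noting for readers on newer Python 3 versions: GzipFile objects there do accept a seek relative to EOF, so the original seek(-500) intent can be written directly - but it only hides the cost described above, since the stream is still decompressed from the start to locate the end. A sketch; the 500-byte window mirrors the original post's guess at maximum line length:

```python
import gzip
import io

def last_line(path, window=500):
    # On newer Python 3, seeking from SEEK_END works on a GzipFile, but
    # the module still decompresses everything up to that point first.
    with gzip.open(path, 'rb') as f:
        f.seek(-window, io.SEEK_END)  # clamps to 0 if the file is smaller
        data = f.read()
    # If the last line is longer than `window`, only its tail is seen.
    return data.splitlines()[-1]
```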
A fast way to read last line of gzip archive ?
Hi,

I need to read the end of a 20 MB gzip archive (to extract the date from the last line of a gzipped log file). The solution I have below takes noticeable time to reach the end of the gzip archive.

Does anyone have a faster solution to read the last line of a gzip archive?

Thanks,
Ron.

    #!/usr/bin/env python

    import gzip

    path = "./a/20/mb/file.tgz"

    in_file = gzip.open(path, "r")
    first_line = in_file.readline()
    print "first_line ==", first_line
    in_file.seek(-500)
    last_line = in_file.readlines()[-1]
    print "last_line ==", last_line