Re: A fast way to read last line of gzip archive ?

2009-05-27 Thread python
Dave,

Not the OP, but really enjoyed your analysis and solution. Excellent
job!!

Thank you!
Malcolm
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: A fast way to read last line of gzip archive ?

2009-05-27 Thread Igor Katson

David Bolen wrote:

> So in summary, the choices when tested on my system ended up at:
>
> last              26 s
> last-chunk        2.7 s
> last-chunk-2      2.3 s
> last-popen        1.7 s
> last-gzip         1.48 s
> last-decompress   1.12 s
>
> So by being willing to mix in some more direct code with the GzipFile
> object, I was able to beat the overhead of shelling out to the faster
> utilities, while remaining in pure Python.
>
> -- David

Though I didn't need this for anything practical, it was a fantastic read. 
Gorgeous, David! Thanks for the definitive post.

--
http://mail.python.org/mailman/listinfo/python-list


RE: A fast way to read last line of gzip archive ?

2009-05-26 Thread Barak, Ron
Hi David, 
Thanks for the solutions below: most illuminating.
I implemented the suggestions from your previous message, and the processing 
time (on my datasets) is already within acceptable human limits.
I'll try your suggestion below.
Thanks again.
Ron.

> -----Original Message-----
> From: David Bolen [mailto:db3l@gmail.com] 
> Sent: Tuesday, May 26, 2009 03:56
> To: python-list@python.org
> Subject: Re: A fast way to read last line of gzip archive ?
> 
> "Barak, Ron"  writes:
> 
> > I couldn't really go with the shell utilities approach, as 
> I have no 
> > say in my user environment, and thus cannot assume which 
> binaries are 
> > install on the user's machine.
> 
> I suppose if you knew your target you could just supply the 
> external binaries to go with your application, but I agree 
> that would probably be more of a pain than its worth for the 
> performance gain in real world time.
> 
> > I'll try and implement your last suggestion, and see if the 
> > performance is acceptable to (human) users.
> 
> In terms of tuning the third option a bit, I'd play with the 
> tracking of the final two chunk (as mentioned in my first 
> response), perhaps shrinking the chunk size or only 
> processing a smaller chunk of it for lines (assuming a 
> reasonable line size) to minimize the final loop.
> You could also try using splitlines() on the final buffer 
> rather than a StringIO wrapper, although that'll have a 
> memory hit for the constructed list but doing a small portion 
> of the buffer would minimize that.
> 
> I was curious what I could actually achieve, so here are 
> three variants that I came up with.
> 
> First, this just fine tunes slightly tracking the chunks and 
> then only processes enough final data based on anticipated 
> maximum length length (so if the final line is longer than 
> that you'll only get the final MAX_LINE bytes of that line).  
> I also found I got better performance using a smaller 1024 
> chunk size with GZipFile.read() than a MB - not entirely sure 
> why although it perhaps matches the internal buffer size
> better:
> 
> # last-chunk-2.py
> 
> import gzip
> import sys
> 
> CHUNK_SIZE = 1024
> MAX_LINE = 255
> 
> in_file = gzip.open(sys.argv[1],'r')
> 
> chunk = prior_chunk = ''
> while 1:
> prior_chunk = chunk
> # Note that CHUNK_SIZE here is in terms of decompressed data
> chunk = in_file.read(CHUNK_SIZE)
> if len(chunk) < CHUNK_SIZE:
> break
> 
> if len(chunk) < MAX_LINE:
> chunk = prior_chunk + chunk
> 
> line = chunk.splitlines(True)[-1]
> print 'Last:', line
> 
> 
> On the same test set as my last post, this reduced the 
> last-chunk timing from about 2.7s to about 2.3s.
> 
> Now, if you're willing to play a little looser with the gzip 
> module, you can gain quite a bit more.  If you directly call 
> the internal _read() method you can bypass some of the 
> unnecessary processing read() does, and go back to larger I/O chunks:
> 
> # last-gzip.py
> 
> import gzip
> import sys
> 
> CHUNK_SIZE = 1024*1024
> MAX_LINE = 255
> 
> in_file = gzip.open(sys.argv[1],'r')
> 
> chunk = prior_chunk = ''
> while 1:
> try:
> # Note that CHUNK_SIZE here is raw data size, not 
> decompressed
> in_file._read(CHUNK_SIZE)
> except EOFError:
> if in_file.extrasize < MAX_LINE:
> chunk = chunk + in_file.extrabuf
> else:
> chunk = in_file.extrabuf
> break
> 
> chunk = in_file.extrabuf
> in_file.extrabuf = ''
> in_file.extrasize = 0
> 
> line = chunk[-MAX_LINE:].splitlines(True)[-1]
> print 'Last:', line
> 
> Note that in this case since I was able to bump up 
> CHUNK_SIZE, I take a slice to limit the work splitlines() has 
> to do and the size of the resulting list.  Using the larger 
> CHUNK_SIZE (and it being raw size) will use more memory, so 
> could be tuned down if necessary.
> 
> Of course, the risk here is that you are dependent on the 
> _read() method, and the internal use of the 
> extrabuf/extrasize attributes, which is where _read() places 
> the decompressed data.  In looking back I'm pretty sure this 
> code is safe at least for Python 2.4 through 3.0, but you'd 
> have to accept some risk in the future.
> 
> This approach got me down to 1.48s.
> 
> Then, just for the fun of it, once you're

Re: A fast way to read last line of gzip archive ?

2009-05-25 Thread David Bolen
"Barak, Ron"  writes:

> I couldn't really go with the shell utilities approach, as I have no
> say in my user environment, and thus cannot assume which binaries
> are installed on the user's machine.

I suppose if you knew your target you could just supply the external
binaries to go with your application, but I agree that would probably
be more of a pain than it's worth for the performance gain in real
world time.

> I'll try and implement your last suggestion, and see if the
> performance is acceptable to (human) users.

In terms of tuning the third option a bit, I'd play with the tracking
of the final two chunks (as mentioned in my first response), perhaps
shrinking the chunk size or only processing a smaller chunk of it for
lines (assuming a reasonable line size) to minimize the final loop.
You could also try using splitlines() on the final buffer rather than
a StringIO wrapper; that has a memory hit for the constructed list,
but processing only a small portion of the buffer would minimize it.

I was curious what I could actually achieve, so here are three variants
that I came up with.

First, this just fine-tunes the tracking of the chunks slightly and
then only processes enough final data based on the anticipated maximum
line length (so if the final line is longer than that you'll only get
the final MAX_LINE bytes of that line).  I also found I got better
performance using a smaller 1024-byte chunk size with GzipFile.read()
than a MB - not entirely sure why, although it perhaps matches the
internal buffer size better:

# last-chunk-2.py

import gzip
import sys

CHUNK_SIZE = 1024
MAX_LINE = 255

in_file = gzip.open(sys.argv[1], 'r')

chunk = prior_chunk = ''
while 1:
    prior_chunk = chunk
    # Note that CHUNK_SIZE here is in terms of decompressed data
    chunk = in_file.read(CHUNK_SIZE)
    if len(chunk) < CHUNK_SIZE:
        break

if len(chunk) < MAX_LINE:
    chunk = prior_chunk + chunk

line = chunk.splitlines(True)[-1]
print 'Last:', line


On the same test set as my last post, this reduced the last-chunk
timing from about 2.7s to about 2.3s.

Now, if you're willing to play a little looser with the gzip module,
you can gain quite a bit more.  If you directly call the internal _read()
method you can bypass some of the unnecessary processing read() does, and
go back to larger I/O chunks:

# last-gzip.py

import gzip
import sys

CHUNK_SIZE = 1024*1024
MAX_LINE = 255

in_file = gzip.open(sys.argv[1], 'r')

chunk = prior_chunk = ''
while 1:
    try:
        # Note that CHUNK_SIZE here is raw data size, not decompressed
        in_file._read(CHUNK_SIZE)
    except EOFError:
        if in_file.extrasize < MAX_LINE:
            chunk = chunk + in_file.extrabuf
        else:
            chunk = in_file.extrabuf
        break

    chunk = in_file.extrabuf
    in_file.extrabuf = ''
    in_file.extrasize = 0

line = chunk[-MAX_LINE:].splitlines(True)[-1]
print 'Last:', line

Note that in this case since I was able to bump up CHUNK_SIZE, I take
a slice to limit the work splitlines() has to do and the size of the
resulting list.  Using the larger CHUNK_SIZE (and it being raw size) will
use more memory, so could be tuned down if necessary.

Of course, the risk here is that you are dependent on the _read()
method, and the internal use of the extrabuf/extrasize attributes,
which is where _read() places the decompressed data.  In looking back
I'm pretty sure this code is safe at least for Python 2.4 through 3.0,
but you'd have to accept some risk in the future.

This approach got me down to 1.48s.
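
If the dependency on those internals worries you, one way to hedge is
to probe for them up front and fall back to the plain (slow but
supported) iteration when they're missing.  A minimal sketch of that
idea (mine, not from the original post), reusing the last-gzip.py
logic above:

# guarded-last-gzip.py (sketch)

import gzip
import sys

CHUNK_SIZE = 1024*1024
MAX_LINE = 255

def last_line(path):
    in_file = gzip.open(path, 'r')
    have_internals = (hasattr(in_file, '_read') and
                      hasattr(in_file, 'extrabuf') and
                      hasattr(in_file, 'extrasize'))
    if not have_internals:
        # Internals changed in this Python version; fall back to the
        # supported API and simply iterate to the final line.
        line = ''
        for line in in_file:
            pass
        return line
    chunk = ''
    while 1:
        try:
            in_file._read(CHUNK_SIZE)
        except EOFError:
            if in_file.extrasize < MAX_LINE:
                chunk = chunk + in_file.extrabuf
            else:
                chunk = in_file.extrabuf
            break
        chunk = in_file.extrabuf
        in_file.extrabuf = ''
        in_file.extrasize = 0
    return chunk[-MAX_LINE:].splitlines(True)[-1]

print 'Last:', last_line(sys.argv[1])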

Then, just for the fun of it, once you're playing a little looser with
the gzip module, it's also doing work to compute the crc of the
original data for comparison with the decompressed data.  If you don't
mind so much about that (depends on what you're using the line for)
you can just do your own raw decompression with the zlib module, as in
the following code, although I still start with a GzipFile() object to
avoid having to rewrite the header processing:

# last-decompress.py

import gzip
import sys
import zlib

CHUNK_SIZE = 1024*1024
MAX_LINE = 255

decompress = zlib.decompressobj(-zlib.MAX_WBITS)

in_file = gzip.open(sys.argv[1], 'r')
in_file._read_gzip_header()

chunk = prior_chunk = ''
while 1:
    buf = in_file.fileobj.read(CHUNK_SIZE)
    if not buf:
        break
    d_buf = decompress.decompress(buf)
    # We might not have been at EOF in the read() but still have no
    # decompressed data if the only remaining data was not original data
    if d_buf:
        prior_chunk = chunk
        chunk = d_buf

if len(chunk) < MAX_LINE:
    chunk = prior_chunk + chunk

line = chunk[-MAX_LINE:].splitlines(True)[-1]
print 'Last:', line

This version got me down to 1.15s.

So in summary, the choices when tested on my system ended up at:

last              26 s
last-chunk        2.7 s
last-chunk-2      2.3 s
last-popen        1.7 s
last-gzip         1.48 s
last-decompress   1.12 s

So by being willing to mix in some more direct code with the GzipFile
object, I was able to beat the overhead of shelling out to the faster
utilities, while remaining in pure Python.

-- David

RE: A fast way to read last line of gzip archive ?

2009-05-24 Thread Barak, Ron
Thanks David: excellent suggestions!
I couldn't really go with the shell utilities approach, as I have no say in my 
user environment, and thus cannot assume which binaries are installed on the 
user's machine.
I'll try and implement your last suggestion, and see if the performance is 
acceptable to (human) users.
Bye,
Ron.

> -----Original Message-----
> From: David Bolen [mailto:db3l@gmail.com] 
> Sent: Monday, May 25, 2009 01:58
> To: python-list@python.org
> Subject: Re: A fast way to read last line of gzip archive ?
> 
> "Barak, Ron"  writes:
> 
> > I thought maybe someone has a way to unzip just the end 
> portion of the 
> > archive (instead of the whole archive), as only the last part is 
> > needed for reading the last line.
> 
> The problem is that gzip compressed output has no reliable 
> intermediate break points that you can jump to and just start 
> decompressing without having worked through the prior data.
> 
> In your specific code, using readlines() is probably not 
> ideal as it will create the full list containing all of the 
> decoded file contents in memory only to let you pick the last 
> one.  So a small optimization would be to just iterate 
> through the file (directly or by calling
> readline()) until you reach the last line.
> 
> However, since you don't care about the bulk of the file, but 
> only need to work with the final line in Python, this is an 
> activity that could be handled more efficiently handled with 
> external tools, as you need not involve much intepreter time 
> to actually decompress/discard the bulk of the file.
> 
> For example, on my system, comparing these two cases:
> 
> # last.py
> 
> import gzip
> import sys
> 
> in_file = gzip.open(sys.argv[1],'r')
> for line in in_file:
> pass
> print 'Last:', line
> 
> 
> # last-popen.py
> 
> import sys
> from subprocess import Popen, PIPE
> 
> # Implement gzip -dc  | tail -1
> gzip = Popen(['gzip', '-dc', sys.argv[1]], stdout=PIPE)
> tail = Popen(['tail', '-1'], stdin=gzip.stdout, stdout=PIPE)
> line = tail.communicate()[0]
> print 'Last:', line
> 
> with an ~80MB log file compressed to about 8MB resulted in 
> last.py taking about 26 seconds, while last-popen took about 
> 1.7s.  Both resulted in the same value in "line".  As long as 
> you have local binaries for gzip/tail (such as Cygwin or 
> MingW or equivalent) this works fine on Windows systems too.
> 
> If you really want to keep everything in Python, then I'd 
> suggest working to optimize the "skip" portion of the task, 
> trying to decompress the bulk of the file as quickly as 
> possible.  For example, one possibility would be something like:
> 
> # last-chunk.py
> 
> import gzip
> import sys
> from cStringIO import StringIO
> 
> in_file = gzip.open(sys.argv[1],'r')
> 
> chunks = ['', '']
> while 1:
> chunk = in_file.read(1024*1024)
> if not chunk:
> break
> del chunks[0]
> chunks.append(chunk)
> 
> data = StringIO(''.join(chunks))
> for line in data:
> pass
> print 'Last:', line
> 
> with the idea that you decode about a MB at a time, holding 
> onto the final two chunks (in case the actual final chunk 
> turns out to be smaller than one of your lines), and then 
> only process those for lines.  There's probably some room for 
> tweaking the mechanism for holding onto just the last two 
> chunks, but I'm not sure it will make a major difference in 
> performance.
> 
> In the same environment of mine as the earlier tests, the 
> above took about 2.7s.  So still much slower than the 
> external utilities in percentage terms, but in absolute 
> terms, a second or so may not be critical for you compared to 
> pure Python.
> 
> -- David
> 
> 
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: A fast way to read last line of gzip archive ?

2009-05-24 Thread David Bolen
"Barak, Ron"  writes:

> I thought maybe someone has a way to unzip just the end portion of
> the archive (instead of the whole archive), as only the last part is
> needed for reading the last line.

The problem is that gzip compressed output has no reliable
intermediate break points that you can jump to and just start
decompressing without having worked through the prior data.

In your specific code, using readlines() is probably not ideal as it
will create the full list containing all of the decoded file contents
in memory only to let you pick the last one.  So a small optimization
would be to just iterate through the file (directly or by calling
readline()) until you reach the last line.

However, since you don't care about the bulk of the file, but only
need to work with the final line in Python, this is an activity that
could be handled more efficiently with external tools, as you need
not involve much interpreter time to actually decompress/discard the
bulk of the file.

For example, on my system, comparing these two cases:

# last.py

import gzip
import sys

in_file = gzip.open(sys.argv[1], 'r')
for line in in_file:
    pass
print 'Last:', line


# last-popen.py

import sys
from subprocess import Popen, PIPE

# Implement gzip -dc <file> | tail -1
gzip = Popen(['gzip', '-dc', sys.argv[1]], stdout=PIPE)
tail = Popen(['tail', '-1'], stdin=gzip.stdout, stdout=PIPE)
line = tail.communicate()[0]
print 'Last:', line

with an ~80MB log file compressed to about 8MB resulted in last.py
taking about 26 seconds, while last-popen took about 1.7s.  Both
resulted in the same value in "line".  As long as you have local
binaries for gzip/tail (such as Cygwin or MinGW or equivalent) this
works fine on Windows systems too.
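
You can also mix the two halves if only one of the binaries is
available - for example, a sketch (my variant, not one of the timed
tests above) that uses an external gzip for the decompression but
keeps the last-line logic in Python, along the lines of the chunk
tracking shown below:

# last-popen-chunk.py (sketch)

import sys
from subprocess import Popen, PIPE

MAX_LINE = 255

gzip_proc = Popen(['gzip', '-dc', sys.argv[1]], stdout=PIPE)

chunk = prior_chunk = ''
while 1:
    data = gzip_proc.stdout.read(1024*1024)
    if not data:
        break
    prior_chunk = chunk
    chunk = data
gzip_proc.wait()

if len(chunk) < MAX_LINE:
    chunk = prior_chunk + chunk
print 'Last:', chunk[-MAX_LINE:].splitlines(True)[-1]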

If you really want to keep everything in Python, then I'd suggest
working to optimize the "skip" portion of the task, trying to
decompress the bulk of the file as quickly as possible.  For example,
one possibility would be something like:

# last-chunk.py

import gzip
import sys
from cStringIO import StringIO

in_file = gzip.open(sys.argv[1], 'r')

chunks = ['', '']
while 1:
    chunk = in_file.read(1024*1024)
    if not chunk:
        break
    del chunks[0]
    chunks.append(chunk)

data = StringIO(''.join(chunks))
for line in data:
    pass
print 'Last:', line

with the idea that you decode about a MB at a time, holding onto the
final two chunks (in case the actual final chunk turns out to be
smaller than one of your lines), and then only process those for
lines.  There's probably some room for tweaking the mechanism for
holding onto just the last two chunks, but I'm not sure it will make
a major difference in performance.

In the same environment of mine as the earlier tests, the above took
about 2.7s.  So still much slower than the external utilities in
percentage terms, but in absolute terms, a second or so may not be
critical for you compared to pure Python.
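
As a concrete example of such a tweak - a sketch only, assuming
Python 2.6+ for deque's maxlen argument and lines no longer than
MAX_LINE bytes - a bounded deque can hold the final two chunks, and
splitlines() over just the tail of the joined buffer replaces the
StringIO pass:

# last-chunk-deque.py (sketch)

import gzip
import sys
from collections import deque

MAX_LINE = 255

in_file = gzip.open(sys.argv[1], 'r')

chunks = deque([''], maxlen=2)   # keeps only the final two chunks
while 1:
    chunk = in_file.read(1024*1024)
    if not chunk:
        break
    chunks.append(chunk)

tail = ''.join(chunks)[-MAX_LINE:]
print 'Last:', tail.splitlines(True)[-1]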

-- David
-- 
http://mail.python.org/mailman/listinfo/python-list


RE: A fast way to read last line of gzip archive ?

2009-05-24 Thread Barak, Ron
 

> -----Original Message-----
> From: garabik-news-2005...@kassiopeia.juls.savba.sk 
> [mailto:garabik-news-2005...@kassiopeia.juls.savba.sk] 
> Sent: Sunday, May 24, 2009 13:37
> To: python-list@python.org
> Subject: Re: A fast way to read last line of gzip archive ?
> 
> Barak, Ron  wrote:
> > 
> > 
> > 
> > I thought maybe someone has a way to unzip just the end 
> portion of the 
> > archive (instead of the whole archive), as only the last part is 
> > needed for reading the last line.
> 
> dictzip (a Python implementation is part of my serpento 
> package) - you have to compress the file with dictzip instead 
> of gzip, though (but a dictzipped file is just a special way of 
> organizing the gzip file, so it remains perfectly compatible 
> with gunzip & co.)

Unfortunately, the gzip archive isn't created by me, and I have no say in how 
it's created.
:-(

Thanks,
Ron.

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: A fast way to read last line of gzip archive ?

2009-05-24 Thread garabik-news-2005-05
Barak, Ron wrote:

> I thought maybe someone has a way to unzip just the end portion of the
> archive (instead of the whole archive), as only the last part is needed
> for reading the last line.

dictzip (a Python implementation is part of my serpento package) -
you have to compress the file with dictzip instead of gzip, though
(but a dictzipped file is just a special way of organizing the gzip
file, so it remains perfectly compatible with gunzip & co.)


-- 
 ---
| Radovan Garabík http://kassiopeia.juls.savba.sk/~garabik/ |
| __..--^^^--..__garabik @ kassiopeia.juls.savba.sk |
 ---
Antivirus alert: file .signature infected by signature virus.
Hi! I'm a signature virus! Copy me into your signature file to help me spread!
-- 
http://mail.python.org/mailman/listinfo/python-list


RE: A fast way to read last line of gzip archive ?

2009-05-23 Thread Barak, Ron
 

> -----Original Message-----
> From: MRAB [mailto:goo...@mrabarnett.plus.com] 
> Sent: Thursday, May 21, 2009 19:02
> To: 'python-list@python.org'
> Subject: Re: A fast way to read last line of gzip archive ?
> 
> Barak, Ron wrote:
> > I need to read the end of a 20 MB gzip archive (to extract 
> > the date from the last line of a gzipped log file).
> > The solution I have below takes noticeable time to reach the 
> > end of the gzip archive.
> > [...]
> > 
> It takes a noticeable time to reach the end because, well, 
> the data is compressed! The compression method used requires 
> the preceding data to be read first.

I thought maybe someone has a way to unzip just the end portion of the archive 
(instead of the whole archive), as only the last part is needed for reading the 
last line.

Bye,
Ron.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: A fast way to read last line of gzip archive ?

2009-05-21 Thread MRAB

Barak, Ron wrote:

> Hi,
> 
> I need to read the end of a 20 MB gzip archive (to extract the date
> from the last line of a gzipped log file).
> The solution I have below takes noticeable time to reach the end of the
> gzip archive.
> 
> Does anyone have a faster solution to read the last line of a gzip archive ?
> 
> Thanks,
> Ron.
> 
> #!/usr/bin/env python
> 
> import gzip
> 
> path = "./a/20/mb/file.tgz"
> 
> in_file = gzip.open(path, "r")
> first_line = in_file.readline()
> print "first_line ==", first_line
> in_file.seek(-500)
> last_line = in_file.readlines()[-1]
> print "last_line ==", last_line


It takes a noticeable time to reach the end because, well, the data is
compressed! The compression method used requires the preceding data to
be read first.
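
A quick way to see this with the underlying zlib machinery (a sketch;
the gzip module wraps the same decompressor): decoding from the start
works, while starting halfway into the compressed bytes fails the
header check or yields nothing usable:

# sequential-only.py (sketch)

import zlib

data = '\n'.join('record %06d' % i for i in range(100000))
blob = zlib.compress(data)

# Sequential decompression from the start: works.
d = zlib.decompressobj()
assert d.decompress(blob) + d.flush() == data

# Starting halfway through the compressed bytes: no usable output.
d2 = zlib.decompressobj()
try:
    tail = d2.decompress(blob[len(blob) // 2:])
    print 'mid-stream start decoded %d bytes (garbage at best)' % len(tail)
except zlib.error, e:
    print 'mid-stream start fails:', e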
--
http://mail.python.org/mailman/listinfo/python-list


A fast way to read last line of gzip archive ?

2009-05-21 Thread Barak, Ron
Hi,

I need to read the end of a 20 MB gzip archive (to extract the date from the 
last line of a gzipped log file).
The solution I have below takes noticeable time to reach the end of the gzip 
archive.

Does anyone have a faster solution to read the last line of a gzip archive ?

Thanks,
Ron.

#!/usr/bin/env python

import gzip

path = "./a/20/mb/file.tgz"

in_file = gzip.open(path, "r")
first_line = in_file.readline()
print "first_line ==",first_line
in_file.seek(-500)
last_line = in_file.readlines()[-1]
print "last_line ==",last_line



-- 
http://mail.python.org/mailman/listinfo/python-list