Bugs item #849046, was opened at 2003-11-25 10:45
Message generated for change (Comment added) made by akuchling
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=849046&group_id=5470
Please note that this message will contain a full copy of the comment
thread, including the initial issue submission, for this request,
not just the latest update.

Category: Python Library
Group: Python 2.4
>Status: Closed
>Resolution: Fixed
Priority: 3
Private: No
Submitted By: Ronald Oussoren (ronaldoussoren)
>Assigned to: Bob Ippolito (etrepum)
Summary: gzip.GzipFile is slow

Initial Comment:
gzip.GzipFile is significantly (an order of magnitude) slower than
using the gzip binary. I've been bitten by this several times, and
have replaced "fd = gzip.open('somefile', 'r')" with
"fd = os.popen('gzcat somefile', 'r')" on several occasions.

Would a patch that implemented GzipFile in C have any chance of being
accepted?

----------------------------------------------------------------------

>Comment By: A.M. Kuchling (akuchling)
Date: 2007-01-05 09:42

Message:
Logged In: YES
user_id=11375
Originator: NO

Patch #1281707 improved readline() performance and has been applied.
I'll close this bug; please re-open if there are still performance
issues.

----------------------------------------------------------------------

Comment By: April King (marumari)
Date: 2005-05-04 12:18

Message:
Logged In: YES
user_id=747439

readlines(X) is even worse, as all it does is call readline() X times.

readline() is also biased towards files where each line is shorter
than 100 characters:

    readsize = min(100, size)

So, if a line is longer than that, it calls read(), which calls
_read(), and so on. I've found using popen to be roughly 20x faster
than using the gzip module. That's pretty bad.

----------------------------------------------------------------------

Comment By: Ronald Oussoren (ronaldoussoren)
Date: 2003-12-28 11:25

Message:
Logged In: YES
user_id=580910

Leaving out the assignment sure sped things up, but only because the
input didn't contain lines anymore ;-)

I did an experiment where I replaced self.extrabuf by a list, but that
actually slowed things down. This may be because there seemed to be
very few chunks in the buffer (most of the time just 2).

According to profile.run('testit()') the function below spends about
50% of its time in the readline method:

    def testit():
        fd = gzip.open('testfile.gz', 'r')
        ln = fd.readline()
        cnt = bcnt = 0
        while ln:
            ln = fd.readline()
            cnt += 1
            bcnt += len(ln)
        print bcnt, cnt
        return bcnt, cnt

testfile.gz is a plain text file containing 40K lines of about 70
characters each.

Replacing the 'buffers' list in readline by a string slightly speeds
things up (about 10%). Other experiments did not bring any
improvement. Even writing a simple C function to split the buffer
returned by self.read() didn't help much (splitline(strval, max) ->
(match, rest), where match is strval up to the first newline and at
most max characters, and rest is the remainder of strval).

----------------------------------------------------------------------

Comment By: A.M. Kuchling (akuchling)
Date: 2003-12-23 12:10

Message:
Logged In: YES
user_id=11375

It should be simple to check whether the string operations are
responsible -- comment out the 'self.extrabuf = self.extrabuf + data'
in _add_read_data. If that makes a big difference, then _read should
probably be building a list instead of modifying a string.
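For reference, a minimal sketch of the list-based buffering akuchling
suggests. The class and method names below are invented for
illustration; an actual fix would live inside GzipFile._read and
_add_read_data:

    class ChunkBuffer:
        """Collect decompressed chunks in a list and join them once,
        instead of rebuilding a string on every _add_read_data call."""
        def __init__(self):
            self.chunks = []     # decompressed chunks, joined lazily
            self.extrasize = 0   # total bytes currently buffered

        def add(self, data):
            self.chunks.append(data)   # amortized O(1) append
            self.extrasize += len(data)

        def take(self, size):
            # Join once, hand back up to 'size' bytes, keep the rest.
            buf = ''.join(self.chunks)
            size = min(size, len(buf))
            self.chunks = [buf[size:]]
            self.extrasize = len(buf) - size
            return buf[:size]

The point of the list is that appending is amortized O(1), whereas
'self.extrabuf = self.extrabuf + data' copies the whole buffer on
every chunk.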
----------------------------------------------------------------------

Comment By: Brett Cannon (bcannon)
Date: 2003-12-04 14:51

Message:
Logged In: YES
user_id=357491

Looking at GzipFile.read and ._read, I think a large chunk of time is
burned in the decompression of small chunks of data. It initially
reads and decompresses 1024 bytes, and then, if that read did not hit
EOF, it doubles the read size and continues until EOF is reached, then
finishes up. The problem is that for each read a call to _read is made
that sets up a bunch of objects. I would not be surprised if the
object creation and teardown is hurting the performance. I would also
not be surprised if the reading of small chunks of data is an initial
problem as well. This is all guesswork, though, since I did not run
the profiler on this.

*But*, there might be a good reason for reading small chunks. If you
are decompressing a large file, you might run out of memory very
quickly by reading the whole file into memory *and* decompressing it
at the same time. Reading it in successively larger chunks means you
don't hold the file's entire contents in memory at any one time. So
the question becomes whether overloading memory and thrashing swap is
worth the performance increase.

There is also the option of inlining _read into read, but since it
makes two calls that seems like poor abstraction and thus would most
likely not be accepted as a solution. It might be better to keep the
objects that are used on every call to _read in a temporary attribute
and delete the attribute once the reading is done. Or maybe allow an
optional argument to read() that specifies the initial read size
(which might also be a good way to see whether any of these ideas are
reasonable: just modify the code to read the whole thing and go from
there).

I am in no position to make any of these calls, though, since I never
use gzip. If someone cares to write up a patch to fix any of this, it
will be considered.

----------------------------------------------------------------------

Comment By: Jim Jewett (jimjjewett)
Date: 2003-11-25 17:05

Message:
Logged In: YES
user_id=764593

In the library, I see a fair amount of work that doesn't really do
anything except make sure you're getting exactly a line at a time.
Would it be an option to just read the file in all at once, split it
on newlines, and then loop over the list? (Or read it into a
cStringIO, I suppose.)
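A sketch of the whole-file approach jimjjewett describes, assuming the
uncompressed data fits comfortably in memory ('somefile.gz' is a
placeholder name):

    import gzip

    fd = gzip.open('somefile.gz', 'r')
    try:
        # One large read, then one split: avoids the per-line
        # bookkeeping that readline() does on every call.
        lines = fd.read().splitlines(True)   # True keeps the newlines
    finally:
        fd.close()
    for ln in lines:
        pass  # process each line

This trades memory for speed, which is exactly the concern bcannon
raises above.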
----------------------------------------------------------------------

Comment By: Ronald Oussoren (ronaldoussoren)
Date: 2003-11-25 16:12

Message:
Logged In: YES
user_id=580910

To be more precise:

    $ ls -l gzippedfile
    -rw-r--r--  1 ronald  admin  354581 18 Nov 10:21 gzippedfile
    $ gzip -l gzippedfile
    compressed  uncompr.  ratio  uncompressed_name
        354581   1403838  74.7%  gzippedfile

The file contains about 45K lines of text (about 40 characters/line).

    $ time gzip -dc gzippedfile > /dev/null
    real    0m0.100s
    user    0m0.060s
    sys     0m0.000s

    $ time python read.py gzippedfile > /dev/null
    real    0m3.222s
    user    0m3.020s
    sys     0m0.070s

    $ cat read.py
    #!/usr/bin/env python
    import sys
    import gzip

    fd = gzip.open(sys.argv[1], 'r')
    ln = fd.readline()
    while ln:
        sys.stdout.write(ln)
        ln = fd.readline()

The difference is also significant for larger files (i.e. the
difference is not caused by different startup times).

----------------------------------------------------------------------

Comment By: Ronald Oussoren (ronaldoussoren)
Date: 2003-11-25 16:03

Message:
Logged In: YES
user_id=580910

The files are created using GzipFile. That speed is acceptable because
it happens in a batch job; reading back is the problem, because that
happens on demand while a user is waiting for the results.

gzcat is a *decompression* utility (specifically, it is "gzip -dc"),
so the compression level is irrelevant for this discussion. The Python
code seems to do quite some string manipulation; maybe that is causing
the slowdown (I'm using fd.readline() in a fairly tight loop). I'll do
some profiling to check what is taking so much time.

BTW, I'm doing this on Unix systems (Sun Solaris and Mac OS X).

----------------------------------------------------------------------

Comment By: Jim Jewett (jimjjewett)
Date: 2003-11-25 12:35

Message:
Logged In: YES
user_id=764593

Which compression level are you using? It looks like most of the work
is already done by zlib (which is in C), but GzipFile defaults to
compression level 9. Many other zips (including your gzcat?) default
to a lower (but much faster) compression level.

----------------------------------------------------------------------
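For completeness, a minimal sketch of the popen workaround described
in the initial report (and the one marumari measured at roughly 20x
faster): shell out to gzip itself instead of going through GzipFile.
This is Unix-only, and 'somefile.gz' is a placeholder name:

    import os

    # 'gzip -dc' is what gzcat is shorthand for.
    fd = os.popen('gzip -dc somefile.gz', 'r')
    for ln in fd:
        pass  # process one decompressed line at a time
    fd.close()

The child process does the decompression in C and streams plain text
back through a pipe, which is why it sidesteps the Python-level
buffering discussed in this thread.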