New submission from Ruben Vorderman <r.h.p.vorder...@lumc.nl>:
Please consider the following code snippet:

    import gzip
    import sys

    with gzip.open(sys.argv[1], "rt") as in_file_h:
        with gzip.open(sys.argv[2], "wt", compresslevel=1) as out_file_h:
            for line in in_file_h:
                # Do processing on line here
                modified_line = line
                # End processing
                out_file_h.write(modified_line)

This is very slow, because write is called for every line. This is the current implementation of write:
https://github.com/python/cpython/blob/c379bc5ec9012cf66424ef3d80612cf13ec51006/Lib/gzip.py#L272

It:
- checks that the file is not closed
- checks that the correct mode is set
- checks that the file is not closed (again, but in a different way)
- checks that the data is bytes, bytearray, or something that supports the buffer protocol
- gets the length
- compresses the data
- updates the size and offset
- updates the checksum

Doing this for every line written is very costly and creates a lot of overhead in Python calls. We spend a lot of time in Python code and much less in the fast C zlib code that does the actual compression.

This problem is already solved on the read side: reading goes through a _GzipReader object, which is wrapped in an io.BufferedReader that serves as the underlying buffer for GzipFile.read. This way, lines are read quite fast from a GzipFile without the checksum etc. being updated on every line read.

A similar solution should be written for the write side. I volunteer (I have done some other work on gzip.py already), although I cannot give an ETA at this time.

----------
messages: 403289
nosy: rhpvorderman
priority: normal
severity: normal
status: open
title: GzipFile.write should be buffered

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue45387>
_______________________________________
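In the meantime, the per-line overhead can be avoided from user code by batching writes, mirroring the read-side design described above. The sketch below (the function name copy_lines and the buffer size are my own, not from the stdlib) wraps the binary GzipFile in an io.BufferedWriter so GzipFile.write is called once per buffer fill rather than once per line:

```python
import gzip
import io


def copy_lines(src_path, dst_path):
    """Copy a gzip file line by line, buffering writes.

    Workaround sketch: instead of gzip.open(dst_path, "wt"), open the
    destination in binary mode and stack io.BufferedWriter (large buffer)
    plus io.TextIOWrapper on top, so the expensive GzipFile.write is
    invoked once per 128 KiB instead of once per line.
    """
    with gzip.open(src_path, "rt") as in_file_h:
        raw = gzip.open(dst_path, "wb", compresslevel=1)
        buffered = io.BufferedWriter(raw, buffer_size=128 * 1024)
        with io.TextIOWrapper(buffered) as out_file_h:
            for line in in_file_h:
                # Do processing on line here
                modified_line = line
                # End processing
                out_file_h.write(modified_line)
```

Closing the TextIOWrapper flushes and closes the BufferedWriter and the underlying GzipFile in turn, so the trailer and checksum are still written correctly. A built-in fix would presumably do the equivalent internally, the way _GzipReader/io.BufferedReader already do for reads.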