[issue20962] Rather modest chunk size in gzip.GzipFile

2016-04-22 Thread Martin Panter

Martin Panter added the comment:

Since there doesn’t seem to be much interest here any more, and the current 
code has changed and now uses 8 KiB buffering, I am closing this. In theory, a 
buffer or chunk size parameter could still be added to the new code if there 
were a need.

--
resolution:  -> out of date
status: open -> closed
superseder:  -> Limit decompressed data when reading from LZMAFile and BZ2File
versions:  -Python 3.4

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue20962
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue20962] Rather modest chunk size in gzip.GzipFile

2016-04-22 Thread Martin Panter

Changes by Martin Panter :


--
Removed message: http://bugs.python.org/msg263972







[issue20962] Rather modest chunk size in gzip.GzipFile

2015-04-18 Thread Martin Panter

Martin Panter added the comment:

The gzip (as well as LZMA and bzip) modules should now use buffer and chunk 
sizes of 8 KiB (= io.DEFAULT_BUFFER_SIZE) for most read() and seek() type 
operations.

I have a patch that adds a buffer_size parameter to the three compression 
modules if anyone is interested. It may need a bit of work, e.g. adding the 
parameter to open(), mimicking the built-in open() function when buffer_size=0, 
etc.

I did a quick test of seeking 100 MB into a gzip file, using the original 
Python 3.4.3 module, the current code that uses 8 KiB chunk sizes, and then my 
patched code with various chunk sizes. It looks like 8 KiB is significantly 
better than the previous code. My tests are peaking at about 64 KiB, but I 
guess that depends on the computer (cache etc). Anyway, 8 KiB seems like a good 
compromise without hogging all the fast memory cache or whatever, so I suggest 
we close this bug.

Command line for timing looked like:

python -m timeit -s 'import gzip' \
'gzip.GzipFile("100M.gz", buffer_size=8192).seek(int(100e6))'

Python 3.4.3: 10 loops, best of 3: 2.36 sec per loop
Currently (8 KiB chunking): 10 loops, best of 3: 693 msec per loop
buffer_size=1024: 10 loops, best of 3: 2.46 sec per loop
buffer_size=8192: 10 loops, best of 3: 677 msec per loop
buffer_size=16 * 1024: 10 loops, best of 3: 502 msec per loop
buffer_size=int(60e3): 10 loops, best of 3: 400 msec per loop
buffer_size=64 * 1024: 10 loops, best of 3: 398 msec per loop
buffer_size=int(80e3): 10 loops, best of 3: 406 msec per loop
buffer_size=16 * 8192: 10 loops, best of 3: 469 msec per loop

--
status: open -> pending




[issue20962] Rather modest chunk size in gzip.GzipFile

2015-04-13 Thread Antoine Pitrou

Antoine Pitrou added the comment:

Martin, do you think this is still an issue or has it been fixed by the 
compression refactor?

--




[issue20962] Rather modest chunk size in gzip.GzipFile

2015-03-15 Thread Martin Panter

Martin Panter added the comment:

See also the patch for Issue 23529, which changes over to using BufferedReader 
for GzipFile, BZ2File and LZMAFile. The current patch there also passes a 
buffer_size parameter through to BufferedReader, although it currently defaults 
to io.DEFAULT_BUFFER_SIZE.

--
nosy: +vadmium




[issue20962] Rather modest chunk size in gzip.GzipFile

2014-04-28 Thread William Tisäter

William Tisäter added the comment:

That makes sense.

I proceeded and updated `Lib/gzip.py` to use `io.DEFAULT_BUFFER_SIZE` instead. 
This will change the existing behaviour in two ways:

* Start using 1024 * 8 as buffer size instead of 1024.

* Add one more kwarg (`buffer_size`) to `GzipFile.__init__()`.

Ps. This is my first patch, tell me if I'm missing something.
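A rough sketch of what the proposed signature could look like (hypothetical: the class name and the demo are illustrations, not the attached patch — the real change edits gzip.GzipFile itself and threads the value into its read loops):

```python
import gzip
import io

class BufferedGzipFile(gzip.GzipFile):
    """Hypothetical shape of the change: a buffer_size kwarg defaulting
    to io.DEFAULT_BUFFER_SIZE (8 KiB) instead of the old 1024."""

    def __init__(self, *args, buffer_size=io.DEFAULT_BUFFER_SIZE, **kwargs):
        # The real patch would use this value in read()/seek(); here we
        # only record it to show the intended signature.
        self.buffer_size = buffer_size
        super().__init__(*args, **kwargs)

# Round-trip demo on an in-memory stream
buf = io.BytesIO()
with BufferedGzipFile(fileobj=buf, mode="wb") as f:
    f.write(b"x" * 10_000)
buf.seek(0)
with BufferedGzipFile(fileobj=buf, mode="rb", buffer_size=16 * 1024) as f:
    data = f.read()
```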

--
Added file: http://bugs.python.org/file35074/20962_default-buffer-size.patch




[issue20962] Rather modest chunk size in gzip.GzipFile

2014-04-28 Thread Antoine Pitrou

Antoine Pitrou added the comment:

> So I'd suggest, instead of using a hardcoded value, to simply reuse 
> io.DEFAULT_BUFFER_SIZE.
> That way, if some day we decide to change it, all user code will benefit from 
> the change.

I don't think io.DEFAULT_BUFFER_SIZE makes much sense as a heuristic for the 
gzip module (or compressed files in general). Perhaps gzip should get its own 
DEFAULT_BUFFER_SIZE?

--
nosy: +pitrou




[issue20962] Rather modest chunk size in gzip.GzipFile

2014-04-28 Thread Charles-François Natali

Charles-François Natali added the comment:

> I don't think io.DEFAULT_BUFFER_SIZE makes much sense as a heuristic for the 
> gzip module (or compressed files in general). Perhaps gzip should get its own 
> DEFAULT_BUFFER_SIZE?

Do you mean from a namespace point of view, or from a performance point of view?
Because this size is used to read from and write to the underlying file
object, so using the io default would make sense, no?

Sure, it might not be optimal for compressed files, but I guess that
the optimal value is a function of the compression-level block size and
many other factors, which are just too varied to come up with a
reasonable heuristic.

--




[issue20962] Rather modest chunk size in gzip.GzipFile

2014-04-28 Thread Antoine Pitrou

Antoine Pitrou added the comment:

> Sure, it might not be optimal for compressed files, but I guess that
> the optimal value is a function of the compression-level block size and
> many other factors which are just too varied to come up with a
> reasonable heuristic.

Well, I think that compressed files in general would benefit from a
larger buffer size than plain binary I/O, but that's just a hunch.

--




[issue20962] Rather modest chunk size in gzip.GzipFile

2014-04-28 Thread Skip Montanaro

Skip Montanaro added the comment:

On Mon, Apr 28, 2014 at 1:59 PM, Antoine Pitrou rep...@bugs.python.org wrote:
> Well, I think that compressed files in general would benefit from a
> larger buffer size than plain binary I/O, but that's just a hunch.

I agree. When writing my patch, my (perhaps specious) thinking went like this.

* We have a big-ass file, so we compress it.
* On average, when seeking to another point in that file, we probably
want to go a long way.
* It's possible that operating system read-ahead semantics will make
read performance relatively high.
* That would put more burden on the Python code to be efficient.
* Larger buffer sizes will reduce the amount of Python bytecode which
must be executed.

So, while a filesystem block size of 8192 bytes might represent some sort of
optimal chunk size, in practice I think operating system read-ahead and
post-read processing of the bytes read tend to favor larger chunk sizes.
Hence my naive choice of 16k bytes for _CHUNK_SIZE in my patch.
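The arithmetic behind this is easy to check: the number of Python-level read() calls needed to skip forward through, say, 100 MB of uncompressed data drops linearly with the chunk size (the chunk sizes below are just the ones discussed in this thread):

```python
# Number of read() calls needed to seek 100 MB forward, per chunk size.
target = 100 * 1024 * 1024

def calls_needed(chunk_size):
    # Ceiling division: one final short read covers the remainder.
    return -(-target // chunk_size)

for chunk in (1024, 8192, 16 * 1024, 16 * 8192):
    print(f"{chunk:>7}-byte chunks: {calls_needed(chunk):>6} read() calls")
```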

Skip

--




[issue20962] Rather modest chunk size in gzip.GzipFile

2014-04-28 Thread Charles-François Natali

Charles-François Natali added the comment:

That could make sense, dunno.

Note that the bz2 module uses a hardcoded 8K value.

The buffer size should probably also be passed to the open() call.

Also, the allocation is quite peculiar: it uses an exponential buffer
size, starting at a tiny value:

# Starts small, scales exponentially
self.min_readsize = 100

In short, I think the overall buffering should be rewritten :-)
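The "starts small, scales exponentially" pattern quoted above can be sketched as follows (the start value comes from the quoted snippet and the doubling rule from its comment; the real readline()'s cap and reset behavior are not reproduced here):

```python
# Sketch of the old readline() read schedule: each successive read()
# doubles in size, so short lines stay cheap while long ones amortize.
def read_schedule(n_reads, start=100):
    """Sizes of successive read() attempts under a pure doubling rule."""
    sizes = [start]
    for _ in range(n_reads - 1):
        sizes.append(sizes[-1] * 2)
    return sizes

print(read_schedule(5))  # [100, 200, 400, 800, 1600]
```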

--




[issue20962] Rather modest chunk size in gzip.GzipFile

2014-04-28 Thread Skip Montanaro

Skip Montanaro added the comment:

On Mon, Apr 28, 2014 at 3:08 PM, Charles-François Natali
rep...@bugs.python.org wrote:
> In short, I think the overall buffering should be rewritten :-)

Perhaps so, but I think we should open a separate ticket for that
instead of instituting some feature creep here (no matter how
reasonable the concept or its changes would be).

S

--




[issue20962] Rather modest chunk size in gzip.GzipFile

2014-04-28 Thread Charles-François Natali

Charles-François Natali added the comment:

> Perhaps so, but I think we should open a separate ticket for that
> instead of instituting some feature creep here (no matter how
> reasonable the concept or its changes would be).

Agreed.

The patch looks good to me, so feel free to commit!
(FWIW, gzip apparently uses a fixed 32K read buffer).

--




[issue20962] Rather modest chunk size in gzip.GzipFile

2014-04-27 Thread Charles-François Natali

Charles-François Natali added the comment:

William, thanks for the benchmarks.

Unfortunately this type of benchmark depends on the hardware (disk, SSD, memory 
bandwidth, etc.).

So I'd suggest, instead of using a hardcoded value, simply reusing 
io.DEFAULT_BUFFER_SIZE.
That way, if some day we decide to change it, all user code will benefit from 
the change.

--
nosy: +neologix




[issue20962] Rather modest chunk size in gzip.GzipFile

2014-04-24 Thread William Tisäter

William Tisäter added the comment:

I played around with different file and chunk sizes using the attached benchmark 
script.

After several test runs I think 1024 * 16 would be the biggest win without 
losing too many μs on small seeks. You can find my benchmark output here: 
https://gist.github.com/tiwilliam/11273483

My test data was generated with following commands:

dd if=/dev/random of=10K bs=1024 count=10
dd if=/dev/random of=1M bs=1024 count=1000
dd if=/dev/random of=5M bs=1024 count=5000
dd if=/dev/random of=100M bs=1024 count=100000
dd if=/dev/random of=1000M bs=1024 count=1000000
gzip 10K 1M 5M 100M 1000M
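The same comparison can be reproduced without on-disk dd files, entirely in memory (sizes shrunk so it runs quickly; drain() and the two chunk sizes are illustrative, not the attached benchmark script):

```python
import gzip
import io
import timeit

# 5 MB of compressible data, gzipped into a BytesIO instead of a file.
payload = b"\x00" * (5 * 1024 * 1024)
raw = io.BytesIO()
with gzip.GzipFile(fileobj=raw, mode="wb") as f:
    f.write(payload)

def drain(chunk_size):
    """Read the whole stream chunk_size bytes at a time; return total read."""
    raw.seek(0)
    total = 0
    with gzip.GzipFile(fileobj=raw, mode="rb") as f:
        while True:
            block = f.read(chunk_size)
            if not block:
                break
            total += len(block)
    return total

for chunk in (1024, 16 * 1024):
    elapsed = timeit.timeit(lambda: drain(chunk), number=3)
    print(f"chunk={chunk:>6}: {elapsed:.3f}s")
```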

--
nosy: +tiwilliam
Added file: http://bugs.python.org/file35029/benchmark_20962.py




[issue20962] Rather modest chunk size in gzip.GzipFile

2014-04-21 Thread Antoine Pitrou

Changes by Antoine Pitrou pit...@free.fr:


--
nosy: +nadeem.vawda, serhiy.storchaka




[issue20962] Rather modest chunk size in gzip.GzipFile

2014-04-20 Thread Ezio Melotti

Ezio Melotti added the comment:

You should try with different chunk and file sizes and see what the best 
compromise is.  Tagging as easy in case someone wants to put together a small 
script to benchmark this (maybe it could even be added to 
http://hg.python.org/benchmarks/), or even a patch.

--
keywords: +easy
nosy: +ezio.melotti
stage:  -> needs patch




[issue20962] Rather modest chunk size in gzip.GzipFile

2014-04-20 Thread Skip Montanaro

Skip Montanaro added the comment:

Here's a straightforward patch. I didn't want to change the public API of the 
module, so just defined the chunk size with a leading underscore. Gzip tests 
continue to pass.

--
keywords: +patch
stage: needs patch -> patch review
Added file: http://bugs.python.org/file34987/gzip.diff




[issue20962] Rather modest chunk size in gzip.GzipFile

2014-03-17 Thread Skip Montanaro

New submission from Skip Montanaro:

I've had the opportunity to use the seek() method of the gzip.GzipFile class 
for the first time in the past few days. Wondering why my processing times 
were so slow, I took a look at the code for seek() and read(). It seems 
like the chunk size for reading (1024 bytes) is rather small. I created a 
simple subclass that overrode just seek() and read(), then defined a CHUNK_SIZE 
to be 16 * 8192 bytes (the whole idea of compressing files is that they get 
large, right? seems like most of the time we will want to seek pretty far 
through the file).

Over a small subset of my inputs, I measured about a 2x decrease in run times, 
from about 54s to 26s. I ran using both gzip.GzipFile and my subclass several 
times, measuring the last four runs (two using the stdlib implementation, two 
using my subclass). I measured the total time of the run, the time to process 
each input record, and the time to execute just the seek() call for each 
record. The bulk of the per-record time was in the call to seek(), so by 
reducing that time, I sped up my run-times significantly.

I'm still using 2.7, but other than the usual 2.x-3.x changes, the code looks 
pretty much the same between 2.7 and (at least) 3.3, and the logic involving 
the read size doesn't seem to have changed at all.

I'll try to produce a patch if I have a few minutes, but in the meantime, I've 
attached my modified GzipFile class (produced against 2.7).

--
components: Library (Lib)
files: gzipseek.py
messages: 213883
nosy: skip.montanaro
priority: normal
severity: normal
status: open
title: Rather modest chunk size in gzip.GzipFile
type: performance
versions: Python 3.4, Python 3.5
Added file: http://bugs.python.org/file34466/gzipseek.py
