[issue46267] gzip.compress incorrectly ignores level parameter

2022-03-02 Thread Ruben Vorderman


Ruben Vorderman added the comment:

ping

--




[issue46267] gzip.compress incorrectly ignores level parameter

2022-01-05 Thread Ruben Vorderman


Change by Ruben Vorderman:


--
keywords: +patch
pull_requests: +28622
stage:  -> patch review
pull_request: https://github.com/python/cpython/pull/30416




[issue46267] gzip.compress incorrectly ignores level parameter

2022-01-05 Thread Ruben Vorderman


New submission from Ruben Vorderman:

def compress(data, compresslevel=_COMPRESS_LEVEL_BEST, *, mtime=None):
    """Compress data in one shot and return the compressed string.

    compresslevel sets the compression level in range of 0-9.
    mtime can be used to set the modification time. The modification time is
    set to the current time by default.
    """
    if mtime == 0:
        # Use zlib as it creates the header with 0 mtime by default.
        # This is faster and with less overhead.
        return zlib.compress(data, level=compresslevel, wbits=31)
    header = _create_simple_gzip_header(compresslevel, mtime)
    trailer = struct.pack("<LL", zlib.crc32(data), (len(data) & 0xffffffff))
    return header + zlib.compress(data, wbits=-15) + trailer

Note that the final zlib.compress call does not pass compresslevel on, so
the level parameter is silently ignored whenever mtime is not 0.



[issue24301] gzip module failing to decompress valid compressed file

2021-12-31 Thread Ruben Vorderman


Ruben Vorderman added the comment:

ping

--




[issue24301] gzip module failing to decompress valid compressed file

2021-11-29 Thread Ruben Vorderman


Change by Ruben Vorderman:


--
keywords: +patch
pull_requests: +28076
stage:  -> patch review
pull_request: https://github.com/python/cpython/pull/29847




[issue24301] gzip module failing to decompress valid compressed file

2021-11-29 Thread Ruben Vorderman


Ruben Vorderman added the comment:

Whoops, sorry, I spoke out of turn. If gzip implements it, it seems only
logical that Python's *gzip* module should too.
I believe it can be fixed quite easily. The code should raise a warning
though. I will make a PR.

--




[issue45875] gzip.decompress performance can be improved with memoryviews

2021-11-29 Thread Ruben Vorderman


Change by Ruben Vorderman:


--
stage:  -> resolved
status: open -> closed




[issue24301] gzip module failing to decompress valid compressed file

2021-11-29 Thread Ruben Vorderman


Ruben Vorderman added the comment:

From the spec:

https://datatracker.ietf.org/doc/html/rfc1952


   2.2. File format

  A gzip file consists of a series of "members" (compressed data
  sets).  The format of each member is specified in the following
  section.  The members simply appear one after another in the file,
  with no additional information before, between, or after them.


Gzip files with garbage after them are corrupted or not spec compliant. 
Therefore the gzip module should raise an error in this case.
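
A quick self-check of the members-in-a-row rule with just the stdlib (two
members written back to back decompress as one stream):

import gzip

# Two members, one after another, form a single valid gzip file.
data = gzip.compress(b"hello ") + gzip.compress(b"world")
assert gzip.decompress(data) == b"hello world"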

--
nosy: +rhpvorderman




[issue45902] Bytes and bytesarrays can be sorted with a much faster count sort.

2021-11-26 Thread Ruben Vorderman


Ruben Vorderman added the comment:

I used it for the median calculation of FASTQ quality scores
(https://en.wikipedia.org/wiki/FASTQ_format). But in the end I used the
frequency table to calculate the median more quickly. So as you say, the
frequency table turned out to be more useful.

Having said that, the usefulness depends on how often 8-bit data is passed
into sorted (bytes, bytearrays, and most Python strings are 8-bit, I
believe). I raised this issue not because I want a .sort() method on bytes
or bytearrays, but mostly because I think Python's sorted function can be
improved with regards to 8-bit data. I think it is an interesting thing to
consider, depending on how often this occurs.

For example:
sorted(b'Let\'s test a proper string now. One that has some value to be sorted.')
and
list(bytes_sort(b'Let\'s test a proper string now. One that has some value to be sorted.'))

These return the same result (a list of integers), but the bytes_sort
implementation is 2.5 times faster. So sorted is not optimally implemented
here.

Since sorted is now basically throwing everything into list.sort, an
alternative codepath using bytes.sort can be considered (if there are
enough use cases for it).

--




[issue45902] Bytes and bytesarrays can be sorted with a much faster count sort.

2021-11-26 Thread Ruben Vorderman


Ruben Vorderman added the comment:

Sorry for the spam. I see I made a typo in the timeit script. Next time I
will be more diligent when making these kinds of reports, triple-checking
everything beforehand and sending it once. I used -c instead of -s, so all
the setup time was also included. This confounds the results.

The proper test commands should be:

python -m timeit -s "from bytes_sort import bytes_sort, bytearray_sort_inplace" "bytes_sort(b'My string here')"

python -m timeit "bytes(sorted(b'My string here'))"

And using just sorted, to purely compare the sorting algorithms without the
overhead of creating a new bytes object:

python -m timeit "sorted(b'My string here')"

Correct comparison results:

# String = b''
using bytes(sorted(...)): 188 nsec
using sorted:             108 nsec
using bytes_sort:         125 nsec  # some overhead here, setting up the count array

# String = b'abc'
using bytes(sorted(...)): 252 nsec
using sorted:             145 nsec
using bytes_sort:         136 nsec  # overhead compared to sorted already negated when sorting 3 items(!)

# String = b'Let\'s test a proper string now. One that has some value to be sorted.'
using bytes(sorted(...)): 1830 nsec (reported as 1.83 usec)
using sorted:             1550 nsec (reported as 1.55 usec)
using bytes_sort:          220 nsec

--




[issue45902] Bytes and bytesarrays can be sorted with a much faster count sort.

2021-11-26 Thread Ruben Vorderman


Ruben Vorderman added the comment:

I changed the Cython script a bit to use a more naive implementation
without memset. Now it is always significantly faster than
bytes(sorted(my_bytes)).

$ python -m timeit -c "from bytes_sort import bytes_sort" "bytes_sort(b'')"
50 loops, best of 5: 495 nsec per loop
$ python -m timeit -c "from bytes_sort import bytes_sort" "bytes_sort(b'abc')"
50 loops, best of 5: 519 nsec per loop
$ python -m timeit -c "from bytes_sort import bytes_sort" "bytes_sort(b'Let\'s test a proper string now. One that has some value to be sorted.')"
50 loops, best of 5: 594 nsec per loop

--




[issue45902] Bytes and bytesarrays can be sorted with a much faster count sort.

2021-11-26 Thread Ruben Vorderman


Ruben Vorderman added the comment:

Also, I didn't know whether this should be in component C API or
Interpreter Core. But I guess this will be implemented as the C API calls
PyBytes_Sort and PyByteArray_SortInplace, so I figured C API is the correct
component here.

--




[issue45902] Bytes and bytesarrays can be sorted with a much faster count sort.

2021-11-26 Thread Ruben Vorderman


New submission from Ruben Vorderman:

Python now uses the excellent timsort for most (all?) of its sorting. But
this is not the fastest sort available for one particular use case.

If the number of possible values in the array is limited, it is possible to
perform a counting sort: https://en.wikipedia.org/wiki/Counting_sort. In a
counting sort each value maps to an integer key corresponding to its
relative value. The values are then counted using key = map_to_key(value);
count_array[key] += 1. Then from this count_array a new array of sorted
values can be constructed. (See the wikipedia article for a more detailed
explanation.)

For the Python bytes and bytearray types this is extremely simple to
implement: all 256 possible byte values can be directly used as keys.

Rough implementation:
- Use the buffer protocol to get a pointer to the internal c-string of the
bytes/bytearray object.
- Declare a count array: Py_ssize_t count_array[256] (use memset to set it
to 0).
- Iterate over the c-string and count each value in the count array:
count_array[buffer[i]] += 1.
- Allocate a new bytes(array) object; in the case of bytearray the sorting
can be performed in place when bytearray.sort() is used.
- Iterate over the count array, get the number of occurrences for each key,
and use memset to insert the sequences of keys into position.


The most obvious way to implement this algorithm will be as
bytearray.sort(), which sorts in place, and as bytes.sort(), which returns
a new sorted bytes object. This is much, much faster than using
bytes(sorted(my_bytes)).
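
For illustration, the algorithm in pure Python (the function name is made
up; the actual proposal is a C implementation behind the buffer protocol):

def bytes_count_sort(data: bytes) -> bytes:
    # One counter per possible byte value.
    counts = [0] * 256
    for value in data:
        counts[value] += 1
    # Emit each byte value as many times as it was counted, in order.
    return b"".join(bytes([value]) * count
                    for value, count in enumerate(counts) if count)

assert bytes_count_sort(b"cabba") == b"aabbc"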

I made a quick cython implementation for speed testing here: 
https://github.com/rhpvorderman/bytes_sort/blob/main/bytes_sort.pyx

Currently to get a sorted bytestring one has to do bytes(sorted(my_bytes)). 

Test results:

# First make sure there is no regression when sorting an empty string
$ python -m timeit -c "from bytes_sort import bytes_sort" "bytes(sorted(b''))"
50 loops, best of 5: 560 nsec per loop
$ python -m timeit -c "from bytes_sort import bytes_sort" "bytes_sort(b'')"
50 loops, best of 5: 565 nsec per loop

# Test result for very small strings
$ python -m timeit -c "from bytes_sort import bytes_sort" "bytes(sorted(b'abc'))"
50 loops, best of 5: 628 nsec per loop
$ python -m timeit -c "from bytes_sort import bytes_sort" "bytes_sort(b'abc')"
50 loops, best of 5: 578 nsec per loop

# Even on a very small already sorted string, a counting sort is faster.

# Test with a proper string
$ python -m timeit -c "from bytes_sort import bytes_sort" "bytes(sorted(b'Let\'s test a proper string now. One that has some value to be sorted.'))"
10 loops, best of 5: 2.32 usec per loop
$ python -m timeit -c "from bytes_sort import bytes_sort" "bytes_sort(b'Let\'s test a proper string now. One that has some value to be sorted.')"
50 loops, best of 5: 674 nsec per loop

More than three times faster!

--
components: C API
messages: 407032
nosy: rhpvorderman
priority: normal
severity: normal
status: open
title: Bytes and bytesarrays can be sorted with a much faster count sort.
type: performance
versions: Python 3.11




[issue45509] Gzip header corruption not properly checked.

2021-11-24 Thread Ruben Vorderman


Ruben Vorderman added the comment:

I have found that using the timeit module provides more precise
measurements.

For a simple gzip header (as returned by gzip.compress or zlib.compress
with wbits=31):
./python -m timeit -s "import io; data = b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\x03\x00\x00\x00\x00\x00\x00\x00\x00\x00'; from gzip import _read_gzip_header" '_read_gzip_header(io.BytesIO(data))'

For a gzip header with FNAME (returned by gzip itself and by Python's
GzipFile):
./python -m timeit -s "import io; data = b'\x1f\x8b\x08\x08j\x1a\x9ea\x02\xffcompressable_file\x00\x03\x00\x00\x00\x00\x00\x00\x00\x00\x00'; from gzip import _read_gzip_header" '_read_gzip_header(io.BytesIO(data))'

For a gzip header with all flags set:
./python -m timeit -s 'import gzip, io; data = b"\x1f\x8b\x08\x1f\x00\x00\x00\x00\x00\xff\x05\x00extraname\x00comment\x00\xe9T"; from gzip import _read_gzip_header' '_read_gzip_header(io.BytesIO(data))'


Since performance is most critical for in-memory compression and
decompression, I have now optimized for the no-flags case.
Before (current main): 50 loops, best of 5: 469 nsec per loop
After (PR): 100 loops, best of 5: 390 nsec per loop

For the most common case of only FNAME set:
before: 20 loops, best of 5: 1.48 usec per loop
after: 20 loops, best of 5: 1.45 usec per loop

For the case where FHCRC is set:
before: 20 loops, best of 5: 1.62 usec per loop
after: 10 loops, best of 5: 2.43 usec per loop

So this PR is now a clear win for decompressing anything that has been
compressed with gzip.compress, and it is neutral for normal file
decompression. There is a performance cost associated with correctly
checking the header, but that is expected; it is better than the
alternative of not checking it.

--
Added file: https://bugs.python.org/file50459/benchmark_gzip_read_header.py




[issue45875] gzip.decompress performance can be improved with memoryviews

2021-11-22 Thread Ruben Vorderman


Ruben Vorderman added the comment:

Tried and failed. It seems that the overhead of creating a new memoryview
object outweighs the performance gained by it.

--




[issue45875] gzip.decompress performance can be improved with memoryviews

2021-11-22 Thread Ruben Vorderman


New submission from Ruben Vorderman:

The current implementation uses a lot of bytestring slicing. While it is
much better than the 3.10 and earlier implementations, it can possibly
still be further improved by using memoryviews instead. I will check this
out.

--
components: Library (Lib)
messages: 406816
nosy: rhpvorderman
priority: normal
severity: normal
status: open
title: gzip.decompress performance can be improved with memoryviews
type: performance
versions: Python 3.11




[issue45509] Gzip header corruption not properly checked.

2021-11-22 Thread Ruben Vorderman


Ruben Vorderman added the comment:

I improved the performance of the patch and added the file used for
benchmarking. I also test the FHCRC changes now.

The benchmark tests headers with different flags, concatenated to a DEFLATE
block with no data and a gzip trailer. The data is fed to gzip.decompress.
Please note that this is the *worst-case* performance overhead: when there
is actual data to decompress the overhead shrinks, and when GzipFile is
used the overhead shrinks as well.

BEFORE (Current main branch):
$ ./python benchmark_gzip_read_header.py 
with_fname
average: 3.01, range: 2.9-4.79 stdev: 0.19
with_noflags
average: 2.99, range: 2.93-3.04 stdev: 0.02
All flags (incl FHCRC)
average: 3.13, range: 3.05-3.16 stdev: 0.02


After (bpo-45509 PR):
with_fname
average: 3.09, range: 3.01-4.63 stdev: 0.16
with_noflags
average: 3.1, range: 3.03-3.38 stdev: 0.04
All flags (incl FHCRC)
average: 4.09, range: 4.05-4.49 stdev: 0.04

An increase of 0.1 microseconds in the most common use cases, roughly 3%.
But now the FNAME field is correctly checked for truncation.

With FHCRC the overhead is increased by 33%. But this is worth it, because
the header is now actually checked, as it should be.

--
Added file: https://bugs.python.org/file50458/benchmark_gzip_read_header.py




[issue45509] Gzip header corruption not properly checked.

2021-11-22 Thread Ruben Vorderman


Ruben Vorderman added the comment:

1. Quite a lot.

I tested it for the two most common use cases:

import timeit
import statistics

WITH_FNAME = """
from gzip import GzipFile, decompress
import io
fileobj = io.BytesIO()
g = GzipFile(fileobj=fileobj, mode='wb', filename='compressable_file')
g.write(b'')
g.close()
data = fileobj.getvalue()
"""

WITH_NO_FLAGS = """
from gzip import decompress
import zlib
data = zlib.compress(b'', wbits=31)
"""

def benchmark(name, setup, loops=1, runs=10):
    print(f"{name}")
    results = [timeit.timeit("decompress(data)", setup, number=loops)
               for _ in range(runs)]
    # Convert to microseconds per call.
    results = [(result / loops) * 1_000_000 for result in results]
    print(f"average: {round(statistics.mean(results), 2)}, "
          f"range: {round(min(results), 2)}-{round(max(results), 2)} "
          f"stdev: {round(statistics.stdev(results), 2)}")


if __name__ == "__main__":
    benchmark("with_fname", WITH_FNAME)
    benchmark("with_noflags", WITH_NO_FLAGS)

BEFORE:

with_fname
average: 3.27, range: 3.21-3.36 stdev: 0.05
with_noflags
average: 3.24, range: 3.14-3.37 stdev: 0.07

AFTER:
with_fname
average: 4.98, range: 4.85-5.14 stdev: 0.1
with_noflags
average: 4.87, range: 4.69-5.05 stdev: 0.1

That is a dramatic increase in overhead (okay, the decompressed data is
empty, but still).

2. I haven't tested this yet, but the regression is quite unacceptable
already.

3. Not that I know of. But if it is set, it is safe to assume they care.
Nevertheless this is a bit of an edge case.

--




[issue45509] Gzip header corruption not properly checked.

2021-11-22 Thread Ruben Vorderman


Ruben Vorderman added the comment:

Ping

--




[issue45509] Gzip header corruption not properly checked.

2021-11-05 Thread Ruben Vorderman


Ruben Vorderman added the comment:

Bump. This is a bug that allows corrupted gzip files to be processed
without error. Therefore I bump this issue in the hope that someone will
review the PR.

--




[issue45507] Small oversight in 3.11 gzip.decompress implementation with regards to backwards compatibility

2021-11-05 Thread Ruben Vorderman


Ruben Vorderman added the comment:

Bump. This is a regression introduced by
https://github.com/python/cpython/pull/27941

--




[issue45507] Small oversight in 3.11 gzip.decompress implementation with regards to backwards compatibility

2021-10-18 Thread Ruben Vorderman


Change by Ruben Vorderman:


--
pull_requests: +27301
pull_request: https://github.com/python/cpython/pull/29029




[issue45509] Gzip header corruption not properly checked.

2021-10-18 Thread Ruben Vorderman


Change by Ruben Vorderman:


--
keywords: +patch
pull_requests: +27300
stage:  -> patch review
pull_request: https://github.com/python/cpython/pull/29028




[issue45509] Gzip header corruption not properly checked.

2021-10-18 Thread Ruben Vorderman


New submission from Ruben Vorderman:

The following headers are currently allowed while being wrong:

- Headers with the FCOMMENT flag set, but with incomplete or missing
COMMENT bytes.
- Headers with the FNAME flag set, but with incomplete or missing NAME
bytes.
- Headers with FHCRC set: the CRC is read, but not checked.

--
components: Library (Lib)
messages: 404174
nosy: rhpvorderman
priority: normal
severity: normal
status: open
title: Gzip header corruption not properly checked.
type: behavior
versions: Python 3.11




[issue45507] Small oversight in 3.11 gzip.decompress implementation with regards to backwards compatibility

2021-10-18 Thread Ruben Vorderman


Ruben Vorderman added the comment:

It turns out there is a bug where the FNAME and/or FCOMMENT flags are set
in the header, but no error is thrown when the NAME and COMMENT fields are
missing.

--




[issue45507] Small oversight in 3.11 gzip.decompress implementation with regards to backwards compatibility

2021-10-18 Thread Ruben Vorderman


Change by Ruben Vorderman:


--
keywords: +patch
pull_requests: +27296
stage:  -> patch review
pull_request: https://github.com/python/cpython/pull/29023




[issue45507] Small oversight in 3.11 gzip.decompress implementation with regards to backwards compatibility

2021-10-18 Thread Ruben Vorderman


New submission from Ruben Vorderman:

A 'struct.error: unpack requires a buffer of 8 bytes' is thrown when a gzip
trailer is truncated, instead of the EOFError thrown in the 3.10 and prior
releases.
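
A minimal reproducer, assuming the shape of the report (exact messages may
vary by version):

import gzip

blob = gzip.compress(b"hello")
gzip.decompress(blob[:-4])  # cut the 8-byte trailer short
# 3.10 and earlier: EOFError
# 3.11 before this fix: struct.error: unpack requires a buffer of 8 bytes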

--
components: Library (Lib)
messages: 404165
nosy: rhpvorderman
priority: normal
severity: normal
status: open
title: Small oversight in 3.11 gzip.decompress implementation with regards to 
backwards compatibility
type: behavior
versions: Python 3.11




[issue45387] GzipFile.write should be buffered

2021-10-06 Thread Ruben Vorderman


Change by Ruben Vorderman:


--
components: +Library (Lib)
type:  -> performance
versions: +Python 3.10, Python 3.11, Python 3.6, Python 3.7, Python 3.8, Python 
3.9




[issue45387] GzipFile.write should be buffered

2021-10-06 Thread Ruben Vorderman


New submission from Ruben Vorderman:

Please consider the following code snippet:

import gzip
import sys

with gzip.open(sys.argv[1], "rt") as in_file_h:
    with gzip.open(sys.argv[2], "wt", compresslevel=1) as out_file_h:
        for line in in_file_h:
            # Do processing on line here
            modified_line = line
            # End processing
            out_file_h.write(modified_line)

This is very slow, due to write being called for every line. This is the 
current implementation of write:
https://github.com/python/cpython/blob/c379bc5ec9012cf66424ef3d80612cf13ec51006/Lib/gzip.py#L272

It:
- Checks if the file is not closed
- Checks if the correct mode is set
- Checks if the file is not closed (again, but in a different way)
- Checks if the data is bytes, bytearray or something that supports the
buffer protocol
- Gets the length
- Compresses the data
- Updates the size and offset
- Updates the checksum

Doing this for every line written is very costly and creates a lot of
overhead in Python calls. We spend a lot of time in Python and a lot less
in the fast C zlib code that does the actual compression.

This problem is already solved on the read side: a _GzipReader object is
used for reading, and it is put in an io.BufferedReader which is used as
the underlying buffer for GzipFile.read. This way, lines are read quite
fast from a GzipFile without the checksum etc. being updated on every line
read.

A similar solution should be written for write.
I volunteer (I have done some other work on gzip.py already), although I
cannot give an ETA at this time.
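
In the meantime, callers can approximate this themselves. A sketch,
assuming io.BufferedWriter accepts the GzipFile as its raw stream (it only
needs a writable object):

import gzip
import io
import sys

# Batch many small writes into large ones before they reach
# GzipFile.write, so the per-call overhead is paid once per buffer
# instead of once per line.
gzip_file = gzip.open(sys.argv[2], "wb", compresslevel=1)
buffered = io.BufferedWriter(gzip_file, buffer_size=128 * 1024)
with gzip.open(sys.argv[1], "rt") as in_file_h, \
        io.TextIOWrapper(buffered) as out_file_h:
    for line in in_file_h:
        out_file_h.write(line)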

--
messages: 403289
nosy: rhpvorderman
priority: normal
severity: normal
status: open
title: GzipFile.write should be buffered




[issue43613] gzip.compress and gzip.decompress are sub-optimally implemented.

2021-09-03 Thread Ruben Vorderman


Ruben Vorderman added the comment:

The issue was solved by moving code from _GzipReader to separate functions
while maintaining the same error structure. This solved the problem with
maximum code reuse and full backwards compatibility.

--




[issue43613] gzip.compress and gzip.decompress are sub-optimally implemented.

2021-09-03 Thread Ruben Vorderman


Change by Ruben Vorderman:


--
resolution:  -> fixed
stage: patch review -> resolved
status: open -> closed




[issue43612] zlib.compress should have a wbits argument

2021-09-03 Thread Ruben Vorderman


Ruben Vorderman added the comment:

Thanks for the review, Lukasz! It was fun to create the PR and to optimize
the performance of gzip.py as well.

--




[issue43612] zlib.compress should have a wbits argument

2021-08-25 Thread Ruben Vorderman


Change by Ruben Vorderman:


--
pull_requests: +26387
pull_request: https://github.com/python/cpython/pull/27941




[issue43613] gzip.compress and gzip.decompress are sub-optimally implemented.

2021-08-25 Thread Ruben Vorderman


Change by Ruben Vorderman:


--
keywords: +patch
pull_requests: +26386
stage:  -> patch review
pull_request: https://github.com/python/cpython/pull/27941




[issue43612] zlib.compress should have a wbits argument

2021-04-26 Thread Ruben Vorderman


Ruben Vorderman added the comment:

A patch was created, but has not been reviewed yet.

--




[issue43612] zlib.compress should have a wbits argument

2021-04-26 Thread Ruben Vorderman


Change by Ruben Vorderman:


--
versions: +Python 3.11




[issue43613] gzip.compress and gzip.decompress are sub-optimally implemented.

2021-03-25 Thread Ruben Vorderman


Ruben Vorderman added the comment:

I created bpo-43621 for the error issue. There should only be BadGzipFile. Once 
that is fixed, having only one error type will make it easier to implement some 
functions that are shared across the gzip.py codebase.

--




[issue43621] gzip._GzipReader should only throw BadGzipFile errors

2021-03-25 Thread Ruben Vorderman


Change by Ruben Vorderman:


--
type:  -> behavior




[issue43612] zlib.compress should have a wbits argument

2021-03-25 Thread Ruben Vorderman


Change by Ruben Vorderman:


--
components: +Extension Modules -Library (Lib)




[issue43621] gzip._GzipReader should only throw BadGzipFile errors

2021-03-25 Thread Ruben Vorderman


New submission from Ruben Vorderman:

This is properly documented:
https://docs.python.org/3/library/gzip.html#gzip.BadGzipFile

It now throws EOFErrors when a stream is truncated. But this means that
upstream both BadGzipFile and EOFError need to be caught in the exception
handling when opening a gzip file for reading. When a gzip file is
truncated it is also a "bad gzip file" in my opinion, so there is no reason
to have an extra class of errors.
Also, it throws zlib.error when zlib fails for some reason. That means
there is some corruption in the raw deflate block. Well, that makes it a
"bad gzip file" as well, and the error message should reflect that.

This won't break people's code. If they are already catching EOFError,
zlib.error and BadGzipFile, it changes nothing. If they only catch
BadGzipFile, they will have fewer annoying errors that pop through.

I can make the PR, but of course not without any feedback. I am curious
what other people think.
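
For illustration, the catch-all that callers currently need (a sketch; the
filename is made up):

import gzip
import zlib

try:
    with gzip.open("maybe_corrupt.gz", "rb") as fh:
        data = fh.read()
except (gzip.BadGzipFile, EOFError, zlib.error) as e:
    # Today all three can signal a damaged .gz file; the proposal is to
    # fold the latter two into BadGzipFile.
    print(f"bad gzip file: {e}")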

--
components: Library (Lib)
messages: 389494
nosy: rhpvorderman
priority: normal
severity: normal
status: open
title: gzip._GzipReader should only throw BadGzipFile errors
versions: Python 3.10, Python 3.6, Python 3.7, Python 3.8, Python 3.9




[issue43613] gzip.compress and gzip.decompress are sub-optimally implemented.

2021-03-24 Thread Ruben Vorderman


Change by Ruben Vorderman:


--
type:  -> performance




[issue43612] zlib.compress should have a wbits argument

2021-03-24 Thread Ruben Vorderman


Change by Ruben Vorderman:


--
type:  -> enhancement




[issue43612] zlib.compress should have a wbits argument

2021-03-24 Thread Ruben Vorderman


Change by Ruben Vorderman:


--
keywords: +patch
pull_requests: +23768
stage:  -> patch review
pull_request: https://github.com/python/cpython/pull/25011




[issue43613] gzip.compress and gzip.decompress are sub-optimally implemented.

2021-03-24 Thread Ruben Vorderman


New submission from Ruben Vorderman:

When working on python-isal, which aims to provide faster drop-in
replacements for the zlib and gzip modules, I found that gzip.compress and
gzip.decompress are suboptimally implemented, which hurts performance.

gzip.compress and gzip.decompress both do the following things:
- Instantiate a BytesIO object to mimic a file.
- Instantiate a GzipFile object to compress or read the file.

That means there is way more Python code involved than strictly necessary.
Also, the 'data' is already fully in memory, but the data is streamed
anyway. That is quite a waste.

I propose the following:
- The documentation should make it clear that zlib.decompress(..., wbits=31)
and zlib.compress(..., wbits=31) (after 43612 has been addressed) are both
quicker but come with caveats: zlib.compress can not set mtime, and
zlib.decompress does not take multimember gzip into account.
- For gzip.compress: the GzipFile._write_gzip_header function should be
moved to a module-wide _gzip_header function that returns a bytes object.
GzipFile._write_gzip_header can call this function, and gzip.compress can
also call it to create a header. gzip.compress then calls
zlib.compress(data, wbits=-15) (after 43612 has been fixed) to create a raw
deflate block. A gzip trailer can easily be created by calling
zlib.crc32(data) and len(data) & 0xffffffff and packing those into a
struct. See an example implementation here:
https://github.com/pycompression/python-isal/blob/v0.8.0/src/isal/igzip.py#L242
- For gzip.decompress it becomes quite a bit more involved. A
read_gzip_header function can be created, but the current implementation
returns EOFErrors if the header is incomplete due to a truncated file,
instead of BadGzipFile errors. This makes it harder to implement something
that is not a major break from the current gzip.decompress. Apart from the
header, the implementation is straightforward: do a while-true loop in
which all operations are performed. Validate the header and report where
the header ends. Create a zlib.decompressobj(wbits=-15). Decompress all the
data from the end of the header. Flush. Extract the crc and length from the
first 8 bytes of the unused data. Set data = decompobj.unused_data[8:], and
if not data: break. For a reference implementation check here:
https://github.com/pycompression/python-isal/blob/v0.8.0/src/isal/igzip.py#L300
Note that the decompress function itself is quite straightforward; checking
the header while maintaining backwards compatibility with gzip.decompress
is not so simple.

And that brings me to another point: should non-descriptive EOFErrors be
raised when reading the gzip header, or should informative BadGzipFile
errors be thrown while the header is parsed? I tend towards the latter: for
example BadGzipFile("Truncated header") instead of EOFError, or at least
EOFError("Truncated gzip header"). I am aware that this confounds this
issue with another issue, but these things are coupled in the
implementation, so both need to be solved at the same time.

Given the headaches that gzip.decompress gives, it might be easier to solve
gzip.compress in a first PR and do gzip.decompress later.
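
As a rough sketch of the approach described above (function names are
hypothetical; header parsing and crc/length validation are elided on the
decompress side):

import struct
import time
import zlib

def gzip_compress_sketch(data, compresslevel=9, mtime=None):
    if mtime is None:
        mtime = time.time()
    # Minimal 10-byte gzip header: magic, CM=8 (deflate), FLG=0, MTIME,
    # XFL=0, OS=255 (unknown).
    header = struct.pack(
        "<BBBBLBB", 0x1f, 0x8b, 8, 0, int(mtime) & 0xffffffff, 0, 255)
    # wbits=-15 yields a raw deflate block without zlib or gzip framing.
    compobj = zlib.compressobj(compresslevel, zlib.DEFLATED, -15)
    body = compobj.compress(data) + compobj.flush()
    # The gzip trailer: CRC32 and length modulo 2**32, little-endian.
    trailer = struct.pack("<LL", zlib.crc32(data), len(data) & 0xffffffff)
    return header + body + trailer

def gzip_decompress_sketch(data):
    result = bytearray()
    while True:  # loop to support multimember gzip files
        header_end = _parse_gzip_header(data)  # hypothetical helper
        decompobj = zlib.decompressobj(wbits=-15)
        result += decompobj.decompress(data[header_end:])
        result += decompobj.flush()
        crc, length = struct.unpack("<LL", decompobj.unused_data[:8])
        data = decompobj.unused_data[8:]
        if not data:
            return bytes(result)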

--
messages: 389438
nosy: rhpvorderman
priority: normal
severity: normal
status: open
title: gzip.compress and gzip.decompress are sub-optimally implemented.




[issue43612] zlib.compress should have a wbits argument

2021-03-24 Thread Ruben Vorderman


New submission from Ruben Vorderman:

zlib.compress can currently only be used to output zlib blocks. Arguably
`zlib.compress(my_data, level, wbits=-15)` is even more useful, as it gives
you a raw deflate block. That is quite interesting if you are writing your
own file format and want to use compression but would like to use a
different hash.

Also, gzip.compress(data, level, mtime) is extremely slow due to it
instantiating a GzipFile object which then streams a bytes object,
explicitly not taking advantage of the fact that the bytes object is
entirely in memory already (I will create another bug for this).
zlib.compress(my_data, level, wbits=31) should be faster in all possible
circumstances, but that option is not available now.
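
Until then, the same output is reachable through compressobj; a quick
illustration:

import zlib

data = b"some data"

# Raw deflate block (what zlib.compress(data, 1, wbits=-15) would give):
compobj = zlib.compressobj(1, zlib.DEFLATED, -15)
raw_deflate = compobj.compress(data) + compobj.flush()

# Gzip-framed block (what zlib.compress(data, 1, wbits=31) would give):
compobj = zlib.compressobj(1, zlib.DEFLATED, 31)
gzip_blob = compobj.compress(data) + compobj.flush()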

--
components: Library (Lib)
messages: 389437
nosy: rhpvorderman
priority: normal
severity: normal
status: open
title: zlib.compress should have a wbits argument
versions: Python 3.10




[issue43316] python -m gzip handles error incorrectly

2021-02-25 Thread Ruben Vorderman


Change by Ruben Vorderman:


--
keywords: +patch
pull_requests: +23432
stage:  -> patch review
pull_request: https://github.com/python/cpython/pull/24647




[issue43316] python -m gzip handles error incorrectly

2021-02-25 Thread Ruben Vorderman


Ruben Vorderman added the comment:

That sounds perfect, I didn't think of that. I will make a PR.

--




[issue43317] python -m gzip could use a larger buffer

2021-02-25 Thread Ruben Vorderman


Change by Ruben Vorderman:


--
keywords: +patch
pull_requests: +23430
stage:  -> patch review
pull_request: https://github.com/python/cpython/pull/24645




[issue43317] python -m gzip could use a larger buffer

2021-02-24 Thread Ruben Vorderman


Change by Ruben Vorderman:


--
type:  -> performance




[issue43316] python -m gzip handles error incorrectly

2021-02-24 Thread Ruben Vorderman


Change by Ruben Vorderman:


--
type:  -> behavior




[issue43317] python -m gzip could use a larger buffer

2021-02-24 Thread Ruben Vorderman

New submission from Ruben Vorderman:

python -m gzip reads in chunks of 1024 bytes:
https://github.com/python/cpython/blob/1f433406bd46fbd00b88223ad64daea6bc9eaadc/Lib/gzip.py#L599

This hurts performance somewhat. Using io.DEFAULT_BUFFER_SIZE will improve
it. Also, 'io.DEFAULT_BUFFER_SIZE' is better than
'ARBITRARY_NUMBER_WITH_NO_COMMENT_EXPLAINING_WHY'.
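
The change under discussion is essentially one line in the copy loop; a
self-contained sketch of the loop's shape (assumed from the linked source):

import io
import sys

def copy(in_fh, out_fh):
    # Bigger reads mean fewer Python-level calls per byte copied.
    while True:
        chunk = in_fh.read(io.DEFAULT_BUFFER_SIZE)  # was: in_fh.read(1024)
        if not chunk:
            break
        out_fh.write(chunk)

copy(sys.stdin.buffer, sys.stdout.buffer)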

With 1024-byte blocks:

Decompression:
$ hyperfine -r 10 -w 3 'cat ~/test/50reads.fastq.gz | ./prefix/bin/python3 -m gzip -d > /dev/null'
Benchmark #1: cat ~/test/50reads.fastq.gz | ./prefix/bin/python3 -m gzip -d > /dev/null
  Time (mean ± σ):   926.9 ms ± 7.7 ms  [User: 901.2 ms, System: 59.1 ms]
  Range (min … max): 913.3 ms … 939.4 ms  (10 runs)

Compression:
$ hyperfine -r 10 -w 3 'cat ~/test/50reads.fastq | ./prefix/bin/python3 -m gzip --fast > /dev/null'
Benchmark #1: cat ~/test/50reads.fastq | ./prefix/bin/python3 -m gzip --fast > /dev/null
  Time (mean ± σ):   2.514 s ± 0.030 s  [User: 2.469 s, System: 0.125 s]
  Range (min … max): 2.472 s … 2.563 s  (10 runs)

With io.DEFAULT_BUFFER_SIZE:

Decompression:
$ hyperfine -r 10 -w 3 'cat ~/test/50reads.fastq.gz | ./prefix/bin/python3 -m gzip -d > /dev/null'
Benchmark #1: cat ~/test/50reads.fastq.gz | ./prefix/bin/python3 -m gzip -d > /dev/null
  Time (mean ± σ):   839.9 ms ± 7.3 ms  [User: 816.0 ms, System: 57.3 ms]
  Range (min … max): 830.1 ms … 851.3 ms  (10 runs)

Compression:
$ hyperfine -r 10 -w 3 'cat ~/test/50reads.fastq | ./prefix/bin/python3 -m gzip --fast > /dev/null'
Benchmark #1: cat ~/test/50reads.fastq | ./prefix/bin/python3 -m gzip --fast > /dev/null
  Time (mean ± σ):   2.275 s ± 0.024 s  [User: 2.247 s, System: 0.096 s]
  Range (min … max): 2.254 s … 2.322 s  (10 runs)


Speedups:
- Decompression: 840 / 927 = 0.906, a ~9% reduction in runtime
- Compression: 2.275 / 2.514 = 0.905, a ~9% reduction in runtime

It is not stellar, but it is quite a nice improvement for such a tiny change.

--
components: Library (Lib)
messages: 387624
nosy: rhpvorderman
priority: normal
severity: normal
status: open
title: python -m gzip could use a larger buffer
versions: Python 3.10, Python 3.6, Python 3.7, Python 3.8, Python 3.9




[issue43316] python -m gzip handles error incorrectly

2021-02-24 Thread Ruben Vorderman


New submission from Ruben Vorderman:

`python -m gzip -d myfile` will throw an error because myfile does not end
in '.gz'. That is fair (even though a bit redundant: GzipFile contains a
header check, so why bother checking the extension?).

The problem is how this error is thrown:
1. The error is printed to stdout instead of stderr.
2. The tool exits with exit code 0.

This is not the behaviour that is expected when using python -m gzip in a
script.

The error is even codified in a test:
https://github.com/python/cpython/blob/1f433406bd46fbd00b88223ad64daea6bc9eaadc/Lib/test/test_gzip.py#L776

def test_decompress_infile_outfile_error(self):
    rc, out, err = assert_python_ok('-m', 'gzip', '-d', 'thisisatest.out')
    self.assertIn(b"filename doesn't end in .gz:", out)
    self.assertEqual(rc, 0)
    self.assertEqual(err, b'')

This should be assert_python_failure, out and err should be swapped, and
the exit code should be something other than 0.

From the Zen of Python: errors should never pass silently.

I am willing to fix this in a PR, but first I would like some feedback on how 
to solve this exactly. 

I propose raising a ValueError("can not determine output filename: 'myfile' 
does not end in '.gz'").

--
components: Library (Lib)
messages: 387622
nosy: rhpvorderman
priority: normal
severity: normal
status: open
title: python -m gzip handles error incorrectly
versions: Python 3.10, Python 3.6, Python 3.7, Python 3.8, Python 3.9




[issue41566] Include much faster DEFLATE implementations in Python's gzip and zlib libraries. (isa-l)

2020-09-15 Thread Ruben Vorderman


Ruben Vorderman added the comment:

Hi, thanks all for the comments and the help.

I have created the bindings using Cython. The project is still a work in 
progress as of this moment. I leave the link here for future reference.

Special thanks for the Cython developers for enabling these bindings.

https://github.com/rhpvorderman/python-isal

--
resolution:  -> third party
stage: needs patch -> resolved
status: open -> closed




[issue41566] Include much faster DEFLATE implementations in Python's gzip and zlib libraries. (isa-l)

2020-08-24 Thread Ruben Vorderman


Ruben Vorderman added the comment:

> If you take this route, please don't write it directly against the CPython
> C-API (as you would for a CPython stdlib module).

Thanks for reminding me of this. I was planning to take the laziest route
possible anyway, reusing as much code from CPython as possible.

I will probably start with a minimal implementation of these base classes
https://github.com/python/cpython/blob/master/Lib/_compression.py using
Cython, with the gzip implementation in CPython as guidance.

--




[issue41566] Include much faster DEFLATE implementations in Python's gzip and zlib libraries. (isa-l)

2020-08-20 Thread Ruben Vorderman


Ruben Vorderman added the comment:

> Within the stdlib, I'd focus only on using things that can be used in a 100% 
> api compatible way with the existing modules.

> Otherwise creating a new module and putting it up on PyPI to expose the 
> functionality from the libraries you want makes sense and will be easier to 
> make available to everyone on existing Python versions rather than waiting on 
> getting something into CPython.

Agreed. 100% backwards compatibility is more important than speed. And getting 
it available to users is faster as a module than implementing it in CPython. 

I already have a PR open in the xopen module to implement the use of the igzip 
program. Implementing a module like the gzip module in CPython will be more 
work, but I will certainly consider it. Thanks for the suggestion! 

> There is a caveat to using any of these: how well maintained and security 
> vetted are all of the code paths in the implementation?  zlib proper gets 
> massive security attention.  Its low rate of change and staleness are a 
> feature.

I didn't consider that. So I looked at the CVE page for zlib. The latest
issues are from 2017; before that, 2005. This is the 2017 report:
https://wiki.mozilla.org/images/0/09/Zlib-report.pdf
Note how it states that the old compiler support etc. is a basis for
vulnerabilities; zlib-ng got rid of precisely those parts. On the other
hand, Mozilla notes that zlib is a great candidate for periodic security
audits, precisely for the same reasons you mention.

> FWIW I tend to avoid software provided by Intel given any other choice.

I know the feeling. They rightfully have a very bad reputation for things they 
did in the past. But this particular code is open source and compilable with 
open source compilers. Part of it is written in Assembly, to get the speed 
advantage. I benchmarked it on my AMD processor and I too get enormous speed 
advantages out of it.

>  even less about Intels self serving arm-ignoring oddity.

They *do* provide instructions to build for arm. Right on their README. 
https://github.com/intel/isa-l#building-isa-l. I think it is very unfair to be 
so dismissive just because Intel pays the developers. This is good work, which 
speeds up bioinformatics workloads, which in turn helps us to help more 
patients.

On the whole I think the arguments to make a module are very strong. So I think 
that is the appropriate way forward. I'd love everyone to switch to more 
efficient deflate algorithms, but CPython may not be the right project to drive 
this change. At least this discussion is now here as a reference for other 
people who are curious about improving this performance aspect.

--




[issue41586] Allow to set pipe size on subprocess.Popen.

2020-08-19 Thread Ruben Vorderman


Change by Ruben Vorderman:


--
keywords: +patch
pull_requests: +21035
stage:  -> patch review
pull_request: https://github.com/python/cpython/pull/21921




[issue41586] Allow to set pipe size on subprocess.Popen.

2020-08-18 Thread Ruben Vorderman


New submission from Ruben Vorderman:

Pipes block when reading from an empty pipe or writing to a full pipe. When
this happens, the program waiting on the pipe still uses a lot of CPU
cycles while waiting for the pipe to stop blocking.

I found this while working with xopen, a library that pipes data into an
external gzip process. (This is more efficient than using Python's gzip
module, because the subprocess escapes the GIL, so your main algorithm can
fully utilize one CPU core while the compression is offloaded to another.)

It turns out that increasing the pipe size on Linux from the default of
64KB to the maximum allowed pipe size in /proc/sys/fs/pipe-max-size
(1024KB) drastically improves performance:
https://github.com/marcelm/xopen/issues/35. TLDR: full utilization of CPU
cores, a 40%+ decrease in wall-clock time, and a 20% decrease in the number
of compute seconds (indicating that 20% was wasted waiting on blocking
pipes).

However, doing this with subprocess is quite involved as it is now:

1. You have to find out which constants to use in fcntl for setting the
pipe size (these constants are not in Python).
2. You have to start the Popen process with stdout routed to
subprocess.PIPE.
3. You have to get my_popen_process.stdout.fileno().
4. You have to use fcntl.fcntl to modify the pipe size.

It would be much easier to do, for example,
`subprocess.Popen(args, pipesize=1024 * 1024)`. A sketch of the current
workaround follows below.

I am currently working on a PR implementing this. It will also make
F_GETPIPE_SZ and F_SETPIPE_SZ available in the fcntl module.
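
The manual route today looks roughly like this (a sketch; the constant
values are Linux-specific, taken from <linux/fcntl.h>):

import fcntl
import subprocess

# Not yet exposed by the fcntl module.
F_SETPIPE_SZ = 1031
F_GETPIPE_SZ = 1032

proc = subprocess.Popen(
    ["gzip", "-c"], stdin=subprocess.PIPE, stdout=subprocess.PIPE)
# Grow the pipe feeding gzip from the 64KB default to 1MB.
fcntl.fcntl(proc.stdin.fileno(), F_SETPIPE_SZ, 1024 * 1024)
print(fcntl.fcntl(proc.stdin.fileno(), F_GETPIPE_SZ))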

--
components: Library (Lib)
messages: 375636
nosy: rhpvorderman
priority: normal
severity: normal
status: open
title: Allow to set pipe size on subprocess.Popen.
versions: Python 3.10




[issue41566] Include much faster DEFLATE implementations in Python's gzip and zlib libraries. (isa-l)

2020-08-18 Thread Ruben Vorderman


Ruben Vorderman added the comment:

I just found out that libdeflate does not support streaming:
https://github.com/ebiggers/libdeflate/issues/73 . I should have read the
manual better.

So that explains the memory usage. Because of that I don't think it is
suitable for use in CPython.

--

___
Python tracker 
<https://bugs.python.org/issue41566>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue41566] Include much faster DEFLATE implementations in Python's gzip and zlib libraries. (isa-l)

2020-08-18 Thread Ruben Vorderman


Ruben Vorderman added the comment:

> That might be an option then. CPython could use the existing library if
> it is available.

Dynamic linking indeed seems like a great option here! Users who care about
this will probably have the 'isal' and 'libdeflate' packages installed on
their machine anyway. I packaged isa-l on conda-forge, so it is available
via the anaconda/miniconda installer.

--




[issue41566] Include much faster DEFLATE implementations in Python's gzip and zlib libraries. (isa-l)

2020-08-17 Thread Ruben Vorderman


Ruben Vorderman added the comment:

nasm or yasm will work. I only have experience building it with nasm.

But yes, that is indeed a dependency. Personally I do not see the problem
with adding nasm as a build dependency, as it opens up possibilities for
even more performance optimizations in Python by moving parts of the code
to assembly. But I can imagine that there might be complications with
updating the build system.

Libdeflate does not have this problem, as it is entirely in C. So it could
be interesting to use that. I think the unfortunate use of memory is due to
the libdeflate-gzip coding, and not necessarily because of the library.
Samtools is a tool that uses the libdeflate library to great effect, and
its memory usage is fine.

> No, the bug tracker seems fine for this.

Well, one thing is that libdeflate and isa-l use different compression
ratios for the levels: isa-l does not support anything above 3, while
libdeflate supports 1-12. So it is not a one-to-one mapping.
--




[issue41566] Include much faster DEFLATE implementations in Python's gzip and zlib libraries. (isa-l)

2020-08-17 Thread Ruben Vorderman


Ruben Vorderman added the comment:

This has to be in a PEP. I am sorry I misplaced it on the bug tracker.

--
resolution:  -> not a bug
stage:  -> resolved
status: open -> closed




[issue41566] Include much faster DEFLATE implementations in Python's gzip and zlib libraries. (isa-l)

2020-08-17 Thread Ruben Vorderman


New submission from Ruben Vorderman:

The gzip file format is quite ubiquitous, and so is its first (?)
free/libre implementation, zlib, with the gzip command line tool. This uses
the DEFLATE algorithm.

Lately some faster algorithms (most notably zstd) have popped up which have
better speed and compression ratios than zlib. Unfortunately switching over
to zstd will not be seamless: it is not compatible with zlib/gzip in any
way.

Luckily some developers have tried to implement DEFLATE in a faster way,
most notably libdeflate (https://github.com/ebiggers/libdeflate) and
Intel's storage acceleration library (https://github.com/intel/isa-l).

These libraries provide the libdeflate-gzip and igzip utilities
respectively. These can compress and decompress the same gzip files: an
igzip-compressed file can be read with gzip, and vice versa.

To give an idea of the speed improvements that can be obtained, here are
some benchmarks. All benchmarks were done using hyperfine
(https://github.com/sharkdp/hyperfine). The system was a Ryzen 5 3600 with
2x16GB DDR4-3200 memory, running Debian 10. All benchmarks were performed
on a tmpfs, which lives in memory, to prevent IO bottlenecks. The test file
was a 5 million read FASTQ file of 1.6 GB
(https://en.wikipedia.org/wiki/FASTQ_format). These types of files are
common in bioinformatics at 100+ GB sizes, so they are a good real-world
benchmark.

I benchmarked pigz on one thread as well, as it implements zlib but in a
faster way than gzip. zstd was benchmarked as a comparison.

Versions: 
gzip 1.9 (provided by debian)
pigz 2.4 (provided by debian)
igzip 2.25.0 (provided by debian)
libdeflate-gzip 1.6 (compiled by conda-build with the recipe here: 
https://github.com/conda-forge/libdeflate-feedstock/pull/4)
zstd 1.3.8 (provided by debian)

By default level 1 is chosen for all compression benchmarks. Time is average 
over 10 runs.

COMPRESSION
program            time          size   memory
gzip               23.5 seconds  657M   1.5M
pigz (one thread)  22.2 seconds  658M   2.4M
libdeflate-gzip    10.1 seconds  623M   1.6G (reads entire file in memory)
igzip              4.6 seconds   620M   3.5M
zstd (to .zst)     6.1 seconds   584M   12.1M

Decompression. All programs decompressed the file created using gzip -1.
(Even zstd, which can also decompress gzip.)

DECOMPRESSION
program            time          memory
gzip               10.5 seconds  744K
pigz (one thread)  6.7 seconds   1.2M
libdeflate-gzip    3.6 seconds   2.2G (reads in mem before writing)
igzip              3.3 seconds   3.6M
zstd (from .gz)    6.4 seconds   2.2M
zstd (from .zst)   2.3 seconds   3.1M

As shown by the benchmarks above, using Intel's Storage Acceleration
Libraries may improve performance quite substantially, offering very fast
compression and decompression. This gets igzip into the zstd ballpark in
terms of speed while still offering backwards compatibility with gzip.

Intel's Storage Acceleration Libraries (isa-l) come with a BSD-3-Clause
license, so there should be no licensing issues when using that code inside
of CPython.

--
components: Library (Lib)
messages: 375533
nosy: rhpvorderman
priority: normal
severity: normal
status: open
title: Include much faster DEFLATE implementations in Python's gzip and zlib 
libraries. (isa-l)
versions: Python 3.10
