Bug#874321: backtraces from generic extractor: "Compressed file ended before the end-of-stream marker was reached"

Joey Hess Tue, 24 Oct 2017 15:27:33 -0700

Rogério Brito wrote:
> I believe that you meant to file this as a Python bug and I think that the
> severity is, quite frankly, lower than normal...


I don't think this is a Python bug. It's reasonable for Python's gzip
library to fail when presented with corrupted data. It does not know
it's being used to download a URL. Perhaps it should have a mode where
it tries to extract as much data as it can, in case its caller wants to
try to be robust.
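
A rough sketch of what such a tolerant mode might look like, assuming
Python 3. The helper name salvage_gzip and its chunk_size parameter are
made up for illustration; nothing like this exists in the gzip module.
It drives zlib directly instead and keeps whatever was decoded before
the input ran out or turned out to be corrupt:

```python
import gzip
import zlib


def salvage_gzip(data, chunk_size=4096):
    """Decompress as much of a gzip stream as possible (hypothetical helper).

    Returns (recovered_bytes, complete). complete is True only if the
    end-of-stream marker was reached.
    """
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
    recovered = []
    for start in range(0, len(data), chunk_size):
        try:
            recovered.append(decomp.decompress(data[start:start + chunk_size]))
        except zlib.error:
            # Corrupt input: keep what was decoded from earlier chunks.
            break
        if decomp.eof:
            break
    return b"".join(recovered), decomp.eof


# Simulate a download that was cut off halfway through.
original = b"hello world " * 1000
blob = gzip.compress(original)
partial, complete = salvage_gzip(blob[:len(blob) // 2])
```

Here complete comes back False and partial holds only a prefix of the
original data, so a caller can decide whether a partial page is good
enough.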

I think this is a bug in youtube-dl though, because of this code:

std_headers = {
...
    'Accept-Encoding': 'gzip, deflate',
}

        if resp.headers.get('Content-encoding', '') == 'gzip':
            content = resp.read()
            gz = gzip.GzipFile(fileobj=io.BytesIO(content), mode='rb')
            try:
                uncompressed = io.BytesIO(gz.read())
            except IOError as original_ioerror:
                # There may be junk add the end of the file
                # See http://stackoverflow.com/q/4928560/35070 for details
                for i in range(1, 1024):
                    try:
                        gz = gzip.GzipFile(fileobj=io.BytesIO(content[:-i]), mode='rb')
                        uncompressed = io.BytesIO(gz.read())
                    except IOError:
                        continue
                    break
                else:
                    raise original_ioerror

It's encouraging gzip to be used (rather than deflate or no compression),
and it already contains workarounds for similar problems. This code
smells.

There is probably a Python library that implements this robustly.
I tried python-urllib3:

joey@darkstar:~>python
Python 2.7.14 (default, Sep 17 2017, 18:50:44) 
[GCC 7.2.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib3
>>> http = urllib3.PoolManager()
>>> headers = {'Accept-Encoding': 'gzip'}
>>> r = http.request('GET', 'http://www.debian.org/', headers=headers)
>>> r.headers.get("Content-Encoding")
'gzip'
>>> len(r.data)
14871

So that seems to work. I think that's because it uses zlib to decompress
the data, rather than the gzip module.
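
For illustration, here is a minimal sketch of decoding a gzip body with
zlib directly, assuming Python 3. The function name decode_gzip_body is
hypothetical, and I have not checked that this matches urllib3's code
line for line. The point is that zlib stops at the end-of-stream marker
and hands back any trailing junk separately, so no trial-and-error
trimming loop is needed:

```python
import gzip
import zlib


def decode_gzip_body(body):
    """Decode a gzip-encoded body with zlib (hypothetical helper).

    Returns (data, trailing), where trailing is whatever bytes followed
    the end of the gzip stream.
    """
    # 16 + MAX_WBITS selects gzip framing rather than raw zlib framing.
    decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
    data = decomp.decompress(body)
    return data, decomp.unused_data


# A valid gzip stream followed by junk, as in the case the workaround
# quoted above is meant to handle.
body = gzip.compress(b"<html>...</html>") + b"\x00junk"
data, trailing = decode_gzip_body(body)
```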

-- 
see shy jo
