[ https://issues.apache.org/jira/browse/BEAM-8168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948793#comment-16948793 ]
Kenneth Knowles commented on BEAM-8168:
---------------------------------------

[~chamikara] do you know about this?

> Python GCSFileSystem failing with gzip content encoding
> -------------------------------------------------------
>
>                 Key: BEAM-8168
>                 URL: https://issues.apache.org/jira/browse/BEAM-8168
>             Project: Beam
>          Issue Type: Bug
>          Components: io-py-gcp
>    Affects Versions: 2.15.0
>            Reporter: Daniel Ecer
>            Priority: Major
>
> Google Storage supports gzip content encoding.
> Apache Beam (Python) works correctly with .gz files that have no content
> encoding. However, it fails to handle .gz files that have content encoding
> applied.
> E.g. (the following would work when run in a Jupyter notebook):
> {code:python}
> file_url_1 = 'gs://some-bucket/test1.gz'
> file_url_2 = 'gs://some-bucket/test2.gz'
> !echo 'my content' > /tmp/test
> # file 1 without content encoding
> !cat /tmp/test | gzip | gsutil cp - "{file_url_1}"
> # file 2 with content encoding
> !gsutil cp -Z /tmp/test "{file_url_2}"
> !gsutil cat "{file_url_1}" | zcat -
> # output: my content
> !gsutil cat "{file_url_2}" | zcat -
> # output: my content
> import apache_beam as beam
> from apache_beam.io.filesystem import CompressionTypes
> from apache_beam.io.filesystems import FileSystems
> print(beam.__version__)
> # output: 2.15.0
> with FileSystems.open(file_url_1,
>                       compression_type=CompressionTypes.UNCOMPRESSED) as fp:
>     print(fp.read(10))
> # output: b'\x1f\x8b\x08\x00\x10\xd6r]\x00\x03'
> with FileSystems.open(file_url_1) as fp:
>     print(fp.read(10))
> # output: b'my content'
> with FileSystems.open(file_url_2,
>                       compression_type=CompressionTypes.UNCOMPRESSED) as fp:
>     print(fp.read(10))
> # output: b'my content'
> # (here I would expect the gzipped bytes)
> with FileSystems.open(file_url_2) as fp:
>     print(fp.read(10))
> # exception: FailedToDecompressContent: Content purported to be compressed
> # with gzip but failed to decompress.
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
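
For anyone reproducing the report: the first output in the quoted snippet starts with the gzip magic bytes (`1f 8b`, followed by `08` for the deflate method), while the output for the content-encoded file is already plain text. Below is a minimal sketch of a check that tells the two cases apart from the first bytes read. `looks_gzipped` is a hypothetical helper written for illustration, not part of Beam or any GCS client, and the sketch uses only the standard library.

```python
import gzip

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def looks_gzipped(head: bytes) -> bool:
    """Return True if the given leading bytes look like a gzip stream.

    Hypothetical helper for illustration only; not a Beam API.
    """
    return head[:2] == GZIP_MAGIC


# First 10 bytes reported for file 1 read with CompressionTypes.UNCOMPRESSED.
raw_head = b"\x1f\x8b\x08\x00\x10\xd6r]\x00\x03"
# First 10 bytes reported for file 2 read with CompressionTypes.UNCOMPRESSED.
plain_head = b"my content"
# A locally compressed payload, to show the check on freshly gzipped data.
local_head = gzip.compress(b"my content\n")[:10]

print(looks_gzipped(raw_head))
print(looks_gzipped(plain_head))
print(looks_gzipped(local_head))
```

This only classifies bytes that were already read; it does not explain or work around the `FailedToDecompressContent` exception itself.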