[ 
https://issues.apache.org/jira/browse/BEAM-8168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948793#comment-16948793
 ] 

Kenneth Knowles commented on BEAM-8168:
---------------------------------------

[~chamikara] do you know about this?

> Python GCSFileSystem failing with gzip content encoding
> -------------------------------------------------------
>
>                 Key: BEAM-8168
>                 URL: https://issues.apache.org/jira/browse/BEAM-8168
>             Project: Beam
>          Issue Type: Bug
>          Components: io-py-gcp
>    Affects Versions: 2.15.0
>            Reporter: Daniel Ecer
>            Priority: Major
>
> Google Cloud Storage supports gzip content encoding.
>  
> Apache Beam (Python) correctly handles .gz files stored without content 
> encoding, but it fails to handle .gz files that have gzip content encoding 
> applied.
> For example (the following can be run in a Jupyter notebook):
> {code:python}
> file_url_1 = 'gs://some-bucket/test1.gz'
> file_url_2 = 'gs://some-bucket/test2.gz'
> !echo 'my content' > /tmp/test
> # file 1 without content encoding
> !cat /tmp/test | gzip | gsutil cp - "{file_url_1}"
> # file 2 with content encoding
> !gsutil cp -Z /tmp/test "{file_url_2}"
> !gsutil cat "{file_url_1}" | zcat -
> # output: my content
> !gsutil cat "{file_url_2}" | zcat -
> # output: my content
> import apache_beam as beam
> from apache_beam.io.filesystem import CompressionTypes
> from apache_beam.io.filesystems import FileSystems
> print(beam.__version__)
> # output: 2.15.0
> with FileSystems.open(file_url_1, compression_type=CompressionTypes.UNCOMPRESSED) as fp:
>     print(fp.read(10))
> # output: b'\x1f\x8b\x08\x00\x10\xd6r]\x00\x03'
> with FileSystems.open(file_url_1) as fp:
>     print(fp.read(10))
> # output: b'my content'
> with FileSystems.open(file_url_2, compression_type=CompressionTypes.UNCOMPRESSED) as fp:
>     print(fp.read(10))
> # output: b'my content'
> # (here I would expect the raw gzipped bytes)
> with FileSystems.open(file_url_2) as fp:
>     print(fp.read(10))
> # exception: FailedToDecompressContent: Content purported to be compressed with gzip but failed to decompress.
> {code}
>  
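A possible interim workaround, sketched here under assumptions rather than taken from the Beam codebase: read the object with `CompressionTypes.UNCOMPRESSED` (as in the report above) and decompress only when the bytes actually start with the gzip magic number. The helper name `read_maybe_gzipped` is hypothetical, and the sketch uses only the Python standard library, so it can be checked without a bucket.

```python
import gzip

# Every gzip stream starts with these two bytes (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def read_maybe_gzipped(raw: bytes) -> bytes:
    """Hypothetical helper: decompress `raw` only if it looks like gzip.

    Bytes that were already decompressed in transit (the test2.gz case in
    the report) are returned unchanged.
    """
    if raw[:2] == GZIP_MAGIC:
        return gzip.decompress(raw)
    return raw


# Local stand-ins for the two objects in the report.
plain = b"my content\n"
still_compressed = gzip.compress(plain)  # like test1.gz read as UNCOMPRESSED
already_decoded = plain                  # like test2.gz read as UNCOMPRESSED

result_1 = read_maybe_gzipped(still_compressed)
result_2 = read_maybe_gzipped(already_decoded)
```

With Beam, the `raw` argument would be the bytes returned by `fp.read()` on a file opened with `compression_type=CompressionTypes.UNCOMPRESSED`. This reads the whole object into memory, so it is only suitable for small files.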



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
