[ 
https://issues.apache.org/jira/browse/BEAM-8168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16948993#comment-16948993
 ] 

Chamikara Madhusanka Jayalath commented on BEAM-8168:
-----------------------------------------------------

Yeah, this is a known issue. See https://issues.apache.org/jira/browse/BEAM-1874

BTW, why do you need to set Content-Encoding for these files? GCS tries to 
automatically decompress such files when serving them [1], and Beam Python 
does not handle this properly (we've also run into data loss issues in Java 
for such files previously). It's much safer (and most probably sufficient) to 
store the files as plain gzip (content-type: gzip), without Content-Encoding, 
and let Beam decompress them when reading.

[1] https://cloud.google.com/storage/docs/transcoding
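
To illustrate the mismatch: a reader that expects gzip bytes breaks when the 
store has already decompressed the object on the way out. Below is a minimal 
stdlib-only sketch of a tolerant reader that checks for the gzip magic number 
before decompressing. The helper name read_maybe_gzipped is hypothetical and 
not part of Beam or the GCS client; this is not how Beam handles it, only an 
illustration of the two cases from the report.

```python
import gzip

# Every gzip stream starts with these two bytes (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def read_maybe_gzipped(raw: bytes) -> bytes:
    """Hypothetical helper: decompress only if the payload looks like gzip.

    Assumption: payloads that are not gzip never start with the magic bytes.
    Arbitrary binary data could violate this, so treat it as a heuristic.
    """
    if raw[:2] == GZIP_MAGIC:
        return gzip.decompress(raw)
    return raw


payload = b"my content\n"

# Case 1: object stored as plain gzip, bytes arrive still compressed.
stored_as_gzip = gzip.compress(payload)

# Case 2: object stored with gzip content encoding, bytes arrive
# already decompressed by the server.
already_transcoded = payload

assert read_maybe_gzipped(stored_as_gzip) == payload
assert read_maybe_gzipped(already_transcoded) == payload
```

Sniffing the content like this is only a workaround; storing the files as 
plain gzip without content encoding avoids the ambiguity entirely.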

> Python GCSFileSystem failing with gzip content encoding
> -------------------------------------------------------
>
>                 Key: BEAM-8168
>                 URL: https://issues.apache.org/jira/browse/BEAM-8168
>             Project: Beam
>          Issue Type: Bug
>          Components: io-py-gcp
>    Affects Versions: 2.15.0
>            Reporter: Daniel Ecer
>            Priority: Major
>
> Google Storage supports gzip content encoding.
>  
> Apache Beam (Python) works correctly with .gz files that have no content 
> encoding, but it fails to handle .gz files that have content encoding 
> applied, e.g. (the following can be run in a Jupyter notebook):
> {code:python}
> file_url_1 = 'gs://some-bucket/test1.gz'
> file_url_2 = 'gs://some-bucket/test2.gz'
> !echo 'my content' > /tmp/test
> # file 1 without content encoding
> !cat /tmp/test | gzip | gsutil cp - "{file_url_1}"
> # file 2 with content encoding
> !gsutil cp -Z /tmp/test "{file_url_2}"
> !gsutil cat "{file_url_1}" | zcat -
> # output: my content
> !gsutil cat "{file_url_2}" | zcat -
> # output: my content
> import apache_beam as beam
> from apache_beam.io.filesystem import CompressionTypes
> from apache_beam.io.filesystems import FileSystems
> print(beam.__version__)
> # output: 2.15.0
> with FileSystems.open(
>         file_url_1, compression_type=CompressionTypes.UNCOMPRESSED) as fp:
>     print(fp.read(10))
> # output: b'\x1f\x8b\x08\x00\x10\xd6r]\x00\x03'
> with FileSystems.open(file_url_1) as fp:
>     print(fp.read(10))
> # output: b'my content'
> with FileSystems.open(
>         file_url_2, compression_type=CompressionTypes.UNCOMPRESSED) as fp:
>     print(fp.read(10))
> # output: b'my content'
> # (here I would expect the raw gzipped bytes)
> with FileSystems.open(file_url_2) as fp:
>     print(fp.read(10))
> # exception: FailedToDecompressContent: Content purported to be compressed 
> # with gzip but failed to decompress.
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
