[ 
https://issues.apache.org/jira/browse/BEAM-167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15225122#comment-15225122
 ] 

Daniel Halperin commented on BEAM-167:
--------------------------------------

Thanks Eugene!

> TextIO can't read concatenated gzip files
> -----------------------------------------
>
>                 Key: BEAM-167
>                 URL: https://issues.apache.org/jira/browse/BEAM-167
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-java-extensions
>            Reporter: Eugene Kirpichov
>            Assignee: Luke Cwik
>
> $ cat <<END > header.csv
> a,b,c
> END
> $ cat <<END > body.csv
> 1,2,3
> 4,5,6
> 7,8,9
> END
> $ gzip -c header.csv > file.gz
> $ gzip -c body.csv >> file.gz
> The file is well-formed:
> $ gzip -dc file.gz
> a,b,c
> 1,2,3
> 4,5,6
> 7,8,9
> However, TextIO.Read.from("/path/to/file.gz") reads only "a,b,c". This is 
> reproducible even when the file is on local disk and with the 
> DirectPipelineRunner.
> The bug is in CompressedSource. It uses GzipCompressorInputStream, which by 
> default reads only the first gzip stream in the file, but has an option to 
> read all of them. Previously (in Dataflow SDK 1.4.0) we used GZIPInputStream 
> which reads all streams.
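
For illustration, the multi-stream behaviour described above can be sketched
with only the JDK. This is a minimal sketch, not the CompressedSource code:
the class and method names (ConcatenatedGzipDemo, gzip, concat, gunzipAll) are
hypothetical, and it uses java.util.zip.GZIPInputStream, which continues into
the following gzip members of a concatenated file. Commons Compress offers the
analogous opt-in through the two-argument constructor
GzipCompressorInputStream(InputStream, boolean decompressConcatenated);
whether the eventual fix uses that constructor is not stated in this message.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class; not part of Beam or the Dataflow SDK.
public class ConcatenatedGzipDemo {

  /** Compresses the text into one standalone gzip member. */
  static byte[] gzip(String text) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
      gz.write(text.getBytes(StandardCharsets.UTF_8));
    }
    return out.toByteArray();
  }

  /** Appends one byte array to another, like `gzip -c body.csv >> file.gz`. */
  static byte[] concat(byte[] first, byte[] second) {
    byte[] joined = new byte[first.length + second.length];
    System.arraycopy(first, 0, joined, 0, first.length);
    System.arraycopy(second, 0, joined, first.length, second.length);
    return joined;
  }

  /** Decompresses every gzip member in the data, not just the first. */
  static String gunzipAll(byte[] data) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
      byte[] buffer = new byte[4096];
      int n;
      while ((n = in.read(buffer)) != -1) {
        out.write(buffer, 0, n);
      }
    }
    return new String(out.toByteArray(), StandardCharsets.UTF_8);
  }

  public static void main(String[] args) throws IOException {
    // Same contents as header.csv and body.csv in the report.
    byte[] file = concat(gzip("a,b,c\n"), gzip("1,2,3\n4,5,6\n7,8,9\n"));
    System.out.print(gunzipAll(file));
  }
}
```

A reader that stops after the first member would return only the header line
for the same input, which matches the symptom reported for TextIO.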



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
