[ https://issues.apache.org/jira/browse/BEAM-167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15225122#comment-15225122 ]
Daniel Halperin commented on BEAM-167:
--------------------------------------

Thanks Eugene!

> TextIO can't read concatenated gzip files
> -----------------------------------------
>
>                 Key: BEAM-167
>                 URL: https://issues.apache.org/jira/browse/BEAM-167
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-java-extensions
>            Reporter: Eugene Kirpichov
>            Assignee: Luke Cwik
>
> $ cat <<END > header.csv
> a,b,c
> END
> $ cat <<END > body.csv
> 1,2,3
> 4,5,6
> 7,8,9
> END
> $ gzip -c header.csv > file.gz
> $ gzip -c body.csv >> file.gz
> The file is well-formed:
> $ gzip -dc file.gz
> a,b,c
> 1,2,3
> 4,5,6
> 7,8,9
> However, TextIO.Read.from("/path/to/file.gz") will read only "a,b,c" -
> reproducible even when the file is on local disk and with the
> DirectPipelineRunner.
> The bug is in CompressedSource. It uses GzipCompressorInputStream, which by
> default reads only the first gzip stream in the file, but has an option to
> read all of them. Previously (in Dataflow SDK 1.4.0) we used GZIPInputStream
> which reads all streams.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
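
For illustration, the behaviour described in the issue can be reproduced without Beam. The following is a minimal sketch, not the actual CompressedSource code: it builds the same two-stream file in memory and reads it back with java.util.zip.GZIPInputStream, which the issue says reads all streams. The class name ConcatenatedGzipDemo and its helper methods are hypothetical. The option the issue alludes to is presumably the Commons Compress constructor GzipCompressorInputStream(InputStream, boolean decompressConcatenated); whether the eventual fix uses it is not stated in this message.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class; not part of Beam or the Dataflow SDK.
class ConcatenatedGzipDemo {

  // Compress text into one complete gzip stream, like `gzip -c`.
  static byte[] gzip(String text) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
      out.write(text.getBytes(StandardCharsets.UTF_8));
    }
    return bytes.toByteArray();
  }

  // Append one gzip stream to another, like `>> file.gz`.
  static byte[] concat(byte[] first, byte[] second) {
    byte[] all = new byte[first.length + second.length];
    System.arraycopy(first, 0, all, 0, first.length);
    System.arraycopy(second, 0, all, first.length, second.length);
    return all;
  }

  // Decompress with GZIPInputStream and collect the decoded lines.
  static List<String> readLines(byte[] gz) throws IOException {
    List<String> lines = new ArrayList<>();
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(
        new GZIPInputStream(new ByteArrayInputStream(gz)), StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        lines.add(line);
      }
    }
    return lines;
  }

  public static void main(String[] args) throws IOException {
    // Same contents as header.csv and body.csv in the issue.
    byte[] file = concat(gzip("a,b,c\n"), gzip("1,2,3\n4,5,6\n7,8,9\n"));
    for (String line : readLines(file)) {
      System.out.println(line);
    }
  }
}
```

Run as-is, this should print the header row followed by the three body rows, matching the output of gzip -dc file.gz above. A reader that stops after the first gzip stream would return only the header row.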