[ https://issues.apache.org/jira/browse/BEAM-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16060219#comment-16060219 ]
Guillermo RodrÃguez Cano commented on BEAM-2490: ------------------------------------------------ Thanks [~altay]. My bad, I did not notice that till now. I tested it again and indeed it was that. > ReadFromText function is not taking all data with glob operator (*) > -------------------------------------------------------------------- > > Key: BEAM-2490 > URL: https://issues.apache.org/jira/browse/BEAM-2490 > Project: Beam > Issue Type: Bug > Components: sdk-py > Affects Versions: 2.0.0 > Environment: Usage with Google Cloud Platform: Dataflow runner > Reporter: Olivier NGUYEN QUOC > Assignee: Chamikara Jayalath > Fix For: 2.1.0 > > > I run a very simple pipeline: > * Read my files from Google Cloud Storage > * Split with '\n' char > * Write in on a Google Cloud Storage > I have 8 files that match with the pattern: > * my_files_2016090116_20160902_060051_xxxxxxxxxx.csv.gz (229.25 MB) > * my_files_2016090117_20160902_060051_xxxxxxxxxx.csv.gz (184.1 MB) > * my_files_2016090118_20160902_060051_xxxxxxxxxx.csv.gz (171.73 MB) > * my_files_2016090119_20160902_060051_xxxxxxxxxx.csv.gz (151.34 MB) > * my_files_2016090120_20160902_060051_xxxxxxxxxx.csv.gz (129.69 MB) > * my_files_2016090121_20160902_060051_xxxxxxxxxx.csv.gz (151.7 MB) > * my_files_2016090122_20160902_060051_xxxxxxxxxx.csv.gz (346.46 MB) > * my_files_2016090122_20160902_060051_xxxxxxxxxx.csv.gz (222.57 MB) > This code should take them all: > {code:python} > beam.io.ReadFromText( > "gs://XXXX_folder1/my_files_20160901*.csv.gz", > skip_header_lines=1, > compression_type=beam.io.filesystem.CompressionTypes.GZIP > ) > {code} > It runs well but there is only a 288.62 MB file in output of this pipeline > (instead of a 1.5 GB file). > The whole pipeline code: > {code:python} > data = (p | 'ReadMyFiles' >> beam.io.ReadFromText( > "gs://XXXX_folder1/my_files_20160901*.csv.gz", > skip_header_lines=1, > compression_type=beam.io.filesystem.CompressionTypes.GZIP > ) > | 'SplitLines' >> beam.FlatMap(lambda x: x.split('\n')) > ) > output = ( > data| "Write" >> beam.io.WriteToText('gs://XXX_folder2/test.csv', > num_shards=1) > ) > {code} > Dataflow indicates me that the estimated size of the output after the > ReadFromText step is 602.29 MB only, which not correspond to any unique input > file size nor the overall file size matching with the pattern. -- This message was sent by Atlassian JIRA (v6.4.14#64029)