[ https://issues.apache.org/jira/browse/BEAM-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16103390#comment-16103390 ]
Chamikara Jayalath commented on BEAM-2490: ------------------------------------------ To clarify, you are saying that performance is too slow when using DirectRunner (and that you cannot complete your experiment due to that) not that you are observing loss of data when dunning at HEAD, right ? Performance issue is captured by https://issues.apache.org/jira/browse/BEAM-2531 and I hope to look into it in the near future. You should be able to complete to experiment using DataflowRunner. Did you specify option --sdk_location <your apache-beam-2.2.0.dev0.tar.gz> when running with DataflowRunner ? To build the tar.gz file use the following command. cd beam/sdks/python python setup.py sdist > ReadFromText function is not taking all data with glob operator (*) > -------------------------------------------------------------------- > > Key: BEAM-2490 > URL: https://issues.apache.org/jira/browse/BEAM-2490 > Project: Beam > Issue Type: Bug > Components: sdk-py > Affects Versions: 2.0.0 > Environment: Usage with Google Cloud Platform: Dataflow runner > Reporter: Olivier NGUYEN QUOC > Assignee: Chamikara Jayalath > Fix For: Not applicable > > > I run a very simple pipeline: > * Read my files from Google Cloud Storage > * Split with '\n' char > * Write in on a Google Cloud Storage > I have 8 files that match with the pattern: > * my_files_2016090116_20160902_060051_xxxxxxxxxx.csv.gz (229.25 MB) > * my_files_2016090117_20160902_060051_xxxxxxxxxx.csv.gz (184.1 MB) > * my_files_2016090118_20160902_060051_xxxxxxxxxx.csv.gz (171.73 MB) > * my_files_2016090119_20160902_060051_xxxxxxxxxx.csv.gz (151.34 MB) > * my_files_2016090120_20160902_060051_xxxxxxxxxx.csv.gz (129.69 MB) > * my_files_2016090121_20160902_060051_xxxxxxxxxx.csv.gz (151.7 MB) > * my_files_2016090122_20160902_060051_xxxxxxxxxx.csv.gz (346.46 MB) > * my_files_2016090122_20160902_060051_xxxxxxxxxx.csv.gz (222.57 MB) > This code should take them all: > {code:python} > beam.io.ReadFromText( > "gs://XXXX_folder1/my_files_20160901*.csv.gz", > skip_header_lines=1, > compression_type=beam.io.filesystem.CompressionTypes.GZIP > ) > {code} > It runs well but there is only a 288.62 MB file in output of this pipeline > (instead of a 1.5 GB file). > The whole pipeline code: > {code:python} > data = (p | 'ReadMyFiles' >> beam.io.ReadFromText( > "gs://XXXX_folder1/my_files_20160901*.csv.gz", > skip_header_lines=1, > compression_type=beam.io.filesystem.CompressionTypes.GZIP > ) > | 'SplitLines' >> beam.FlatMap(lambda x: x.split('\n')) > ) > output = ( > data| "Write" >> beam.io.WriteToText('gs://XXX_folder2/test.csv', > num_shards=1) > ) > {code} > Dataflow indicates me that the estimated size of the output after the > ReadFromText step is 602.29 MB only, which not correspond to any unique input > file size nor the overall file size matching with the pattern. -- This message was sent by Atlassian JIRA (v6.4.14#64029)