[jira] [Commented] (BEAM-2490) ReadFromText function is not taking all data with glob operator (*)

2017-07-27 Thread Chamikara Jayalath (JIRA)

[ https://issues.apache.org/jira/browse/BEAM-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16103642#comment-16103642 ]

Chamikara Jayalath commented on BEAM-2490:
--

Great. I'll go ahead and close this issue.

It would be great if you could help with 
https://issues.apache.org/jira/browse/BEAM-2531. Let me know if you want that 
bug assigned to you.

> ReadFromText function is not taking all data with glob operator (*) 
> 
>
> Key: BEAM-2490
> URL: https://issues.apache.org/jira/browse/BEAM-2490
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py
>Affects Versions: 2.0.0
> Environment: Usage with Google Cloud Platform: Dataflow runner
>Reporter: Olivier NGUYEN QUOC
>Assignee: Chamikara Jayalath
> Fix For: Not applicable
>
>
> I run a very simple pipeline:
> * Read my files from Google Cloud Storage
> * Split with '\n' char
> * Write the output to Google Cloud Storage
> I have 8 files that match the pattern:
> * my_files_2016090116_20160902_060051_xx.csv.gz (229.25 MB)
> * my_files_2016090117_20160902_060051_xx.csv.gz (184.1 MB)
> * my_files_2016090118_20160902_060051_xx.csv.gz (171.73 MB)
> * my_files_2016090119_20160902_060051_xx.csv.gz (151.34 MB)
> * my_files_2016090120_20160902_060051_xx.csv.gz (129.69 MB)
> * my_files_2016090121_20160902_060051_xx.csv.gz (151.7 MB)
> * my_files_2016090122_20160902_060051_xx.csv.gz (346.46 MB)
> * my_files_2016090122_20160902_060051_xx.csv.gz (222.57 MB)
> This code should read them all:
> {code:python}
> beam.io.ReadFromText(
>   "gs://_folder1/my_files_20160901*.csv.gz",
>   skip_header_lines=1,
>   compression_type=beam.io.filesystem.CompressionTypes.GZIP
>   )
> {code}
> It runs without errors, but the output of this pipeline is only a 288.62 MB 
> file (instead of a ~1.5 GB file).
> The whole pipeline code:
> {code:python}
> data = (p | 'ReadMyFiles' >> beam.io.ReadFromText(
>   "gs://_folder1/my_files_20160901*.csv.gz",
>   skip_header_lines=1,
>   compression_type=beam.io.filesystem.CompressionTypes.GZIP
>   )
>| 'SplitLines' >> beam.FlatMap(lambda x: x.split('\n'))
> )
> output = (
>   data| "Write" >> beam.io.WriteToText('gs://XXX_folder2/test.csv', 
> num_shards=1)
> )
> {code}
> Dataflow indicates that the estimated size of the output after the 
> ReadFromText step is only 602.29 MB, which corresponds neither to any single 
> input file's size nor to the total size of the files matching the pattern.



[jira] [Commented] (BEAM-2490) ReadFromText function is not taking all data with glob operator (*)

2017-07-27 Thread Guillermo Rodríguez Cano (JIRA)

[ https://issues.apache.org/jira/browse/BEAM-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16103624#comment-16103624 ]

Guillermo Rodríguez Cano commented on BEAM-2490:


Thanks [~chamikara] for the tip with the option, and yes, I meant that the 
performance was so slow with the DirectRunner that I would never see the job 
finish (my test with just one compressed file eventually took 'only' four 
hours after all...).
The --sdk_location parameter was quite a discovery, as I hadn't seen it 
documented anywhere; I had been trying to package Beam the documented way, but 
it was not being accepted. Thanks again.

Now, with the latest Beam HEAD (as of this writing) I can safely confirm that 
the glob operator works.
I tried the same compressed files I mentioned before (first just one, though 
still using the glob operator, and then the seven aforementioned files) and 
they were fully processed, and extremely fast (I am not sure why there is such 
a performance difference compared to my laptop, as Dataflow only autoscaled to 
two n1-standard-4 workers).
Also, for the sake of the upcoming Beam release (2.1.0), I ran the same tests 
with the current release candidate (RC2), with the same successful results.

So this issue can be closed :)

The performance issue with the DirectRunner can then be addressed in the 
corresponding issue (I will help with that if needed; I am not sure if I can 
share the data, [~altay], even with some anonymization on our end, but I will 
ask...).


[jira] [Commented] (BEAM-2490) ReadFromText function is not taking all data with glob operator (*)

2017-07-27 Thread Ahmet Altay (JIRA)

[ https://issues.apache.org/jira/browse/BEAM-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16103397#comment-16103397 ]

Ahmet Altay commented on BEAM-2490:
---

[~wileeam], could you share your dataset? We could then try to reproduce this 
with the same pipeline/data on the DataflowRunner.

[jira] [Commented] (BEAM-2490) ReadFromText function is not taking all data with glob operator (*)

2017-07-27 Thread Chamikara Jayalath (JIRA)

[ https://issues.apache.org/jira/browse/BEAM-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16103390#comment-16103390 ]

Chamikara Jayalath commented on BEAM-2490:
--

To clarify, you are saying that performance is too slow when using the 
DirectRunner (and that you cannot complete your experiment because of that), 
not that you are observing data loss when running at HEAD, right? The 
performance issue is captured by https://issues.apache.org/jira/browse/BEAM-2531 
and I hope to look into it in the near future.

You should be able to complete the experiment using the DataflowRunner. Did you 
specify the option --sdk_location <path to tar.gz> when running with the 
DataflowRunner? To build the tar.gz file, use the following commands:

{code:none}
cd beam/sdks/python
python setup.py sdist
{code}
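
For illustration, a minimal sketch of what running the pipeline on Dataflow with a locally built SDK might look like (a hedged example; the project, bucket, and tarball path below are placeholders, not values from this thread):

{code:python}
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholders: substitute your own project and bucket, and point
# --sdk_location at the tarball produced by "python setup.py sdist".
options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=my-gcp-project',
    '--temp_location=gs://my-bucket/tmp',
    '--sdk_location=beam/sdks/python/dist/apache-beam-2.2.0.dev0.tar.gz',
])

with beam.Pipeline(options=options) as p:
    (p
     | beam.io.ReadFromText(
         'gs://my-bucket/my_files_20160901*.csv.gz',
         skip_header_lines=1,
         compression_type=beam.io.filesystem.CompressionTypes.GZIP)
     | beam.io.WriteToText('gs://my-bucket/output/test.csv'))
{code}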



[jira] [Commented] (BEAM-2490) ReadFromText function is not taking all data with glob operator (*)

2017-07-27 Thread Guillermo Rodríguez Cano (JIRA)

[ https://issues.apache.org/jira/browse/BEAM-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16103045#comment-16103045 ]

Guillermo Rodríguez Cano commented on BEAM-2490:


Hello again, and a quick update:

* OS: Mac OS X Sierra 10.12.6
* Apache Beam: 2.2.0.dev (i.e., HEAD of the master branch as of 8 hours ago...)
* Python: 2.7.13
* Runner: DirectRunner (so far, given the "results")

I ran pretty much the same experiment I ran at the end of June (described here: 
https://issues.apache.org/jira/browse/BEAM-2490?focusedCommentId=16063224) with 
the latest version as of HEAD of the master branch of the Apache Beam 
repository, and unfortunately the outcome is the same so far: no results.

My laptop ran this all night, and after 8 hours it had still not finished (for 
a 'job' of 8 gzipped JSON files of 200-300 MB compressed each) and produced no 
output. I also ran the same experiment with only one file in the subdirectory 
where I use the glob operator; it is still running, and although I got some 
output, I don't think it is ok that it takes more than 3 hours to process just 
one file...
Since these tests haven't finished, I couldn't test on Dataflow yet (besides, 
I still haven't figured out how to package HEAD, or a tag for that matter, of 
Beam for Dataflow. No matter what I try, I always get something along this 
line: `Could not find a version that satisfies the requirement 
apache-beam==2.1.0 (from versions: 0.6.0, 2.0.0)`. Suggestions?).

So, unfortunately, I can't confirm that this issue is really resolved. I don't 
think this issue is related to https://issues.apache.org/jira/browse/BEAM-2497 
but rather to https://issues.apache.org/jira/browse/BEAM-2531.
I suspect all files are read (hence it is likely that the glob operator works), 
but due to the performance of the decompression we can't know that for sure.
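
One way to separate those two questions (does the glob match all the files vs. is decompression just slow) would be to list the matched files directly with the SDK's filesystem API; a minimal sketch, assuming a placeholder bucket and pattern:

{code:python}
from apache_beam.io.filesystems import FileSystems

# Placeholder pattern; the real bucket/prefix are redacted in this thread.
match_results = FileSystems.match(['gs://my-bucket/my_files_20160901*.csv.gz'])
for metadata in match_results[0].metadata_list:
    # One line per matched file: all expected files should show up here.
    print('%s %d' % (metadata.path, metadata.size_in_bytes))
{code}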

[jira] [Commented] (BEAM-2490) ReadFromText function is not taking all data with glob operator (*)

2017-07-26 Thread Guillermo Rodríguez Cano (JIRA)

[ https://issues.apache.org/jira/browse/BEAM-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16102436#comment-16102436 ]

Guillermo Rodríguez Cano commented on BEAM-2490:


Sorry for the delay; I have been busy with other things at work and couldn't 
test this when I wanted to.
I will test tomorrow morning and get back to you as soon as possible.

[jira] [Commented] (BEAM-2490) ReadFromText function is not taking all data with glob operator (*)

2017-07-25 Thread Chamikara Jayalath (JIRA)

[ https://issues.apache.org/jira/browse/BEAM-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16100536#comment-16100536 ]

Chamikara Jayalath commented on BEAM-2490:
--

I suspect this was due to https://issues.apache.org/jira/browse/BEAM-2497. I 
think we can close this issue if it can no longer be reproduced with the fix 
(https://github.com/apache/beam/pull/3428) applied.

[jira] [Commented] (BEAM-2490) ReadFromText function is not taking all data with glob operator (*)

2017-07-25 Thread Ahmet Altay (JIRA)

[ https://issues.apache.org/jira/browse/BEAM-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16100426#comment-16100426 ]

Ahmet Altay commented on BEAM-2490:
---

What are the next steps on this issue? Can we close it?

[jira] [Commented] (BEAM-2490) ReadFromText function is not taking all data with glob operator (*)

2017-06-29 Thread Guillermo Rodríguez Cano (JIRA)

[ https://issues.apache.org/jira/browse/BEAM-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16069092#comment-16069092 ]

Guillermo Rodríguez Cano commented on BEAM-2490:


Awesome. Thanks [~chamikara]!

I guess this issue is then blocked until the gzip performance issue in Python 
is resolved. I'll have a look at the use of zlib over the weekend, and report 
my findings in the new issue :)
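
For instance, a rough baseline one could start from (a hedged sketch using only the standard library; the file name is a placeholder):

{code:python}
import gzip
import time

start = time.time()
count = 0
# Placeholder file name; use one of the ~200-300 MB gzipped inputs.
with gzip.open('events_20170620_22.json.gz', 'rb') as f:
    for _ in f:  # iterate line by line, as a text source would
        count += 1
print('%d lines in %.1f seconds' % (count, time.time() - start))
{code}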

[jira] [Commented] (BEAM-2490) ReadFromText function is not taking all data with glob operator (*)

2017-06-28 Thread Chamikara Jayalath (JIRA)

[ https://issues.apache.org/jira/browse/BEAM-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16066824#comment-16066824 ]

Chamikara Jayalath commented on BEAM-2490:
--

I filed a separate ticket for the performance issues mentioned above: 
https://issues.apache.org/jira/browse/BEAM-2531.

[jira] [Commented] (BEAM-2490) ReadFromText function is not taking all data with glob operator (*)

2017-06-27 Thread Olivier NGUYEN QUOC (JIRA)

[ https://issues.apache.org/jira/browse/BEAM-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16064540#comment-16064540 ]

Olivier NGUYEN QUOC commented on BEAM-2490:
---

I have also tried, on a virtual machine, running HEAD with the DirectRunner. 
It is also much slower (but my virtual machine is very slow to begin with).

I am currently trying to use HEAD on Dataflow; I have not succeeded yet.

[jira] [Commented] (BEAM-2490) ReadFromText function is not taking all data with glob operator (*)

2017-06-26 Thread Guillermo Rodríguez Cano (JIRA)

[ https://issues.apache.org/jira/browse/BEAM-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063224#comment-16063224 ]

Guillermo Rodríguez Cano commented on BEAM-2490:


Hello [~chamikara] and [~altay], and thanks for the comments,

here are some details of the setup I have used for the Direct runner so far 
(the Apache Beam version also applies to the Dataflow runner):
* OS: Mac OS X Sierra 10.12.5
* Apache Beam: 2.0.0
* Python: 2.7.13

I tried HEAD from the official repository (git hash: 
[16f87f49f20796e29d01ed363a9097ea5420583c|https://github.com/apache/beam/tree/16f87f49f20796e29d01ed363a9097ea5420583c])
 as suggested by [~altay], and I cannot conclude yet whether it works or not. 
It seems that gz files are read 'more' than before, because memory usage is 
higher than with the current Apache Beam release (and the amount of memory 
used is comparable to the case where the same non-compressed files are 
processed with the pipeline). However, it is extremely slow (again, with the 
DirectRunner), slower than with the non-compressed files.
Therefore, as a test of HEAD, I am now running only one of those gzip files, 
but the task hasn't completed (maybe I have just discovered a performance bug 
in that fix, https://github.com/apache/beam/pull/3428, because it feels very 
slow...).

I am not sure if this would be faster on GCP, but I could try it on Dataflow, 
though I am not sure I can have Dataflow run HEAD of the repository. I tried 
following the advice in the official 
[documentation|https://cloud.google.com/dataflow/pipelines/dependencies-python] 
but I have not managed to get the repository properly packaged for the workers 
to pick it up.

[jira] [Commented] (BEAM-2490) ReadFromText function is not taking all data with glob operator (*)

2017-06-24 Thread Ahmet Altay (JIRA)

[ https://issues.apache.org/jira/browse/BEAM-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16062089#comment-16062089 ]

Ahmet Altay commented on BEAM-2490:
---

The gzip issue could be related to 
https://issues.apache.org/jira/browse/BEAM-2497. [~wileeam], are you running 
against HEAD with the fix (https://github.com/apache/beam/pull/3428)?

[jira] [Commented] (BEAM-2490) ReadFromText function is not taking all data with glob operator (*)

2017-06-24 Thread Chamikara Jayalath (JIRA)

[ https://issues.apache.org/jira/browse/BEAM-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16062040#comment-16062040 ]

Chamikara Jayalath commented on BEAM-2490:
--

Hi Guillermo, what OS and Beam version are you using? Thanks.

[jira] [Commented] (BEAM-2490) ReadFromText function is not taking all data with glob operator (*)

2017-06-24 Thread Guillermo Rodríguez Cano (JIRA)

[ https://issues.apache.org/jira/browse/BEAM-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16062003#comment-16062003 ]

Guillermo Rodríguez Cano commented on BEAM-2490:


Hello again,

to corroborate my hypothesis of an issue with the compression, I added a 
Metrics counter to the code in my pipeline that processes each line of each 
input file after it has been read, and I output the metrics at the end of the 
pipeline. Tests were done with the Direct runner, though.

I clearly obtain wrong results when using the glob operator in question on 
gzip files, but not when using it with the same uncompressed files (as 
compared to running the command-line wc program on the same uncompressed 
files, which is what I expect, since the pipeline is essentially counting the 
number of lines of the input).

As an example of the 'loss of information' that happens, see below:

{code:none}
» wc input/shortlist/*
 1878646 12116709 2025493666 input/shortlist/events_20170620_22.json
 1535456 10148786 1678214474 input/shortlist/events_20170620_23.json
 1363218 9181848 1507687080 input/shortlist/events_20170621_00.json
 1281514 8727816 1428512156 input/shortlist/events_20170621_01.json
 1243974 8486567 1387581129 input/shortlist/events_20170621_02.json
 1926283 12525017 1953672115 input/shortlist/events_20170621_12.json
 1915997 12444970 1943291036 input/shortlist/events_20170621_13.json
 11145088 73631713 11924451656 total

» python sessions.py --input './input/shortlist-gzip/*' --output ./output/ --runner DirectRunner
No handlers could be found for logger "oauth2client.contrib.multistore_file"
INFO:root:Running pipeline with DirectRunner.
INFO:root:Running pipeline with DirectRunner.
INFO:root:Number of valid events processed: 177902
INFO:root:Number of invalid events seen: 23353

» python sessions.py --input './input/shortlist/*' --output ./output/ --runner DirectRunner
No handlers could be found for logger "oauth2client.contrib.multistore_file"
INFO:root:Running pipeline with DirectRunner.
INFO:root:Running pipeline with DirectRunner.
INFO:root:Number of valid events processed: 9737763
INFO:root:Number of invalid events seen: 1407325
{code}

Note how the sum of the two counters in the latter case corresponds to the 
output of the wc command.
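
For reference, a hedged sketch of the kind of counting described above (the DoFn name and the validity check are placeholders, not the actual sessions.py code):

{code:python}
import json

import apache_beam as beam
from apache_beam.metrics import Metrics

class CountEvents(beam.DoFn):
    """Counts valid/invalid events while passing valid lines through."""

    def __init__(self):
        super(CountEvents, self).__init__()
        self.valid = Metrics.counter(self.__class__, 'valid_events')
        self.invalid = Metrics.counter(self.__class__, 'invalid_events')

    def process(self, line):
        try:
            json.loads(line)  # placeholder validity check
            self.valid.inc()
            yield line
        except ValueError:
            self.invalid.inc()
{code}

After the pipeline finishes, the committed counter values can be read back via result.metrics().query(), which is presumably what the log lines above report.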

Looking forward to assisting further to solve/clarify this issue.

[jira] [Commented] (BEAM-2490) ReadFromText function is not taking all data with glob operator (*)

2017-06-23 Thread Chamikara Jayalath (JIRA)

[ https://issues.apache.org/jira/browse/BEAM-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16061345#comment-16061345 ]

Chamikara Jayalath commented on BEAM-2490:
--

Sounds good. For now I'll remove this from the 2.1.0 release blockers list.

[jira] [Commented] (BEAM-2490) ReadFromText function is not taking all data with glob operator (*)

2017-06-23 Thread Olivier NGUYEN QUOC (JIRA)

[ https://issues.apache.org/jira/browse/BEAM-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16060478#comment-16060478 ]

Olivier NGUYEN QUOC commented on BEAM-2490:
---

I will be able to give some example files next week (not before, sorry).

[jira] [Commented] (BEAM-2490) ReadFromText function is not taking all data with glob operator (*)

2017-06-22 Thread Chamikara Jayalath (JIRA)

[ https://issues.apache.org/jira/browse/BEAM-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16060384#comment-16060384 ]

Chamikara Jayalath commented on BEAM-2490:
--

Thanks Guillermo and Ahmet. Guillermo, let us know if you run into other issues.

Olivier, I ran my experiment with an input comparable to yours (8 100 MB gzip 
files), both at Beam HEAD and with the 2.0.0 release, and I am still not 
observing any data loss.

[jira] [Commented] (BEAM-2490) ReadFromText function is not taking all data with glob operator (*)

2017-06-22 Thread Guillermo Rodríguez Cano (JIRA)

[ https://issues.apache.org/jira/browse/BEAM-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16060219#comment-16060219 ]

Guillermo Rodríguez Cano commented on BEAM-2490:


Thanks [~altay]. My bad, I did not notice that until now. I tested it again 
and indeed that was the problem.

[jira] [Commented] (BEAM-2490) ReadFromText function is not taking all data with glob operator (*)

2017-06-22 Thread Ahmet Altay (JIRA)

[ https://issues.apache.org/jira/browse/BEAM-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16060191#comment-16060191 ]

Ahmet Altay commented on BEAM-2490:
---

Hi [~wileeam], it looks like your problem is that your shell is expanding the * 
in {{./input/test/*}}. Assuming that you are using argparse in the standard 
way, only the first input will be used. Please check that {{known_args.input}} 
contains the unexpanded version of the glob. (Typically you can ensure this by 
quoting it, i.e. {{python sessions_manager.py --input './input/test/*' --output 
./output/ --runner DirectRunner}}.)
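
For illustration, a hedged sketch of the standard argparse pattern assumed here (the flag names are placeholders for whatever sessions_manager.py actually defines):

{code:python}
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--input',
                    help='Input pattern; quote globs so the shell does not expand them')
parser.add_argument('--output', help='Output path prefix')
known_args, pipeline_args = parser.parse_known_args()

# If the shell expanded the glob, argv holds many file names and argparse
# keeps only the first one here; with './input/test/*' quoted, this prints
# the literal pattern instead.
print(known_args.input)
{code}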

[jira] [Commented] (BEAM-2490) ReadFromText function is not taking all data with glob operator (*)

2017-06-21 Thread Chamikara Jayalath (JIRA)

[ https://issues.apache.org/jira/browse/BEAM-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16058426#comment-16058426 ]

Chamikara Jayalath commented on BEAM-2490:
--

I ran the same pipeline with a similar but smaller input (8 1 MB gzip files) 
but didn't observe any data loss. Is it possible to share your input (or a 
smaller sample of it from which the issue can be reproduced)?

Also, can you try running without the 'SplitLines' step? ReadFromText already 
splits its input on end-of-line characters, so 'SplitLines' should not have 
any effect.
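
A minimal sketch of the pipeline with that step dropped (bucket names are placeholders, as in the original report):

{code:python}
import apache_beam as beam

with beam.Pipeline() as p:
    (p
     | 'ReadMyFiles' >> beam.io.ReadFromText(
         'gs://my-bucket/my_files_20160901*.csv.gz',
         skip_header_lines=1,
         compression_type=beam.io.filesystem.CompressionTypes.GZIP)
     # No 'SplitLines' step: ReadFromText already emits one element per line.
     | 'Write' >> beam.io.WriteToText('gs://my-bucket/output/test.csv',
                                      num_shards=1))
{code}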

Size estimation, as the name suggests, is only an estimate, and we use 
sampling here, so there can be variations between the estimated size displayed 
for the step and the actual size observed.
