[jira] [Commented] (BEAM-1251) Python 3 Support

2018-05-22 Thread Jonathan Delfour (JIRA)

[ https://issues.apache.org/jira/browse/BEAM-1251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16483992#comment-16483992 ]

Jonathan Delfour commented on BEAM-1251:


I want to add that Apache Airflow 1.9.0 now runs on Python 3 and provides a 
Dataflow Python operator, but that operator only works in Python 2 because 
Apache Beam only supports Python 2.
That means we _can't easily_ use Apache Beam anymore with the latest version of 
Airflow...
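As a stop-gap, a pipeline launcher can at least fail fast when started under the wrong interpreter instead of failing later inside the SDK. A minimal sketch (plain Python, not Airflow-specific; the function name is hypothetical):

```python
import sys

def require_python2():
    # apache-beam (as of the 2.x SDKs in 2018) only runs on Python 2,
    # so refuse to launch the pipeline under any other major version
    # with a clear error message.
    if sys.version_info[0] != 2:
        raise RuntimeError(
            "apache-beam requires Python 2; this interpreter is %d.%d"
            % (sys.version_info[0], sys.version_info[1]))
```

Calling `require_python2()` at the top of the launcher script makes the version mismatch obvious at submit time rather than deep in a worker log.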

> Python 3 Support
> 
>
> Key: BEAM-1251
> URL: https://issues.apache.org/jira/browse/BEAM-1251
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-py-core
>Reporter: Eyad Sibai
>Priority: Trivial
>  Time Spent: 8h 50m
>  Remaining Estimate: 0h
>
> I have been trying to use Google Datalab with Python 3. As I see it, there 
> are several packages that do not support Python 3 yet which Google Datalab 
> depends on. This is one of them.
> https://github.com/GoogleCloudPlatform/DataflowPythonSDK/issues/6



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (BEAM-3757) Shuffle read failed using python 2.2.0

2018-03-02 Thread Jonathan Delfour (JIRA)

[ https://issues.apache.org/jira/browse/BEAM-3757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16383846#comment-16383846 ]

Jonathan Delfour commented on BEAM-3757:


I have created a support ticket with the Google team; they confirm it is not 
isolated and are investigating.
Feel free to close this one, I believe they will contact you should there be an 
issue with Beam itself.

> Shuffle read failed using python 2.2.0
> --
>
> Key: BEAM-3757
> URL: https://issues.apache.org/jira/browse/BEAM-3757
> Project: Beam
>  Issue Type: Bug
>  Components: runner-dataflow
>Affects Versions: 2.2.0
> Environment: gcp, macos
>Reporter: Jonathan Delfour
>Assignee: Thomas Groh
>Priority: Major
>
> Hi,
> First issue is that the Beam 2.3.0 Python SDK is apparently not working on 
> GCP. It gets stuck: 
> {noformat}
> Workflow failed. Causes: (bf832d44290fbf41): The Dataflow appears to be stuck. You can get help with Cloud Dataflow at https://cloud.google.com/dataflow/support. 
> {noformat}
> I tried twice.
> Reverting back to 2.2.0 usually works, but today, after more than an hour of 
> processing with 30 workers, I get a failure with this in the logs:
> {noformat}
> Traceback (most recent call last):
>   File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 582, in do_work
>     work_executor.execute()
>   File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", line 167, in execute
>     op.start()
>   File "dataflow_worker/shuffle_operations.py", line 49, in dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start
>     def start(self):
>   File "dataflow_worker/shuffle_operations.py", line 50, in dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start
>     with self.scoped_start_state:
>   File "dataflow_worker/shuffle_operations.py", line 65, in dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start
>     with self.shuffle_source.reader() as reader:
>   File "dataflow_worker/shuffle_operations.py", line 67, in dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start
>     for key_values in reader:
>   File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/shuffle.py", line 406, in __iter__
>     for entry in entries_iterator:
>   File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/shuffle.py", line 248, in next
>     return next(self.iterator)
>   File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/shuffle.py", line 206, in __iter__
>     chunk, next_position = self.reader.Read(start_position, end_position)
>   File "third_party/windmill/shuffle/python/shuffle_client.pyx", line 138, in shuffle_client.PyShuffleReader.Read
> IOError: Shuffle read failed: INTERNAL: Received RST_STREAM with error code 2 talking to my-dataflow-02271107-756f-harness-2p65:12346
> {noformat}
> I also get an informational message:
> {noformat}
> Refusing to split  at 0x7f03a00fe790> at '\x00\x00\x00\t\x1d\x14\x87\xa3\x00\x01': unstarted
> {noformat}
> For the flow, I am extracting data from BQ, cleaning it with pandas, 
> exporting it as a CSV file, gzipping it, and uploading the compressed file to 
> a bucket using decompressive transcoding (the CSV export, gzip compression, 
> and upload happen in the same 'worker', as they are done in the same 
> beam.DoFn).
> PS: I can't find a reasonable way to export the logs from GCP, but I can 
> privately send the log file I have of the run on my machine (the log of the 
> pipeline).





[jira] [Updated] (BEAM-3757) Shuffle read failed using python 2.2.0

2018-02-27 Thread Jonathan Delfour (JIRA)

 [ https://issues.apache.org/jira/browse/BEAM-3757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Delfour updated BEAM-3757:
---
Description: 
Hi,

First issue is that the Beam 2.3.0 Python SDK is apparently not working on GCP. 
It gets stuck: 
{noformat}
Workflow failed. Causes: (bf832d44290fbf41): The Dataflow appears to be stuck. You can get help with Cloud Dataflow at https://cloud.google.com/dataflow/support. 
{noformat}
I tried twice.

Reverting back to 2.2.0 usually works, but today, after more than an hour of 
processing with 30 workers, I get a failure with this in the logs:

{noformat}
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 582, in do_work
    work_executor.execute()
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", line 167, in execute
    op.start()
  File "dataflow_worker/shuffle_operations.py", line 49, in dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start
    def start(self):
  File "dataflow_worker/shuffle_operations.py", line 50, in dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start
    with self.scoped_start_state:
  File "dataflow_worker/shuffle_operations.py", line 65, in dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start
    with self.shuffle_source.reader() as reader:
  File "dataflow_worker/shuffle_operations.py", line 67, in dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start
    for key_values in reader:
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/shuffle.py", line 406, in __iter__
    for entry in entries_iterator:
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/shuffle.py", line 248, in next
    return next(self.iterator)
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/shuffle.py", line 206, in __iter__
    chunk, next_position = self.reader.Read(start_position, end_position)
  File "third_party/windmill/shuffle/python/shuffle_client.pyx", line 138, in shuffle_client.PyShuffleReader.Read
IOError: Shuffle read failed: INTERNAL: Received RST_STREAM with error code 2 talking to my-dataflow-02271107-756f-harness-2p65:12346
{noformat}

I also get an informational message:

{noformat}
Refusing to split  at '\x00\x00\x00\t\x1d\x14\x87\xa3\x00\x01': unstarted
{noformat}

For the flow, I am extracting data from BQ, cleaning it with pandas, exporting 
it as a CSV file, gzipping it, and uploading the compressed file to a bucket 
using decompressive transcoding (the CSV export, gzip compression, and upload 
happen in the same 'worker', as they are done in the same beam.DoFn).
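For reference, the export-and-compress half of that step can be sketched with the standard library alone (Python 3 shown for brevity; the function and all names here are illustrative, not the reporter's actual DoFn, and the upload itself is omitted). The key to decompressive transcoding is uploading the gzipped object with `Content-Type: text/csv` and `Content-Encoding: gzip` metadata, so GCS can serve it decompressed:

```python
import csv
import gzip
import io

def csv_to_gzip_blob(rows, header):
    # Render the rows to CSV in memory, then gzip-compress the bytes.
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(header)
    writer.writerows(rows)
    blob = gzip.compress(buf.getvalue().encode("utf-8"))
    # For decompressive transcoding, the compressed blob must be uploaded
    # with the *uncompressed* content type plus Content-Encoding: gzip.
    metadata = {"Content-Type": "text/csv", "Content-Encoding": "gzip"}
    return blob, metadata
```

Doing the export, compression, and upload inside a single DoFn keeps them on the same worker, as described above, at the cost of holding the whole file in that worker's memory.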

PS: I can't find a reasonable way to export the logs from GCP, but I can 
privately send the log file I have of the run on my machine (the log of the 
pipeline).


[jira] [Created] (BEAM-3757) Shuffle read failed using python 2.2.0

2018-02-27 Thread Jonathan Delfour (JIRA)
Jonathan Delfour created BEAM-3757:
--

 Summary: Shuffle read failed using python 2.2.0
 Key: BEAM-3757
 URL: https://issues.apache.org/jira/browse/BEAM-3757
 Project: Beam
  Issue Type: Bug
  Components: runner-dataflow
Affects Versions: 2.2.0
 Environment: gcp, macos
Reporter: Jonathan Delfour
Assignee: Thomas Groh


Hi,

First issue is that the Beam 2.3.0 Python SDK is apparently not working on GCP. 
It gets stuck: 
{noformat}
Workflow failed. Causes: (bf832d44290fbf41): The Dataflow appears to be stuck. You can get help with Cloud Dataflow at https://cloud.google.com/dataflow/support. 
{noformat}
I tried twice.

Reverting back to 2.2.0 usually works, but today, after more than an hour of 
processing with 30 workers, I get a failure with this in the logs:

{noformat}
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 582, in do_work
    work_executor.execute()
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", line 167, in execute
    op.start()
  File "dataflow_worker/shuffle_operations.py", line 49, in dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start
    def start(self):
  File "dataflow_worker/shuffle_operations.py", line 50, in dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start
    with self.scoped_start_state:
  File "dataflow_worker/shuffle_operations.py", line 65, in dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start
    with self.shuffle_source.reader() as reader:
  File "dataflow_worker/shuffle_operations.py", line 67, in dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start
    for key_values in reader:
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/shuffle.py", line 406, in __iter__
    for entry in entries_iterator:
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/shuffle.py", line 248, in next
    return next(self.iterator)
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/shuffle.py", line 206, in __iter__
    chunk, next_position = self.reader.Read(start_position, end_position)
  File "third_party/windmill/shuffle/python/shuffle_client.pyx", line 138, in shuffle_client.PyShuffleReader.Read
IOError: Shuffle read failed: INTERNAL: Received RST_STREAM with error code 2 talking to my-dataflow-02271107-756f-harness-2p65:12346
{noformat}

I also get an informational message:

{noformat}
Refusing to split  at '\x00\x00\x00\t\x1d\x14\x87\xa3\x00\x01': unstarted
{noformat}

For the flow, I am extracting data from BQ, cleaning it with pandas, exporting 
it as a CSV file, gzipping it, and uploading the compressed file to a bucket 
using decompressive transcoding (the CSV export, gzip compression, and upload 
happen in the same 'worker', as they are done in the same beam.DoFn).



