[jira] [Commented] (BEAM-1251) Python 3 Support
[ https://issues.apache.org/jira/browse/BEAM-1251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16483992#comment-16483992 ]

Jonathan Delfour commented on BEAM-1251:

I want to add that Apache Airflow 1.9.0 now runs on Python 3 and provides a Dataflow Python operator that only works in Python 2, because Apache Beam only supports Python 2. That means we _can't easily_ use Apache Beam anymore with the latest version of Airflow.

> Python 3 Support
> ----------------
>
>                 Key: BEAM-1251
>                 URL: https://issues.apache.org/jira/browse/BEAM-1251
>             Project: Beam
>          Issue Type: Improvement
>          Components: sdk-py-core
>            Reporter: Eyad Sibai
>            Priority: Trivial
>          Time Spent: 8h 50m
>  Remaining Estimate: 0h
>
> I have been trying to use Google Datalab with Python 3. As I see, there are
> several packages that do not support Python 3 yet which Google Datalab
> depends on. This is one of them.
> https://github.com/GoogleCloudPlatform/DataflowPythonSDK/issues/6

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
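Until Python 3 support lands, a pipeline or DAG module that depends on a Python-2-only package can at least fail fast with a clear message instead of dying later with a confusing SyntaxError inside the dependency. A minimal sketch; the `require_python2` helper and its error text are hypothetical, not part of Beam or Airflow:

```python
import sys

def require_python2(package="apache-beam"):
    """Fail fast when a Python-2-only dependency is about to be imported.

    Raises RuntimeError under Python 3 instead of letting the import of a
    py2-only package blow up much later with an unrelated error.
    """
    if sys.version_info[0] >= 3:
        raise RuntimeError(
            "%s currently supports only Python 2; "
            "run this pipeline under a Python 2.7 interpreter." % package
        )

# Example use: guard the Beam import at the top of a pipeline module.
# require_python2()
# import apache_beam as beam
```

The guard costs nothing under Python 2 and turns an obscure downstream failure into a one-line diagnosis under Python 3.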
[jira] [Commented] (BEAM-3757) Shuffle read failed using python 2.2.0
[ https://issues.apache.org/jira/browse/BEAM-3757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16383846#comment-16383846 ]

Jonathan Delfour commented on BEAM-3757:

I have created a support ticket with the Google team; they confirm it is not isolated and are investigating. Feel free to close this one; I believe they will contact you should there be an issue with Beam itself.

> Shuffle read failed using python 2.2.0
> --------------------------------------
>
>                 Key: BEAM-3757
>                 URL: https://issues.apache.org/jira/browse/BEAM-3757
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-dataflow
>    Affects Versions: 2.2.0
>         Environment: gcp, macos
>            Reporter: Jonathan Delfour
>            Assignee: Thomas Groh
>            Priority: Major
>
> Hi,
> First issue: the Beam 2.3.0 Python SDK is apparently not working on GCP. It gets stuck:
> {noformat}
> Workflow failed. Causes: (bf832d44290fbf41): The Dataflow appears to be
> stuck. You can get help with Cloud Dataflow at
> https://cloud.google.com/dataflow/support.
> {noformat}
> I tried twice.
> Reverting to 2.2.0: it usually works, but today, after more than one hour of
> processing with 30 workers, I get a failure with this in the logs:
> {noformat}
> Traceback (most recent call last):
>   File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 582, in do_work
>     work_executor.execute()
>   File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", line 167, in execute
>     op.start()
>   File "dataflow_worker/shuffle_operations.py", line 49, in dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start
>     def start(self):
>   File "dataflow_worker/shuffle_operations.py", line 50, in dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start
>     with self.scoped_start_state:
>   File "dataflow_worker/shuffle_operations.py", line 65, in dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start
>     with self.shuffle_source.reader() as reader:
>   File "dataflow_worker/shuffle_operations.py", line 67, in dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start
>     for key_values in reader:
>   File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/shuffle.py", line 406, in __iter__
>     for entry in entries_iterator:
>   File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/shuffle.py", line 248, in next
>     return next(self.iterator)
>   File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/shuffle.py", line 206, in __iter__
>     chunk, next_position = self.reader.Read(start_position, end_position)
>   File "third_party/windmill/shuffle/python/shuffle_client.pyx", line 138, in shuffle_client.PyShuffleReader.Read
> IOError: Shuffle read failed: INTERNAL: Received RST_STREAM with error code 2 talking to my-dataflow-02271107-756f-harness-2p65:12346
> {noformat}
> I also get an informational message:
> {noformat}
> Refusing to split at 0x7f03a00fe790> at '\x00\x00\x00\t\x1d\x14\x87\xa3\x00\x01': unstarted
> {noformat}
> For the flow, I am extracting data from BQ, cleaning it with pandas, exporting
> it as a CSV file, gzipping, and uploading the compressed file to a bucket using
> decompressive transcoding (CSV export, gzip compression, and upload are in the
> same 'worker', as they are done in the same beam.DoFn).
> PS: I can't find a reasonable way to export the logs from GCP, but I can
> privately send the log file I have of the run on my machine (the log of the
> pipeline).
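The `RST_STREAM with error code 2` in the traceback above is an HTTP/2 stream reset: per RFC 7540, section 7, code 0x2 is INTERNAL_ERROR, meaning the peer (here the shuffle service endpoint) aborted the stream. A small decoder for these codes; the helper is illustrative, not part of Beam or the Dataflow worker:

```python
# HTTP/2 RST_STREAM error codes, as defined in RFC 7540, section 7.
HTTP2_ERROR_CODES = {
    0x0: "NO_ERROR",
    0x1: "PROTOCOL_ERROR",
    0x2: "INTERNAL_ERROR",
    0x3: "FLOW_CONTROL_ERROR",
    0x4: "SETTINGS_TIMEOUT",
    0x5: "STREAM_CLOSED",
    0x6: "FRAME_SIZE_ERROR",
    0x7: "REFUSED_STREAM",
    0x8: "CANCEL",
}

def decode_rst_stream(code):
    """Translate an HTTP/2 RST_STREAM error code to its RFC 7540 name."""
    return HTTP2_ERROR_CODES.get(code, "UNKNOWN(0x%x)" % code)
```

So the worker-side log says only that the server reset the stream for an internal reason, which is consistent with the Google support finding that the problem is on the service side rather than in the SDK.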
[jira] [Updated] (BEAM-3757) Shuffle read failed using python 2.2.0
[ https://issues.apache.org/jira/browse/BEAM-3757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Delfour updated BEAM-3757:
-----------------------------------
    Description: 
Hi,
First issue: the Beam 2.3.0 Python SDK is apparently not working on GCP. It gets stuck:
{noformat}
Workflow failed. Causes: (bf832d44290fbf41): The Dataflow appears to be stuck. You can get help with Cloud Dataflow at https://cloud.google.com/dataflow/support.
{noformat}
I tried twice.
Reverting to 2.2.0: it usually works, but today, after more than one hour of processing with 30 workers, I get a failure with this in the logs (full traceback as quoted in the comment above):
{noformat}
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 582, in do_work
  ...
IOError: Shuffle read failed: INTERNAL: Received RST_STREAM with error code 2 talking to my-dataflow-02271107-756f-harness-2p65:12346
{noformat}
I also get an informational message:
{noformat}
Refusing to split at '\x00\x00\x00\t\x1d\x14\x87\xa3\x00\x01': unstarted
{noformat}
For the flow, I am extracting data from BQ, cleaning it with pandas, exporting it as a CSV file, gzipping, and uploading the compressed file to a bucket using decompressive transcoding (CSV export, gzip compression, and upload are in the same 'worker', as they are done in the same beam.DoFn).
PS: I can't find a reasonable way to export the logs from GCP, but I can privately send the log file I have of the run on my machine (the log of the pipeline).

was:
(previous revision of the description; its text is a verbatim duplicate of the above up through the shuffle traceback and is omitted here)
[jira] [Created] (BEAM-3757) Shuffle read failed using python 2.2.0
Jonathan Delfour created BEAM-3757:
-----------------------------------

             Summary: Shuffle read failed using python 2.2.0
                 Key: BEAM-3757
                 URL: https://issues.apache.org/jira/browse/BEAM-3757
             Project: Beam
          Issue Type: Bug
          Components: runner-dataflow
    Affects Versions: 2.2.0
         Environment: gcp, macos
            Reporter: Jonathan Delfour
            Assignee: Thomas Groh

Hi,
First issue: the Beam 2.3.0 Python SDK is apparently not working on GCP. It gets stuck:
{noformat}
Workflow failed. Causes: (bf832d44290fbf41): The Dataflow appears to be stuck. You can get help with Cloud Dataflow at https://cloud.google.com/dataflow/support.
{noformat}
I tried twice.
Reverting to 2.2.0: it usually works, but today, after more than one hour of processing with 30 workers, I get a failure with this in the logs (full traceback as quoted in the comment above):
{noformat}
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 582, in do_work
  ...
IOError: Shuffle read failed: INTERNAL: Received RST_STREAM with error code 2 talking to my-dataflow-02271107-756f-harness-2p65:12346
{noformat}
I also get an informational message:
{noformat}
Refusing to split at '\x00\x00\x00\t\x1d\x14\x87\xa3\x00\x01': unstarted
{noformat}
For the flow, I am extracting data from BQ, cleaning it with pandas, exporting it as a CSV file, gzipping, and uploading the compressed file to a bucket using decompressive transcoding (CSV export, gzip compression, and upload are in the same 'worker', as they are done in the same beam.DoFn).
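The flow described above relies on one GCS behavior: an object uploaded with `Content-Encoding: gzip` and a non-gzip `Content-Type` such as `text/csv` is stored compressed but served decompressed (decompressive transcoding). A minimal sketch of the compression step outside the pipeline; the helper name and the metadata dict are illustrative, not code from the reporter's DoFn:

```python
import gzip
import io

def gzip_csv_bytes(csv_text):
    """Compress CSV text to gzip bytes, as done before upload.

    In the pipeline this would run inside the single beam.DoFn that also
    writes the CSV and uploads it; it is isolated here for clarity.
    """
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
        gz.write(csv_text.encode("utf-8"))
    return buf.getvalue()

# Object metadata that enables decompressive transcoding on GCS; with the
# google-cloud-storage client this corresponds to setting
# blob.content_encoding = "gzip" and blob.content_type = "text/csv"
# before uploading. Sketched here as plain data.
TRANSCODING_HEADERS = {
    "Content-Encoding": "gzip",
    "Content-Type": "text/csv",
}
```

With that metadata set, clients downloading the object receive plain CSV while the bucket stores and transfers only the compressed bytes.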