[jira] [Commented] (BEAM-2717) Infer coders in SDK prior to handing off pipeline to Runner
[ https://issues.apache.org/jira/browse/BEAM-2717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17164091#comment-17164091 ]

Harrison commented on BEAM-2717:

I've uploaded a minimized demo pipeline here: [https://github.com/hgarrereyn/tfx-bsl/tree/beam-2717/demo]

This runs in Dataflow v1 but not v2.

> Infer coders in SDK prior to handing off pipeline to Runner
> ---
>
> Key: BEAM-2717
> URL: https://issues.apache.org/jira/browse/BEAM-2717
> Project: Beam
> Issue Type: Bug
> Components: sdk-py-core
> Reporter: Robert Bradshaw
> Priority: P2
> Time Spent: 3h 10m
> Remaining Estimate: 0h
>
> Currently all runners have to duplicate this work, and there's also a hack
> storing the element type rather than the coder in the Runner protos.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Commented] (BEAM-2717) Infer coders in SDK prior to handing off pipeline to Runner
[ https://issues.apache.org/jira/browse/BEAM-2717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17164064#comment-17164064 ]

Harrison commented on BEAM-2717:

I'm running into this issue when using the Dataflow v2 runner to launch a [tfx-bsl|https://github.com/tensorflow/tfx-bsl] RunInference pipeline. The same pipeline can be run with the Dataflow v1 runner without issues.

There are several feature improvements planned for tfx-bsl RunInference that require the stateful DoFn support provided by the v2 runner, so this is currently a blocker.

> Infer coders in SDK prior to handing off pipeline to Runner
> ---
>
> Key: BEAM-2717
> URL: https://issues.apache.org/jira/browse/BEAM-2717
> Project: Beam
> Issue Type: Bug
> Components: sdk-py-core
> Reporter: Robert Bradshaw
> Priority: P2
> Time Spent: 3h 10m
> Remaining Estimate: 0h
>
> Currently all runners have to duplicate this work, and there's also a hack
> storing the element type rather than the coder in the Runner protos.
[jira] [Commented] (BEAM-10192) Duplicate PubSub subscriptions with python direct runner in Jupyter/Colab environment
[ https://issues.apache.org/jira/browse/BEAM-10192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17126174#comment-17126174 ]

Harrison commented on BEAM-10192:

This may be related to BEAM-2988

> Duplicate PubSub subscriptions with python direct runner in Jupyter/Colab
> environment
> ---
>
> Key: BEAM-10192
> URL: https://issues.apache.org/jira/browse/BEAM-10192
> Project: Beam
> Issue Type: Bug
> Components: io-py-gcp
> Affects Versions: 2.22.0
> Environment: Ubuntu 18 (Colab notebook), Python SDK
> Reporter: Harrison
> Priority: P2
>
> When running a streaming pipeline on Colab with direct runner, ReadFromPubSub
> can retain old subscriptions and cause message duplication. For example,
> manually killing a cell that is running a streaming pubsub pipeline does not
> delete the pubsub subscription. If the cell is rerun, the ReadFromPubSub
> component will actually be subscribed twice which results in duplicate
> messages.
> Manually deleting old subscriptions (e.g. via the GCP dashboard) temporarily
> fixes the problem.
> This Colab notebook:
> [https://gist.github.com/hgarrereyn/64ce87cbbcbe9c34ccdd13eafe49e3fb]
> contains a runnable example of the bug.
[jira] [Updated] (BEAM-10192) Duplicate PubSub subscriptions with python direct runner in Jupyter/Colab environment
[ https://issues.apache.org/jira/browse/BEAM-10192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Harrison updated BEAM-10192:

Description:
When running a streaming pipeline on Colab with direct runner, ReadFromPubSub can retain old subscriptions and cause message duplication. For example, manually killing a cell that is running a streaming pubsub pipeline does not delete the pubsub subscription. If the cell is rerun, the ReadFromPubSub component will actually be subscribed twice which results in duplicate messages.

Manually deleting old subscriptions (e.g. via the GCP dashboard) temporarily fixes the problem.

This Colab notebook: [https://gist.github.com/hgarrereyn/64ce87cbbcbe9c34ccdd13eafe49e3fb] contains a runnable example of the bug.

was:
When running a streaming pipeline on Colab with direct runner, ReadFromPubSub can retain old subscriptions and cause message duplication. For example, manually killing a cell that is running a streaming pubsub pipeline does not delete the pubsub subscription. If the cell is rerun, the ReadFromPubSub component will actually be subscribed twice which results in duplicate messages.

Manually deleting old subscriptions (e.g. via the GCP dashboard) temporarily fixes the problem.

This [Colab notebook|[https://gist.github.com/hgarrereyn/64ce87cbbcbe9c34ccdd13eafe49e3fb]] contains a runnable example of the bug.

> Duplicate PubSub subscriptions with python direct runner in Jupyter/Colab
> environment
> ---
>
> Key: BEAM-10192
> URL: https://issues.apache.org/jira/browse/BEAM-10192
> Project: Beam
> Issue Type: Bug
> Components: io-py-gcp
> Affects Versions: 2.22.0
> Environment: Ubuntu 18 (Colab notebook), Python SDK
> Reporter: Harrison
> Priority: P2
>
> When running a streaming pipeline on Colab with direct runner, ReadFromPubSub
> can retain old subscriptions and cause message duplication. For example,
> manually killing a cell that is running a streaming pubsub pipeline does not
> delete the pubsub subscription. If the cell is rerun, the ReadFromPubSub
> component will actually be subscribed twice which results in duplicate
> messages.
> Manually deleting old subscriptions (e.g. via the GCP dashboard) temporarily
> fixes the problem.
> This Colab notebook:
> [https://gist.github.com/hgarrereyn/64ce87cbbcbe9c34ccdd13eafe49e3fb]
> contains a runnable example of the bug.
[jira] [Created] (BEAM-10192) Duplicate PubSub subscriptions with python direct runner in Jupyter/Colab environment
Harrison created BEAM-10192:

Summary: Duplicate PubSub subscriptions with python direct runner in Jupyter/Colab environment
Key: BEAM-10192
URL: https://issues.apache.org/jira/browse/BEAM-10192
Project: Beam
Issue Type: Bug
Components: io-py-gcp
Affects Versions: 2.22.0
Environment: Ubuntu 18 (Colab notebook), Python SDK
Reporter: Harrison

When running a streaming pipeline on Colab with direct runner, ReadFromPubSub can retain old subscriptions and cause message duplication. For example, manually killing a cell that is running a streaming pubsub pipeline does not delete the pubsub subscription. If the cell is rerun, the ReadFromPubSub component will actually be subscribed twice which results in duplicate messages.

Manually deleting old subscriptions (e.g. via the GCP dashboard) temporarily fixes the problem.

This [Colab notebook|[https://gist.github.com/hgarrereyn/64ce87cbbcbe9c34ccdd13eafe49e3fb]] contains a runnable example of the bug.
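
The manual cleanup mentioned in the description could also be scripted rather than done through the GCP dashboard. The following is only a rough sketch, not something taken from the issue: it assumes the `google-cloud-pubsub` client library (2.x request-style calls), and it assumes the leftover subscriptions can be recognized by a name prefix. The `beam_` default, the function names, and the `dry_run` flag are all illustrative choices, so check the actual subscription names on the topic before deleting anything.

```python
from typing import Iterable, List


def select_stale_subscriptions(
    subscription_paths: Iterable[str], prefix: str = "beam_"
) -> List[str]:
    """Return the subscription paths whose short name starts with `prefix`.

    The "beam_" default is an assumption about how the leftover
    subscriptions are named; adjust it to match what the topic shows.
    """
    stale = []
    for path in subscription_paths:
        # Paths look like "projects/<project>/subscriptions/<short name>".
        short_name = path.rsplit("/", 1)[-1]
        if short_name.startswith(prefix):
            stale.append(path)
    return stale


def delete_stale_subscriptions(
    project_id: str, topic_id: str, prefix: str = "beam_", dry_run: bool = True
) -> List[str]:
    """List the topic's subscriptions and delete the ones matching `prefix`.

    With dry_run=True (the default) nothing is deleted; the matching
    paths are only printed and returned.
    """
    # Imported lazily so the pure helper above works without the library.
    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    subscriber = pubsub_v1.SubscriberClient()
    topic_path = publisher.topic_path(project_id, topic_id)

    # list_topic_subscriptions yields subscription paths as strings.
    paths = publisher.list_topic_subscriptions(request={"topic": topic_path})
    stale = select_stale_subscriptions(paths, prefix)

    for path in stale:
        if dry_run:
            print("would delete", path)
        else:
            subscriber.delete_subscription(request={"subscription": path})
    return stale
```

In a notebook, something like `delete_stale_subscriptions("my-project", "my-topic")` (placeholder names) could be run in a cell before restarting the streaming pipeline; the call is a dry run until `dry_run=False` is passed.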