[jira] [Commented] (BEAM-2717) Infer coders in SDK prior to handing off pipeline to Runner

2020-07-23 Thread Harrison (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-2717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17164091#comment-17164091
 ] 

Harrison commented on BEAM-2717:


I've uploaded a minimized demo pipeline here: 
[https://github.com/hgarrereyn/tfx-bsl/tree/beam-2717/demo]

This runs in Dataflow v1 but not v2

> Infer coders in SDK prior to handing off pipeline to Runner
> ---
>
> Key: BEAM-2717
> URL: https://issues.apache.org/jira/browse/BEAM-2717
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Robert Bradshaw
>Priority: P2
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> Currently all runners have to duplicate this work, and there's also a hack 
> storing the element type rather than the coder in the Runner protos.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-2717) Infer coders in SDK prior to handing off pipeline to Runner

2020-07-23 Thread Harrison (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-2717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17164064#comment-17164064
 ] 

Harrison commented on BEAM-2717:


I'm running into this issue when using Dataflow v2 runner to launch a 
[tfx-bsl|https://github.com/tensorflow/tfx-bsl] RunInference pipeline. The same 
pipeline can be run with the Dataflow v1 runner without issues.

There are several feature improvements planned for tfx-bsl RunInference that 
require stateful DoFn support provided by the v2 runner so this is currently a 
blocker.

> Infer coders in SDK prior to handing off pipeline to Runner
> ---
>
> Key: BEAM-2717
> URL: https://issues.apache.org/jira/browse/BEAM-2717
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Robert Bradshaw
>Priority: P2
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> Currently all runners have to duplicate this work, and there's also a hack 
> storing the element type rather than the coder in the Runner protos.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-10192) Duplicate PubSub subscriptions with python direct runner in Jupyter/Colab environment

2020-06-04 Thread Harrison (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-10192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17126174#comment-17126174
 ] 

Harrison commented on BEAM-10192:
-

This may be related to BEAM-2988 

> Duplicate PubSub subscriptions with python direct runner in Jupyter/Colab 
> environment
> -
>
> Key: BEAM-10192
> URL: https://issues.apache.org/jira/browse/BEAM-10192
> Project: Beam
>  Issue Type: Bug
>  Components: io-py-gcp
>Affects Versions: 2.22.0
> Environment: Ubuntu 18 (Colab notebook), Python SDK
>Reporter: Harrison
>Priority: P2
>
> When running a streaming pipeline on Colab with direct runner, ReadFromPubSub 
> can retain old subscriptions and cause message duplication. For example, 
> manually killing a cell that is running a streaming pubsub pipeline does not 
> delete the pubsub subscription. If the cell is rerun, the ReadFromPubSub 
> component will actually be subscribed twice which results in duplicate 
> messages.
> Manually deleting old subscriptions (e.g. via the GCP dashboard) temporarily 
> fixes the problem.
> This Colab notebook: 
> [https://gist.github.com/hgarrereyn/64ce87cbbcbe9c34ccdd13eafe49e3fb] 
> contains a runnable example of the bug.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-10192) Duplicate PubSub subscriptions with python direct runner in Jupyter/Colab environment

2020-06-03 Thread Harrison (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-10192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Harrison updated BEAM-10192:

Description: 
When running a streaming pipeline on Colab with direct runner, ReadFromPubSub 
can retain old subscriptions and cause message duplication. For example, 
manually killing a cell that is running a streaming pubsub pipeline does not 
delete the pubsub subscription. If the cell is rerun, the ReadFromPubSub 
component will actually be subscribed twice which results in duplicate messages.

Manually deleting old subscriptions (e.g. via the GCP dashboard) temporarily 
fixes the problem.

This Colab notebook: 
[https://gist.github.com/hgarrereyn/64ce87cbbcbe9c34ccdd13eafe49e3fb] contains 
a runnable example of the bug.

  was:
When running a streaming pipeline on Colab with direct runner, ReadFromPubSub 
can retain old subscriptions and cause message duplication. For example, 
manually killing a cell that is running a streaming pubsub pipeline does not 
delete the pubsub subscription. If the cell is rerun, the ReadFromPubSub 
component will actually be subscribed twice which results in duplicate messages.

Manually deleting old subscriptions (e.g. via the GCP dashboard) temporarily 
fixes the problem.

This [Colab 
notebook|[https://gist.github.com/hgarrereyn/64ce87cbbcbe9c34ccdd13eafe49e3fb]] 
contains a runnable example of the bug.


> Duplicate PubSub subscriptions with python direct runner in Jupyter/Colab 
> environment
> -
>
> Key: BEAM-10192
> URL: https://issues.apache.org/jira/browse/BEAM-10192
> Project: Beam
>  Issue Type: Bug
>  Components: io-py-gcp
>Affects Versions: 2.22.0
> Environment: Ubuntu 18 (Colab notebook), Python SDK
>Reporter: Harrison
>Priority: P2
>
> When running a streaming pipeline on Colab with direct runner, ReadFromPubSub 
> can retain old subscriptions and cause message duplication. For example, 
> manually killing a cell that is running a streaming pubsub pipeline does not 
> delete the pubsub subscription. If the cell is rerun, the ReadFromPubSub 
> component will actually be subscribed twice which results in duplicate 
> messages.
> Manually deleting old subscriptions (e.g. via the GCP dashboard) temporarily 
> fixes the problem.
> This Colab notebook: 
> [https://gist.github.com/hgarrereyn/64ce87cbbcbe9c34ccdd13eafe49e3fb] 
> contains a runnable example of the bug.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (BEAM-10192) Duplicate PubSub subscriptions with python direct runner in Jupyter/Colab environment

2020-06-03 Thread Harrison (Jira)
Harrison created BEAM-10192:
---

 Summary: Duplicate PubSub subscriptions with python direct runner 
in Jupyter/Colab environment
 Key: BEAM-10192
 URL: https://issues.apache.org/jira/browse/BEAM-10192
 Project: Beam
  Issue Type: Bug
  Components: io-py-gcp
Affects Versions: 2.22.0
 Environment: Ubuntu 18 (Colab notebook), Python SDK
Reporter: Harrison


When running a streaming pipeline on Colab with direct runner, ReadFromPubSub 
can retain old subscriptions and cause message duplication. For example, 
manually killing a cell that is running a streaming pubsub pipeline does not 
delete the pubsub subscription. If the cell is rerun, the ReadFromPubSub 
component will actually be subscribed twice which results in duplicate messages.

Manually deleting old subscriptions (e.g. via the GCP dashboard) temporarily 
fixes the problem.

This [Colab 
notebook|[https://gist.github.com/hgarrereyn/64ce87cbbcbe9c34ccdd13eafe49e3fb]] 
contains a runnable example of the bug.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)