[jira] [Created] (BEAM-10189) Add ValueState to python sdk

2020-06-03 Thread Yichi Zhang (Jira)
Yichi Zhang created BEAM-10189:
--

 Summary: Add ValueState to python sdk
 Key: BEAM-10189
 URL: https://issues.apache.org/jira/browse/BEAM-10189
 Project: Beam
  Issue Type: Improvement
  Components: sdk-py-core
Reporter: Yichi Zhang
Assignee: Yichi Zhang
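
For context, a minimal sketch of how a per-key "last value" currently has to be emulated with a bag state in the python sdk (clear + add), which is the pattern a dedicated ValueState would replace; the DoFn name and the integer value type are illustrative only, not the proposed API.

{code:python}
import apache_beam as beam
from apache_beam.coders import VarIntCoder
from apache_beam.transforms.userstate import BagStateSpec


class LastValueDoFn(beam.DoFn):
  """Emulates a per-key value state with a single-element bag state."""
  LAST_SEEN = BagStateSpec('last_seen', VarIntCoder())

  def process(self, element, last_seen=beam.DoFn.StateParam(LAST_SEEN)):
    key, value = element
    previous = next(iter(last_seen.read()), None)  # at most one element is kept
    last_seen.clear()
    last_seen.add(value)
    yield key, (previous, value)
{code}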






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (BEAM-9603) Support Dynamic Timer in Java SDK over FnApi

2020-06-02 Thread Yichi Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yichi Zhang resolved BEAM-9603.
---
Fix Version/s: 2.22.0
   Resolution: Fixed

> Support Dynamic Timer in Java SDK over FnApi
> 
>
> Key: BEAM-9603
> URL: https://issues.apache.org/jira/browse/BEAM-9603
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-java-harness
>Reporter: Boyuan Zhang
>Assignee: Yichi Zhang
>Priority: P2
>  Labels: stale-assigned
> Fix For: 2.22.0
>
>  Time Spent: 8.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (BEAM-9263) Bump python sdk fnapi version to enable status reporting

2020-06-02 Thread Yichi Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yichi Zhang resolved BEAM-9263.
---
Fix Version/s: 2.20.0
   Resolution: Fixed

> Bump python sdk fnapi version to enable status reporting
> 
>
> Key: BEAM-9263
> URL: https://issues.apache.org/jira/browse/BEAM-9263
> Project: Beam
>  Issue Type: Task
>  Components: sdk-py-core
>Affects Versions: 2.20.0
>Reporter: Yichi Zhang
>Assignee: Yichi Zhang
>Priority: P3
>  Labels: stale-assigned
> Fix For: 2.20.0
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Bump the python sdk fn api environment version to 8 to roll out the status 
> reporting feature for the sdk harness.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-10153) Java SDK Harness throws NPE for VoidCoder as key coder

2020-05-29 Thread Yichi Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-10153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yichi Zhang updated BEAM-10153:
---
Description: 
When using VoidCoder as the key coder, a NullPointerException will be thrown when 
processing an element.

{code:java}
java.lang.NullPointerException
at 
org.apache.beam.model.fnexecution.v1.BeamFnApi$StateKey$BagUserState$Builder.setKey(BeamFnApi.java:39390)
at 
org.apache.beam.fn.harness.state.FnApiStateAccessor.createBagUserStateKey(FnApiStateAccessor.java:444)
at 
org.apache.beam.fn.harness.state.FnApiStateAccessor.bindValue(FnApiStateAccessor.java:195)
at 
org.apache.beam.sdk.state.StateSpecs$ValueStateSpec.bind(StateSpecs.java:327)
at 
org.apache.beam.sdk.state.StateSpecs$ValueStateSpec.bind(StateSpecs.java:317)
at 
org.apache.beam.fn.harness.FnApiDoFnRunner$ProcessBundleContext.state(FnApiDoFnRunner.java:1432)
at 
org.apache.beam.sdk.transforms.ParDoTest$TimerTests$TwoTimerTest$1$DoFnInvoker.invokeProcessElement(Unknown
 Source)
at 
org.apache.beam.fn.harness.FnApiDoFnRunner.processElementForParDo(FnApiDoFnRunner.java:740)
at 
org.apache.beam.fn.harness.FnApiDoFnRunner.access$700(FnApiDoFnRunner.java:132)
at 
org.apache.beam.fn.harness.FnApiDoFnRunner$Factory.lambda$createRunnerForPTransform$1(FnApiDoFnRunner.java:203)
at 
org.apache.beam.fn.harness.data.PCollectionConsumerRegistry$MetricTrackingFnDataReceiver.accept(PCollectionConsumerRegistry.java:216)
at 
org.apache.beam.fn.harness.data.PCollectionConsumerRegistry$MetricTrackingFnDataReceiver.accept(PCollectionConsumerRegistry.java:179)
at 
org.apache.beam.fn.harness.BeamFnDataReadRunner.forwardElementToConsumer(BeamFnDataReadRunner.java:204)
at 
org.apache.beam.fn.harness.data.QueueingBeamFnDataClient.drainAndBlock(QueueingBeamFnDataClient.java:106)
at 
org.apache.beam.fn.harness.control.ProcessBundleHandler.processBundle(ProcessBundleHandler.java:295)
at 
org.apache.beam.fn.harness.control.BeamFnControlClient.delegateOnInstructionRequestType(BeamFnControlClient.java:173)
at 
org.apache.beam.fn.harness.control.BeamFnControlClient.lambda$processInstructionRequests$0(BeamFnControlClient.java:157)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
{code}

This fails ValidatesRunner tests such as 
https://github.com/apache/beam/blob/03d99dfa359f44a29a772fcc8ec8b0a237cab113/sdks/java/core/src/test/java/org/apache/beam/sdk/transforms/ParDoTest.java#L4120
 on the dataflow runner with beam_fn_api.



  was:
When using VoidCoder as the key coder, NullPointerException will be thrown for 
accessing state.

{code:java}
java.lang.NullPointerException
at 
org.apache.beam.model.fnexecution.v1.BeamFnApi$StateKey$BagUserState$Builder.setKey(BeamFnApi.java:39390)
at 
org.apache.beam.fn.harness.state.FnApiStateAccessor.createBagUserStateKey(FnApiStateAccessor.java:444)
at 
org.apache.beam.fn.harness.state.FnApiStateAccessor.bindValue(FnApiStateAccessor.java:195)
at 
org.apache.beam.sdk.state.StateSpecs$ValueStateSpec.bind(StateSpecs.java:327)
at 
org.apache.beam.sdk.state.StateSpecs$ValueStateSpec.bind(StateSpecs.java:317)
at 
org.apache.beam.fn.harness.FnApiDoFnRunner$ProcessBundleContext.state(FnApiDoFnRunner.java:1432)
at 
org.apache.beam.sdk.transforms.ParDoTest$TimerTests$TwoTimerTest$1$DoFnInvoker.invokeProcessElement(Unknown
 Source)
at 
org.apache.beam.fn.harness.FnApiDoFnRunner.processElementForParDo(FnApiDoFnRunner.java:740)
at 
org.apache.beam.fn.harness.FnApiDoFnRunner.access$700(FnApiDoFnRunner.java:132)
at 
org.apache.beam.fn.harness.FnApiDoFnRunner$Factory.lambda$createRunnerForPTransform$1(FnApiDoFnRunner.java:203)
at 
org.apache.beam.fn.harness.data.PCollectionConsumerRegistry$MetricTrackingFnDataReceiver.accept(PCollectionConsumerRegistry.java:216)
at 
org.apache.beam.fn.harness.data.PCollectionConsumerRegistry$MetricTrackingFnDataReceiver.accept(PCollectionConsumerRegistry.java:179)
at 
org.apache.beam.fn.harness.BeamFnDataReadRunner.forwardElementToConsumer(BeamFnDataReadRunner.java:204)
at 
org.apache.beam.fn.harness.data.QueueingBeamFnDataClient.drainAndBlock(QueueingBeamFnDataClient.java:106)
at 
org.apache.beam.fn.harness.control.ProcessBundleHandler.processBundle(ProcessBundleHandler.java:295)
at 
org.apache.beam.fn.harness.control.BeamFnControlClient.delegateOnInstructionRequestType(BeamFnControlClient.java:173)
at 
org.apache.beam.fn.harness.control.BeamFnControlClient.lambda$processInstructionRequests$0(BeamFnControlClient.java:157)
at 

[jira] [Created] (BEAM-10153) Java SDK Harness throws NPE for VoidCoder for key coder

2020-05-29 Thread Yichi Zhang (Jira)
Yichi Zhang created BEAM-10153:
--

 Summary: Java SDK Harness throws NPE for VoidCoder for key coder
 Key: BEAM-10153
 URL: https://issues.apache.org/jira/browse/BEAM-10153
 Project: Beam
  Issue Type: Bug
  Components: sdk-java-harness
Reporter: Yichi Zhang


When using VoidCoder as the key coder, a NullPointerException will be thrown when 
accessing state.

{code:java}
java.lang.NullPointerException
at 
org.apache.beam.model.fnexecution.v1.BeamFnApi$StateKey$BagUserState$Builder.setKey(BeamFnApi.java:39390)
at 
org.apache.beam.fn.harness.state.FnApiStateAccessor.createBagUserStateKey(FnApiStateAccessor.java:444)
at 
org.apache.beam.fn.harness.state.FnApiStateAccessor.bindValue(FnApiStateAccessor.java:195)
at 
org.apache.beam.sdk.state.StateSpecs$ValueStateSpec.bind(StateSpecs.java:327)
at 
org.apache.beam.sdk.state.StateSpecs$ValueStateSpec.bind(StateSpecs.java:317)
at 
org.apache.beam.fn.harness.FnApiDoFnRunner$ProcessBundleContext.state(FnApiDoFnRunner.java:1432)
at 
org.apache.beam.sdk.transforms.ParDoTest$TimerTests$TwoTimerTest$1$DoFnInvoker.invokeProcessElement(Unknown
 Source)
at 
org.apache.beam.fn.harness.FnApiDoFnRunner.processElementForParDo(FnApiDoFnRunner.java:740)
at 
org.apache.beam.fn.harness.FnApiDoFnRunner.access$700(FnApiDoFnRunner.java:132)
at 
org.apache.beam.fn.harness.FnApiDoFnRunner$Factory.lambda$createRunnerForPTransform$1(FnApiDoFnRunner.java:203)
at 
org.apache.beam.fn.harness.data.PCollectionConsumerRegistry$MetricTrackingFnDataReceiver.accept(PCollectionConsumerRegistry.java:216)
at 
org.apache.beam.fn.harness.data.PCollectionConsumerRegistry$MetricTrackingFnDataReceiver.accept(PCollectionConsumerRegistry.java:179)
at 
org.apache.beam.fn.harness.BeamFnDataReadRunner.forwardElementToConsumer(BeamFnDataReadRunner.java:204)
at 
org.apache.beam.fn.harness.data.QueueingBeamFnDataClient.drainAndBlock(QueueingBeamFnDataClient.java:106)
at 
org.apache.beam.fn.harness.control.ProcessBundleHandler.processBundle(ProcessBundleHandler.java:295)
at 
org.apache.beam.fn.harness.control.BeamFnControlClient.delegateOnInstructionRequestType(BeamFnControlClient.java:173)
at 
org.apache.beam.fn.harness.control.BeamFnControlClient.lambda$processInstructionRequests$0(BeamFnControlClient.java:157)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
{code}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-10153) Java SDK Harness throws NPE for VoidCoder as key coder

2020-05-29 Thread Yichi Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-10153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yichi Zhang updated BEAM-10153:
---
Summary: Java SDK Harness throws NPE for VoidCoder as key coder  (was: Java 
SDK Harness throws NPE for VoidCoder for key coder)

> Java SDK Harness throws NPE for VoidCoder as key coder
> --
>
> Key: BEAM-10153
> URL: https://issues.apache.org/jira/browse/BEAM-10153
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-harness
>Reporter: Yichi Zhang
>Priority: P2
>
> When using VoidCoder as the key coder, a NullPointerException will be thrown 
> when accessing state.
> {code:java}
> java.lang.NullPointerException
>   at 
> org.apache.beam.model.fnexecution.v1.BeamFnApi$StateKey$BagUserState$Builder.setKey(BeamFnApi.java:39390)
>   at 
> org.apache.beam.fn.harness.state.FnApiStateAccessor.createBagUserStateKey(FnApiStateAccessor.java:444)
>   at 
> org.apache.beam.fn.harness.state.FnApiStateAccessor.bindValue(FnApiStateAccessor.java:195)
>   at 
> org.apache.beam.sdk.state.StateSpecs$ValueStateSpec.bind(StateSpecs.java:327)
>   at 
> org.apache.beam.sdk.state.StateSpecs$ValueStateSpec.bind(StateSpecs.java:317)
>   at 
> org.apache.beam.fn.harness.FnApiDoFnRunner$ProcessBundleContext.state(FnApiDoFnRunner.java:1432)
>   at 
> org.apache.beam.sdk.transforms.ParDoTest$TimerTests$TwoTimerTest$1$DoFnInvoker.invokeProcessElement(Unknown
>  Source)
>   at 
> org.apache.beam.fn.harness.FnApiDoFnRunner.processElementForParDo(FnApiDoFnRunner.java:740)
>   at 
> org.apache.beam.fn.harness.FnApiDoFnRunner.access$700(FnApiDoFnRunner.java:132)
>   at 
> org.apache.beam.fn.harness.FnApiDoFnRunner$Factory.lambda$createRunnerForPTransform$1(FnApiDoFnRunner.java:203)
>   at 
> org.apache.beam.fn.harness.data.PCollectionConsumerRegistry$MetricTrackingFnDataReceiver.accept(PCollectionConsumerRegistry.java:216)
>   at 
> org.apache.beam.fn.harness.data.PCollectionConsumerRegistry$MetricTrackingFnDataReceiver.accept(PCollectionConsumerRegistry.java:179)
>   at 
> org.apache.beam.fn.harness.BeamFnDataReadRunner.forwardElementToConsumer(BeamFnDataReadRunner.java:204)
>   at 
> org.apache.beam.fn.harness.data.QueueingBeamFnDataClient.drainAndBlock(QueueingBeamFnDataClient.java:106)
>   at 
> org.apache.beam.fn.harness.control.ProcessBundleHandler.processBundle(ProcessBundleHandler.java:295)
>   at 
> org.apache.beam.fn.harness.control.BeamFnControlClient.delegateOnInstructionRequestType(BeamFnControlClient.java:173)
>   at 
> org.apache.beam.fn.harness.control.BeamFnControlClient.lambda$processInstructionRequests$0(BeamFnControlClient.java:157)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (BEAM-10112) Add python sdk state and timer examples to website

2020-05-27 Thread Yichi Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-10112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yichi Zhang reassigned BEAM-10112:
--

Assignee: Yichi Zhang

> Add python sdk state and timer examples to website
> --
>
> Key: BEAM-10112
> URL: https://issues.apache.org/jira/browse/BEAM-10112
> Project: Beam
>  Issue Type: Improvement
>  Components: website
>Reporter: Yichi Zhang
>Assignee: Yichi Zhang
>Priority: P2
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (BEAM-10112) Add python sdk state and timer examples to website

2020-05-27 Thread Yichi Zhang (Jira)
Yichi Zhang created BEAM-10112:
--

 Summary: Add python sdk state and timer examples to website
 Key: BEAM-10112
 URL: https://issues.apache.org/jira/browse/BEAM-10112
 Project: Beam
  Issue Type: Improvement
  Components: website
Reporter: Yichi Zhang
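
For illustration, a minimal sketch of the kind of snippet such a website section could show: a stateful DoFn that counts elements per key and flushes the count once the watermark passes the end of the window, using the existing userstate API (CombiningValueStateSpec, TimerSpec, on_timer); the DoFn name and output shape are illustrative.

{code:python}
import apache_beam as beam
from apache_beam.coders import VarIntCoder
from apache_beam.transforms.timeutil import TimeDomain
from apache_beam.transforms.userstate import (
    CombiningValueStateSpec, TimerSpec, on_timer)


class CountPerKeyDoFn(beam.DoFn):
  COUNT = CombiningValueStateSpec('count', VarIntCoder(), sum)
  FLUSH = TimerSpec('flush', TimeDomain.WATERMARK)

  def process(self, element,
              window=beam.DoFn.WindowParam,
              count=beam.DoFn.StateParam(COUNT),
              flush=beam.DoFn.TimerParam(FLUSH)):
    count.add(1)
    # Fire once the watermark reaches the end of the current window.
    flush.set(window.end)

  @on_timer(FLUSH)
  def flush_count(self, key=beam.DoFn.KeyParam,
                  count=beam.DoFn.StateParam(COUNT)):
    yield key, count.read()
    count.clear()
{code}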






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (BEAM-9602) Support Dynamic Timer in Python SDK over FnApi

2020-05-26 Thread Yichi Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yichi Zhang reassigned BEAM-9602:
-

Assignee: Yichi Zhang

> Support Dynamic Timer in Python SDK over FnApi
> --
>
> Key: BEAM-9602
> URL: https://issues.apache.org/jira/browse/BEAM-9602
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-py-harness
>Reporter: Boyuan Zhang
>Assignee: Yichi Zhang
>Priority: P2
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-10002) Mongo cursor timeout leads to CursorNotFound error

2020-05-15 Thread Yichi Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-10002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17108429#comment-17108429
 ] 

Yichi Zhang commented on BEAM-10002:


Thanks for the issue report [~corvin]. Would you also like to contribute a PR to 
address it? I'm not sure when I'll get time to work on this at the moment.

> Mongo cursor timeout leads to CursorNotFound error
> --
>
> Key: BEAM-10002
> URL: https://issues.apache.org/jira/browse/BEAM-10002
> Project: Beam
>  Issue Type: Bug
>  Components: io-py-mongodb
>Affects Versions: 2.20.0
>Reporter: Corvin Deboeser
>Assignee: Yichi Zhang
>Priority: Major
>
> If some work items take a lot of processing time and the cursor of a bundle 
> is not queried for a long time, then mongodb will time out the cursor, which 
> results in
> {code:java}
> pymongo.errors.CursorNotFound: cursor id ... not found
> {code}
>  
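
For context, a minimal standalone sketch of the underlying pymongo behavior and one possible mitigation (opting out of the server-side idle timeout); whether mongodbio should do this, shrink batches, or re-create the cursor on failure is left open here, and the connection string and collection names are placeholders.

{code:python}
from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017')  # placeholder connection
collection = client.mydb.mycollection              # placeholder db/collection

# By default the server kills cursors after ~10 minutes of inactivity, which
# surfaces as pymongo.errors.CursorNotFound when a bundle stalls on slow work.
# no_cursor_timeout=True keeps the cursor alive but must be closed explicitly.
cursor = collection.find({}, no_cursor_timeout=True, batch_size=100)
try:
  for doc in cursor:
    handle(doc)  # placeholder for the slow per-document processing
finally:
  cursor.close()
{code}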



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (BEAM-9940) Dataflow runner not setting timer family specs for TimerDeclaration annotation

2020-05-08 Thread Yichi Zhang (Jira)
Yichi Zhang created BEAM-9940:
-

 Summary: Dataflow runner not setting timer family specs for 
TimerDeclaration annotation
 Key: BEAM-9940
 URL: https://issues.apache.org/jira/browse/BEAM-9940
 Project: Beam
  Issue Type: Bug
  Components: runner-dataflow
Reporter: Yichi Zhang






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (BEAM-9904) python nexmark query pipelines

2020-05-07 Thread Yichi Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yichi Zhang closed BEAM-9904.
-
Fix Version/s: Not applicable
   Resolution: Abandoned

> python nexmark query pipelines
> --
>
> Key: BEAM-9904
> URL: https://issues.apache.org/jira/browse/BEAM-9904
> Project: Beam
>  Issue Type: Sub-task
>  Components: benchmarking-py, testing-nexmark
>Reporter: Yichi Zhang
>Priority: Major
> Fix For: Not applicable
>
>
> Create python beam pipelines for the data analytic queries in NexMark suite:
>  
> ||query||corresponding java code||
> |*Query 0 (not part of original NexMark): 
> Pass-through.*|[query0.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query0.java]|
> |*Query 1: What are the bid values in Euro's? (Currency 
> Conversion)*|[query1.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query1.java]|
> |*Query 2: Find bids with specific auction ids and show their bid 
> price.*|[query2.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query2.java]|
> |*Query 3: Who is selling in particular US 
> states?*|[query3.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query3.java]|
> |*Query 4: What is the average selling price for each auction 
> category?*|[query4.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query4.java]|
> |*Query 5: Which auctions have seen the most bids in the last 
> period?*|[query5.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query5.java]|
> | | |
>  
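
For illustration, a minimal sketch of what one of these queries could look like as a python pipeline, here Query 2 (find bids with specific auction ids and show their bid price), assuming the event generator yields simple bid dicts; the auction ids and record shape are assumptions, not the actual implementation.

{code:python}
import apache_beam as beam

# Hypothetical auction ids of interest; the real query uses a modulus filter.
WANTED_AUCTIONS = {1000, 1007, 2001, 2019}


def query2(bids):
  """bids: PCollection of dicts like {'auction': int, 'price': int, ...}."""
  return (
      bids
      | 'FilterAuctions' >> beam.Filter(
          lambda bid: bid['auction'] in WANTED_AUCTIONS)
      | 'AuctionPrice' >> beam.Map(
          lambda bid: {'auction': bid['auction'], 'price': bid['price']}))
{code}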



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (BEAM-9905) python nexmark benchmark suite metrics

2020-05-07 Thread Yichi Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yichi Zhang closed BEAM-9905.
-
Fix Version/s: Not applicable
   Resolution: Abandoned

> python nexmark benchmark suite metrics
> --
>
> Key: BEAM-9905
> URL: https://issues.apache.org/jira/browse/BEAM-9905
> Project: Beam
>  Issue Type: Sub-task
>  Components: benchmarking-py, testing-nexmark
>Reporter: Yichi Zhang
>Priority: Major
> Fix For: Not applicable
>
>
> Ensure we can collect output metrics from the query pipelines with the 
> jenkins test infra such as:
> execution time, processing event rate, number of results, also invalid 
> auctions/bids, …
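
As a rough sketch, per-query counters could be exposed through the existing python metrics API and read back from the PipelineResult after the run; the 'nexmark'/'results' counter names are assumptions, and the Jenkins publishing step is not shown.

{code:python}
import apache_beam as beam
from apache_beam.metrics.metric import Metrics, MetricsFilter


class CountResults(beam.DoFn):
  """Counts every result element so the launcher can report it afterwards."""
  def __init__(self):
    self.results = Metrics.counter('nexmark', 'results')  # assumed namespace/name

  def process(self, element):
    self.results.inc()
    yield element


def report_results(result):
  """result: the PipelineResult returned by pipeline.run()."""
  result.wait_until_finish()
  metrics = result.metrics().query(MetricsFilter().with_name('results'))
  for counter in metrics['counters']:
    print(counter.key.metric.namespace, counter.key.metric.name, counter.committed)
{code}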



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (BEAM-9902) python nexmark benchmark suite event generator

2020-05-07 Thread Yichi Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yichi Zhang closed BEAM-9902.
-
Fix Version/s: Not applicable
   Resolution: Abandoned

> python nexmark benchmark suite event generator
> --
>
> Key: BEAM-9902
> URL: https://issues.apache.org/jira/browse/BEAM-9902
> Project: Beam
>  Issue Type: Sub-task
>  Components: benchmarking-py, testing-nexmark
>Reporter: Yichi Zhang
>Priority: Major
> Fix For: Not applicable
>
>
> Implement the event generator in python to create the source which the query 
> pipeline will read from.
> For reference, the java source generators can be found in 
> [https://github.com/apache/beam/tree/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/sources]
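
For illustration, a minimal sketch of a bounded synthetic generator built from a sequence of ids, emitting simplified person/auction/bid dicts in a fixed ratio; the real generator would mirror the java GeneratorConfig (rates, timestamps, out-of-order delivery), which is not modeled here.

{code:python}
import apache_beam as beam


def make_event(i):
  """Maps a sequence number to a simplified synthetic nexmark-style event."""
  if i % 10 == 0:
    return {'kind': 'person', 'id': i, 'name': 'person-%d' % i}
  elif i % 10 < 4:
    return {'kind': 'auction', 'id': i, 'seller': i - i % 10}
  else:
    return {'kind': 'bid', 'auction': i - i % 10, 'price': 100 + i}


def generate_events(pipeline, num_events=100000):
  return (
      pipeline
      | 'EventIds' >> beam.Create(range(num_events))
      | 'ToEvents' >> beam.Map(make_event))
{code}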



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (BEAM-9903) python nexmark benchmark suite launcher

2020-05-07 Thread Yichi Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yichi Zhang closed BEAM-9903.
-
Fix Version/s: Not applicable
   Resolution: Abandoned

> python nexmark benchmark suite launcher
> ---
>
> Key: BEAM-9903
> URL: https://issues.apache.org/jira/browse/BEAM-9903
> Project: Beam
>  Issue Type: Sub-task
>  Components: benchmarking-py, testing-nexmark
>Reporter: Yichi Zhang
>Priority: Major
> Fix For: Not applicable
>
>
> Create the launcher as well as the configurations for the benchmark entry program.
> For reference:
> the java launcher: 
> [https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/NexmarkLauncher.java]
> and its configuration: 
> [https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/NexmarkOptions.java]
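
As a rough sketch, the entry point could parse nexmark-specific flags and pass the remaining arguments through to PipelineOptions, mirroring NexmarkLauncher/NexmarkOptions; the --query/--num_events flag names and the QUERIES registry are assumptions.

{code:python}
import argparse

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Registry mapping a query number to a function that builds the query pipeline.
QUERIES = {
    0: lambda events: events,  # Query 0: pass-through
}


def main(argv=None):
  parser = argparse.ArgumentParser()
  parser.add_argument('--query', type=int, default=0,
                      help='nexmark query number to run')
  parser.add_argument('--num_events', type=int, default=100000)
  known_args, pipeline_args = parser.parse_known_args(argv)

  options = PipelineOptions(pipeline_args)
  with beam.Pipeline(options=options) as p:
    events = p | 'EventIds' >> beam.Create(range(known_args.num_events))
    _ = QUERIES[known_args.query](events) | 'LogResults' >> beam.Map(print)


if __name__ == '__main__':
  main()
{code}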



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (BEAM-9901) Beam python nexmark benchmark suite

2020-05-07 Thread Yichi Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yichi Zhang closed BEAM-9901.
-
Fix Version/s: Not applicable
   Resolution: Resolved

> Beam python nexmark benchmark suite
> ---
>
> Key: BEAM-9901
> URL: https://issues.apache.org/jira/browse/BEAM-9901
> Project: Beam
>  Issue Type: Task
>  Components: benchmarking-py, testing-nexmark
>Reporter: Yichi Zhang
>Priority: Major
> Fix For: Not applicable
>
>
> Nexmark is a suite of queries (pipelines) used to measure performance and 
> non-regression in Beam. Currently it exists only in the java sdk: 
> [https://github.com/apache/beam/tree/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark]
> In this project we would like to create a nexmark benchmark suite in the python 
> sdk, equivalent to what Beam has for java. This allows us to determine the 
> performance impact of pull requests on python pipelines.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (BEAM-9906) Beam python nexmark benchmark starter task

2020-05-07 Thread Yichi Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yichi Zhang closed BEAM-9906.
-
Fix Version/s: Not applicable
   Resolution: Abandoned

> Beam python nexmark benchmark starter task
> --
>
> Key: BEAM-9906
> URL: https://issues.apache.org/jira/browse/BEAM-9906
> Project: Beam
>  Issue Type: Sub-task
>  Components: testing-nexmark
>Reporter: Yichi Zhang
>Priority: Major
> Fix For: Not applicable
>
>
> For this starter task, the assignee is expected to 
> * get familiar with the beam model by browsing through the beam website and 
> programming guide: https://beam.apache.org/ 
> * set up development environment: 
> https://cwiki.apache.org/confluence/display/BEAM/Developer+Guides
> * also get some familiarity with the NexMark suite:
> http://datalab.cs.pdx.edu/niagara/pstream/nexmark.pdf
> * it is also recommended to spend some time to get familiar with the git 
> workflow



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-9901) Beam python nexmark benchmark suite

2020-05-07 Thread Yichi Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-9901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101856#comment-17101856
 ] 

Yichi Zhang commented on BEAM-9901:
---

Ok, I guess we can use BEAM-8258 for the summer intern as well. I'll add 
corresponding dataflow components.

> Beam python nexmark benchmark suite
> ---
>
> Key: BEAM-9901
> URL: https://issues.apache.org/jira/browse/BEAM-9901
> Project: Beam
>  Issue Type: Task
>  Components: benchmarking-py, testing-nexmark
>Reporter: Yichi Zhang
>Priority: Major
>
> Nexmark is a suite of queries (pipelines) used to measure performance and 
> non-regression in Beam. Currently it exists only in the java sdk: 
> [https://github.com/apache/beam/tree/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark]
> In this project we would like to create a nexmark benchmark suite in the python 
> sdk, equivalent to what Beam has for java. This allows us to determine the 
> performance impact of pull requests on python pipelines.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-9905) python nexmark benchmark suite metrics

2020-05-07 Thread Yichi Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-9905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101848#comment-17101848
 ] 

Yichi Zhang commented on BEAM-9905:
---

[~iemejia] It is more about being able to run the queries continuously with the 
test infra; I rephrased the description a little bit.

> python nexmark benchmark suite metrics
> --
>
> Key: BEAM-9905
> URL: https://issues.apache.org/jira/browse/BEAM-9905
> Project: Beam
>  Issue Type: Sub-task
>  Components: benchmarking-py, testing-nexmark
>Reporter: Yichi Zhang
>Priority: Major
>
> Ensure we can collect output metrics from the query pipelines with the 
> jenkins test infra such as:
> execution time, processing event rate, number of results, also invalid 
> auctions/bids, …



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-9905) python nexmark benchmark suite metrics

2020-05-07 Thread Yichi Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yichi Zhang updated BEAM-9905:
--
Description: 
Ensure we can collect output metrics from the query pipelines with the jenkins 
test infra such as:
execution time, processing event rate, number of results, also invalid 
auctions/bids, …

  was:
Ensure we can collect output metrics from the query pipelines such as:
execution time, processing event rate, number of results, also invalid 
auctions/bids, …


> python nexmark benchmark suite metrics
> --
>
> Key: BEAM-9905
> URL: https://issues.apache.org/jira/browse/BEAM-9905
> Project: Beam
>  Issue Type: Sub-task
>  Components: benchmarking-py, testing-nexmark
>Reporter: Yichi Zhang
>Priority: Major
>
> Ensure we can collect output metrics from the query pipelines with the 
> jenkins test infra such as:
> execution time, processing event rate, number of results, also invalid 
> auctions/bids, …



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-9901) Beam python nexmark benchmark suite

2020-05-07 Thread Yichi Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-9901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101845#comment-17101845
 ] 

Yichi Zhang commented on BEAM-9901:
---

[~iemejia] We are creating these tickets for the google summer intern program. I 
think it might be helpful to limit the scope and create these tickets under one 
umbrella ticket to avoid distraction. We can deduplicate the tickets later, 
depending on how much progress has been made. WDYT?

> Beam python nexmark benchmark suite
> ---
>
> Key: BEAM-9901
> URL: https://issues.apache.org/jira/browse/BEAM-9901
> Project: Beam
>  Issue Type: Task
>  Components: benchmarking-py, testing-nexmark
>Reporter: Yichi Zhang
>Priority: Major
>
> Nexmark is a suite of queries (pipelines) used to measure performance and 
> non-regression in Beam. Currently it exists only in the java sdk: 
> [https://github.com/apache/beam/tree/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark]
> In this project we would like to create a nexmark benchmark suite in the python 
> sdk, equivalent to what Beam has for java. This allows us to determine the 
> performance impact of pull requests on python pipelines.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-9906) Beam python nexmark benchmark starter task

2020-05-06 Thread Yichi Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yichi Zhang updated BEAM-9906:
--
Description: 
For this starter task, the assignee is expected to 
* get familiar with the beam model by browsing through the beam website and 
programming guide: https://beam.apache.org/ 

* set up development environment: 
https://cwiki.apache.org/confluence/display/BEAM/Developer+Guides

* also get some familiarity with the NexMark suite:
http://datalab.cs.pdx.edu/niagara/pstream/nexmark.pdf

* it is also recommended to spend some time to get familiar with the git 
workflow

  was:
For this starter task, the assignee is expected to 
* get familiar with beam model by browse through the beam website and 
programming guide: https://beam.apache.org/ 

* set up development environment: 
https://cwiki.apache.org/confluence/display/BEAM/Developer+Guides

* also get some familiarity with the NexMark suite:
http://datalab.cs.pdx.edu/niagara/pstream/nexmark.pdf

* if not familiar with github and git, it is also recommended to spend some 
time to get familiar with the git workflow


> Beam python nexmark benchmark starter task
> ---
>
> Key: BEAM-9906
> URL: https://issues.apache.org/jira/browse/BEAM-9906
> Project: Beam
>  Issue Type: Sub-task
>  Components: testing-nexmark
>Reporter: Yichi Zhang
>Priority: Major
>
> For this starter task, the assignee is expected to 
> * get familiar with the beam model by browsing through the beam website and 
> programming guide: https://beam.apache.org/ 
> * set up development environment: 
> https://cwiki.apache.org/confluence/display/BEAM/Developer+Guides
> * also get some familiarity with the NexMark suite:
> http://datalab.cs.pdx.edu/niagara/pstream/nexmark.pdf
> * it is also recommended to spend some time to get familiar with the git 
> workflow



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (BEAM-9906) Beam python nexmark benchmark starter task

2020-05-06 Thread Yichi Zhang (Jira)
Yichi Zhang created BEAM-9906:
-

 Summary: Beam python nexmark benchmark starter task
 Key: BEAM-9906
 URL: https://issues.apache.org/jira/browse/BEAM-9906
 Project: Beam
  Issue Type: Sub-task
  Components: testing-nexmark
Reporter: Yichi Zhang


For this starter task, the assignee is expected to 
* get familiar with the beam model by browsing through the beam website and 
programming guide: https://beam.apache.org/ 

* set up development environment: 
https://cwiki.apache.org/confluence/display/BEAM/Developer+Guides

* also get some familiarity with the NexMark suite:
http://datalab.cs.pdx.edu/niagara/pstream/nexmark.pdf

* if not familiar with github and git, it is also recommended to spend some 
time to get familiar with the git workflow



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (BEAM-9905) python nexmark benchmark suite metrics

2020-05-06 Thread Yichi Zhang (Jira)
Yichi Zhang created BEAM-9905:
-

 Summary: python nexmark benchmark suite metrics
 Key: BEAM-9905
 URL: https://issues.apache.org/jira/browse/BEAM-9905
 Project: Beam
  Issue Type: Sub-task
  Components: benchmarking-py, testing-nexmark
Reporter: Yichi Zhang


Ensure we can collect output metrics from the query pipelines such as:
execution time, processing event rate, number of results, also invalid 
auctions/bids, …



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-9904) python nexmark query pipelines

2020-05-06 Thread Yichi Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yichi Zhang updated BEAM-9904:
--
Description: 
Create python beam pipelines for the data analytic queries in NexMark suite:

 
||query||corresponding java code||
|*Query 0 (not part of original NexMark): 
Pass-through.*|[query0.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query0.java]|
|*Query 1: What are the bid values in Euro's? (Currency 
Conversion)*|[query1.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query01.java]|
|*Query 2: Find bids with specific auction ids and show their bid 
price.*|[query2.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query02.java]|
|*Query 3: Who is selling in particular US 
states?*|[query3.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query03.java]|
|*Query 4: What is the average selling price for each auction 
category?*|[query4.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query04.java]|
|*Query 5: Which auctions have seen the most bids in the last 
period?*|[query5.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query05.java]|
| | |

 

  was:
 
||query||java code||
|*Query 0 (not part of original NexMark): 
Pass-through.*|[query0.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query0.java]|
|*Query 1: What are the bid values in Euro's? (Currency 
Conversion)*|[query1.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query01.java]|
|*Query 2: Find bids with specific auction ids and show their bid 
price.*|[query2.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query02.java]|
|*Query 3: Who is selling in particular US 
states?*|[query3.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query03.java]|
|*Query 4: What is the average selling price for each auction 
category?*|[query4.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query04.java]|
|*Query 5: Which auctions have seen the most bids in the last 
period?*|[query5.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query05.java]|
| | |

 


> python nexmark query pipelines
> --
>
> Key: BEAM-9904
> URL: https://issues.apache.org/jira/browse/BEAM-9904
> Project: Beam
>  Issue Type: Sub-task
>  Components: benchmarking-py, testing-nexmark
>Reporter: Yichi Zhang
>Priority: Major
>
> Create python beam pipelines for the data analytic queries in NexMark suite:
>  
> ||query||corresponding java code||
> |*Query 0 (not part of original NexMark): 
> Pass-through.*|[query0.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query0.java]|
> |*Query 1: What are the bid values in Euro's? (Currency 
> Conversion)*|[query1.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query01.java]|
> |*Query 2: Find bids with specific auction ids and show their bid 
> price.*|[query2.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query02.java]|
> |*Query 3: Who is selling in particular US 
> states?*|[query3.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query03.java]|
> |*Query 4: What is the average selling price for each auction 
> category?*|[query4.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query04.java]|
> |*Query 5: Which auctions have seen the most bids in the last 
> period?*|[query5.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query05.java]|
> | | |
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-9904) python nexmark query pipelines

2020-05-06 Thread Yichi Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yichi Zhang updated BEAM-9904:
--
Description: 
Create python beam pipelines for the data analytic queries in NexMark suite:

 
||query||corresponding java code||
|*Query 0 (not part of original NexMark): 
Pass-through.*|[query0.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query0.java]|
|*Query 1: What are the bid values in Euro's? (Currency 
Conversion)*|[query1.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query1.java]|
|*Query 2: Find bids with specific auction ids and show their bid 
price.*|[query2.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query2.java]|
|*Query 3: Who is selling in particular US 
states?*|[query3.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query3.java]|
|*Query 4: What is the average selling price for each auction 
category?*|[query4.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query4.java]|
|*Query 5: Which auctions have seen the most bids in the last 
period?*|[query5.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query5.java]|
| | |

 

  was:
Create python beam pipelines for the data analytic queries in NexMark suite:

 
||query||corresponding java code||
|*Query 0 (not part of original NexMark): 
Pass-through.*|[query0.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query0.java]|
|*Query 1: What are the bid values in Euro's? (Currency 
Conversion)*|[query1.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query01.java]|
|*Query 2: Find bids with specific auction ids and show their bid 
price.*|[query2.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query02.java]|
|*Query 3: Who is selling in particular US 
states?*|[query3.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query03.java]|
|*Query 4: What is the average selling price for each auction 
category?*|[query4.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query04.java]|
|*Query 5: Which auctions have seen the most bids in the last 
period?*|[query5.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query05.java]|
| | |

 


> python nexmark query pipelines
> --
>
> Key: BEAM-9904
> URL: https://issues.apache.org/jira/browse/BEAM-9904
> Project: Beam
>  Issue Type: Sub-task
>  Components: benchmarking-py, testing-nexmark
>Reporter: Yichi Zhang
>Priority: Major
>
> Create python beam pipelines for the data analytic queries in NexMark suite:
>  
> ||query||corresponding java code||
> |*Query 0 (not part of original NexMark): 
> Pass-through.*|[query0.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query0.java]|
> |*Query 1: What are the bid values in Euro's? (Currency 
> Conversion)*|[query1.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query1.java]|
> |*Query 2: Find bids with specific auction ids and show their bid 
> price.*|[query2.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query2.java]|
> |*Query 3: Who is selling in particular US 
> states?*|[query3.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query3.java]|
> |*Query 4: What is the average selling price for each auction 
> category?*|[query4.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query4.java]|
> |*Query 5: Which auctions have seen the most bids in the last 
> period?*|[query5.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query5.java]|
> | | |
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-9904) python nexmark query pipelines

2020-05-06 Thread Yichi Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yichi Zhang updated BEAM-9904:
--
Description: 
 
||query||java code||
|*Query 0 (not part of original NexMark): 
Pass-through.*|[query0.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query0.java]|
|*Query 1: What are the bid values in Euro's? (Currency 
Conversion)*|[query1.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query01.java]|
|*Query 2: Find bids with specific auction ids and show their bid 
price.*|[query2.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query02.java]|
|*Query 3: Who is selling in particular US 
states?*|[query3.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query03.java]|
|*Query 4: What is the average selling price for each auction 
category?*|[query4.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query04.java]|
|*Query 5: Which auctions have seen the most bids in the last 
period?*|[query5.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query05.java]|
| | |

 

  was:
 
||query||java code||
|*Query 0 (not part of original NexMark): 
Pass-through.*|[query0.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query0.java]|
|*Query 1: What are the bid values in Euro's? (Currency Conversion)*| 
[query1.java\|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query01.java]|
|*Query 2: Find bids with specific auction ids and show their bid price.*|  
[query2.java\|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query02.java]|
|*Query 3: Who is selling in particular US states?*|  
[query3.java\|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query03.java]|
|*Query 4: What is the average selling price for each auction category?*|  
[query4.java\|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query04.java]|
|*Query 5: Which auctions have seen the most bids in the last period?*|  
[query5.java\|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query05.java]|
| | |

 


> python nexmark query pipelines
> --
>
> Key: BEAM-9904
> URL: https://issues.apache.org/jira/browse/BEAM-9904
> Project: Beam
>  Issue Type: Sub-task
>  Components: benchmarking-py, testing-nexmark
>Reporter: Yichi Zhang
>Priority: Major
>
>  
> ||query||java code||
> |*Query 0 (not part of original NexMark): 
> Pass-through.*|[query0.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query0.java]|
> |*Query 1: What are the bid values in Euro's? (Currency 
> Conversion)*|[query1.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query01.java]|
> |*Query 2: Find bids with specific auction ids and show their bid 
> price.*|[query2.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query02.java]|
> |*Query 3: Who is selling in particular US 
> states?*|[query3.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query03.java]|
> |*Query 4: What is the average selling price for each auction 
> category?*|[query4.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query04.java]|
> |*Query 5: Which auctions have seen the most bids in the last 
> period?*|[query5.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query05.java]|
> | | |
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-9904) python nexmark query pipelines

2020-05-06 Thread Yichi Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yichi Zhang updated BEAM-9904:
--
Description: 
 
||query||java code||
|*Query 0 (not part of original NexMark): 
Pass-through.*|[query0.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query0.java]|
|*Query 1: What are the bid values in Euro's? (Currency Conversion)*| 
[query1.java\|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query01.java]|
|*Query 2: Find bids with specific auction ids and show their bid price.*|  
[query2.java\|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query02.java]|
|*Query 3: Who is selling in particular US states?*|  
[query3.java\|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query03.java]|
|*Query 4: What is the average selling price for each auction category?*|  
[query4.java\|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query04.java]|
|*Query 5: Which auctions have seen the most bids in the last period?*|  
[query5.java\|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query05.java]|
| | |

 

  was:
 
||query||java code||
|*Query 0 (not part of original NexMark): 
Pass-through.*|[query0.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query0.java]|
|*Query 1: What are the bid values in Euro's? (Currency Conversion)*| |
|*Query 2: Find bids with specific auction ids and show their bid price.*| |
|*Query 3: Who is selling in particular US states?*| |
|*Query 4: What is the average selling price for each auction category?*| |
|*Query 5: Which auctions have seen the most bids in the last period?*| |
| | |

 


> python nexmark query pipelines
> --
>
> Key: BEAM-9904
> URL: https://issues.apache.org/jira/browse/BEAM-9904
> Project: Beam
>  Issue Type: Sub-task
>  Components: benchmarking-py, testing-nexmark
>Reporter: Yichi Zhang
>Priority: Major
>
>  
> ||query||java code||
> |*Query 0 (not part of original NexMark): 
> Pass-through.*|[query0.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query0.java]|
> |*Query 1: What are the bid values in Euro's? (Currency Conversion)*| 
> [query1.java\|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query01.java]|
> |*Query 2: Find bids with specific auction ids and show their bid price.*|  
> [query2.java\|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query02.java]|
> |*Query 3: Who is selling in particular US states?*|  
> [query3.java\|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query03.java]|
> |*Query 4: What is the average selling price for each auction category?*|  
> [query4.java\|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query04.java]|
> |*Query 5: Which auctions have seen the most bids in the last period?*|  
> [query5.java\|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query05.java]|
> | | |
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-9904) python nexmark query pipelines

2020-05-06 Thread Yichi Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yichi Zhang updated BEAM-9904:
--
Description: 
 
||query||java code||
|*Query 0 (not part of original NexMark): 
Pass-through.*|[#query0.java\|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query0.java]|
|*Query 1: What are the bid values in Euro's? (Currency Conversion)*| |
|*Query 2: Find bids with specific auction ids and show their bid price.*| |
|*Query 3: Who is selling in particular US states?*| |
|*Query 4: What is the average selling price for each auction category?*| |
|*Query 5: Which auctions have seen the most bids in the last period?*| |
| | |

 

  was:
 
||query||java code||
|*Query 0 (not part of original NexMark): 
Pass-through.*|[query0.java\|[https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query0.java]]|
|*Query 1: What are the bid values in Euro's? (Currency Conversion)*| |
|*Query 2: Find bids with specific auction ids and show their bid price.*| |
|*Query 3: Who is selling in particular US states?*| |
|*Query 4: What is the average selling price for each auction category?*| |
|*Query 5: Which auctions have seen the most bids in the last period?*| |
| | |

 


> python nexmark query pipelines
> --
>
> Key: BEAM-9904
> URL: https://issues.apache.org/jira/browse/BEAM-9904
> Project: Beam
>  Issue Type: Sub-task
>  Components: benchmarking-py, testing-nexmark
>Reporter: Yichi Zhang
>Priority: Major
>
>  
> ||query||java code||
> |*Query 0 (not part of original NexMark): 
> Pass-through.*|[#query0.java\|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query0.java]|
> |*Query 1: What are the bid values in Euro's? (Currency Conversion)*| |
> |*Query 2: Find bids with specific auction ids and show their bid price.*| |
> |*Query 3: Who is selling in particular US states?*| |
> |*Query 4: What is the average selling price for each auction category?*| |
> |*Query 5: Which auctions have seen the most bids in the last period?*| |
> | | |
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-9904) python nexmark query pipelines

2020-05-06 Thread Yichi Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yichi Zhang updated BEAM-9904:
--
Description: 
 
||query||java code||
|*Query 0 (not part of original NexMark): 
Pass-through.*|[query0.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query0.java]|
|*Query 1: What are the bid values in Euro's? (Currency Conversion)*| |
|*Query 2: Find bids with specific auction ids and show their bid price.*| |
|*Query 3: Who is selling in particular US states?*| |
|*Query 4: What is the average selling price for each auction category?*| |
|*Query 5: Which auctions have seen the most bids in the last period?*| |
| | |

 

  was:
 
||query||java code||
|*Query 0 (not part of original NexMark): 
Pass-through.*|[#query0.java\|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query0.java]|
|*Query 1: What are the bid values in Euro's? (Currency Conversion)*| |
|*Query 2: Find bids with specific auction ids and show their bid price.*| |
|*Query 3: Who is selling in particular US states?*| |
|*Query 4: What is the average selling price for each auction category?*| |
|*Query 5: Which auctions have seen the most bids in the last period?*| |
| | |

 


> python nexmark query pipelines
> --
>
> Key: BEAM-9904
> URL: https://issues.apache.org/jira/browse/BEAM-9904
> Project: Beam
>  Issue Type: Sub-task
>  Components: benchmarking-py, testing-nexmark
>Reporter: Yichi Zhang
>Priority: Major
>
>  
> ||query||java code||
> |*Query 0 (not part of original NexMark): 
> Pass-through.*|[query0.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query0.java]|
> |*Query 1: What are the bid values in Euro's? (Currency Conversion)*| |
> |*Query 2: Find bids with specific auction ids and show their bid price.*| |
> |*Query 3: Who is selling in particular US states?*| |
> |*Query 4: What is the average selling price for each auction category?*| |
> |*Query 5: Which auctions have seen the most bids in the last period?*| |
> | | |
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (BEAM-9904) python nexmark query pipelines

2020-05-06 Thread Yichi Zhang (Jira)
Yichi Zhang created BEAM-9904:
-

 Summary: python nexmark query pipelines
 Key: BEAM-9904
 URL: https://issues.apache.org/jira/browse/BEAM-9904
 Project: Beam
  Issue Type: Sub-task
  Components: benchmarking-py, testing-nexmark
Reporter: Yichi Zhang


 
||query||java code||
|*Query 0 (not part of original NexMark): 
Pass-through.*|[query0.java\|[https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query0.java]]|
|*Query 1: What are the bid values in Euro's? (Currency Conversion)*| |
|*Query 2: Find bids with specific auction ids and show their bid price.*| |
|*Query 3: Who is selling in particular US states?*| |
|*Query 4: What is the average selling price for each auction category?*| |
|*Query 5: Which auctions have seen the most bids in the last period?*| |
| | |

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-9901) Beam python nexmark benchmark suite

2020-05-06 Thread Yichi Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yichi Zhang updated BEAM-9901:
--
Component/s: (was: sdk-py-core)

> Beam python nexmark benchmark suite
> ---
>
> Key: BEAM-9901
> URL: https://issues.apache.org/jira/browse/BEAM-9901
> Project: Beam
>  Issue Type: Task
>  Components: benchmarking-py, testing-nexmark
>Reporter: Yichi Zhang
>Priority: Major
>
> Nexmark is a suite of queries (pipelines) used to measure performance and 
> non-regression in Beam. Currently it exists only in the java sdk: 
> [https://github.com/apache/beam/tree/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark]
> In this project we would like to create a nexmark benchmark suite in the python 
> sdk, equivalent to what Beam has for java. This allows us to determine the 
> performance impact of pull requests on python pipelines.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-9902) python nexmark benchmark suite event generator

2020-05-06 Thread Yichi Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yichi Zhang updated BEAM-9902:
--
Component/s: (was: sdk-py-core)

> python nexmark benchmark suite event generator
> --
>
> Key: BEAM-9902
> URL: https://issues.apache.org/jira/browse/BEAM-9902
> Project: Beam
>  Issue Type: Sub-task
>  Components: benchmarking-py, testing-nexmark
>Reporter: Yichi Zhang
>Priority: Major
>
> Implement the event generator in python to create the source which the query 
> pipeline will read from.
> For reference, the java source generators can be found in 
> [https://github.com/apache/beam/tree/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/sources]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (BEAM-9903) python nexmark benchmark suite launcher

2020-05-06 Thread Yichi Zhang (Jira)
Yichi Zhang created BEAM-9903:
-

 Summary: python nexmark benchmark suite launcher
 Key: BEAM-9903
 URL: https://issues.apache.org/jira/browse/BEAM-9903
 Project: Beam
  Issue Type: Sub-task
  Components: benchmarking-py, testing-nexmark
Reporter: Yichi Zhang


Create the launcher as well as the configurations for the benchmark entry program.

For reference:

the java launcher: 
[https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/NexmarkLauncher.java]

and its configuration: 
[https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/NexmarkOptions.java]
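
A minimal sketch, assuming a NexmarkOptions-style options class and a run() entry point for the Python launcher; the flag names below mirror the Java NexmarkOptions but are illustrative assumptions only:

{code:python}
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class NexmarkOptions(PipelineOptions):
  # Hypothetical options class mirroring the Java NexmarkOptions.

  @classmethod
  def _add_argparse_args(cls, parser):
    parser.add_argument('--query', type=int, help='Nexmark query number to run.')
    parser.add_argument(
        '--num_events', type=int, default=100000,
        help='Number of events to generate for the benchmark.')


def run(argv=None):
  options = PipelineOptions(argv)
  nexmark_options = options.view_as(NexmarkOptions)
  with beam.Pipeline(options=options) as p:
    events = p | 'GenerateEvents' >> beam.Create(range(nexmark_options.num_events))
    # Dispatch to the query pipeline selected by nexmark_options.query here
    # (omitted in this sketch).
    _ = events
{code}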



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (BEAM-9902) python nexmark benchmark suite event generator

2020-05-06 Thread Yichi Zhang (Jira)
Yichi Zhang created BEAM-9902:
-

 Summary: python nexmark benchmark suite event generator
 Key: BEAM-9902
 URL: https://issues.apache.org/jira/browse/BEAM-9902
 Project: Beam
  Issue Type: Sub-task
  Components: benchmarking-py, sdk-py-core, testing-nexmark
Reporter: Yichi Zhang


Implement the event generator in python to create the source which the query 
pipeline will read from.

For reference, the java source generators can be found in 
[https://github.com/apache/beam/tree/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/sources]
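
A minimal sketch of what a Python generator for one event type could look like; the Bid fields and rate below are assumptions, and the real implementation should mirror the Java sources referenced above:

{code:python}
from collections import namedtuple

# Hypothetical Bid event; the full generator would also emit Person and
# Auction events.
Bid = namedtuple('Bid', ['auction', 'bidder', 'price', 'timestamp'])


def generate_bids(num_events, first_timestamp_ms=0, interval_ms=10):
  # Yields deterministic Bid events, one every interval_ms milliseconds.
  for i in range(num_events):
    yield Bid(
        auction=i % 100,
        bidder=i % 1000,
        price=100 + i,
        timestamp=first_timestamp_ms + i * interval_ms)
{code}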



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (BEAM-9901) Beam python nexmark benchmark suite

2020-05-06 Thread Yichi Zhang (Jira)
Yichi Zhang created BEAM-9901:
-

 Summary: Beam python nexmark benchmark suite
 Key: BEAM-9901
 URL: https://issues.apache.org/jira/browse/BEAM-9901
 Project: Beam
  Issue Type: Task
  Components: benchmarking-py, sdk-py-core, testing-nexmark
Reporter: Yichi Zhang


Nexmark is a suite of queries (pipelines) used to measure performance and 
non-regression in Beam. Currently it exists in java sdk: 
[https://github.com/apache/beam/tree/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark]

In this project we would like to create the nexmark benchmark suite in the python 
sdk, equivalent to what BEAM has for java. This allows us to determine the 
performance impact of pull requests on python pipelines.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (BEAM-9727) Auto populate required feature experiment flags for enable dataflow runner v2

2020-04-08 Thread Yichi Zhang (Jira)
Yichi Zhang created BEAM-9727:
-

 Summary: Auto populate required feature experiment flags for 
enable dataflow runner v2
 Key: BEAM-9727
 URL: https://issues.apache.org/jira/browse/BEAM-9727
 Project: Beam
  Issue Type: Task
  Components: runner-dataflow
Reporter: Yichi Zhang
Assignee: Yichi Zhang
 Fix For: 2.21.0






--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-8944) Python SDK harness performance degradation with UnboundedThreadPoolExecutor

2020-03-31 Thread Yichi Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-8944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071998#comment-17071998
 ] 

Yichi Zhang commented on BEAM-8944:
---

[~mxm] does the lower checkpoint duration mean that without 
UnboundedThreadPoolExecutor FlinkRunner is faster?

> Python SDK harness performance degradation with UnboundedThreadPoolExecutor
> ---
>
> Key: BEAM-8944
> URL: https://issues.apache.org/jira/browse/BEAM-8944
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-harness
>Affects Versions: 2.18.0
>Reporter: Yichi Zhang
>Priority: Critical
> Fix For: 2.18.0
>
> Attachments: profiling.png, profiling_one_thread.png, 
> profiling_twelve_threads.png
>
>  Time Spent: 4h 20m
>  Remaining Estimate: 0h
>
> We are seeing a performance degradation for python streaming word count load 
> tests.
>  
> After some investigation, it appears to be caused by swapping the original 
> ThreadPoolExecutor to UnboundedThreadPoolExecutor in sdk worker. Suspicion is 
> that python performance is worse with more threads on cpu-bounded tasks.
>  
> A simple test for comparing the multiple thread pool executor performance:
>  
> {code:python}
> def test_performance(self):
>    def run_perf(executor):
>      total_number = 100
>      q = queue.Queue()
>     def task(number):
>        hash(number)
>        q.put(number + 200)
>        return number
>     t = time.time()
>      count = 0
>      for i in range(200):
>        q.put(i)
>     while count < total_number:
>        executor.submit(task, q.get(block=True))
>        count += 1
>      print('%s uses %s' % (executor, time.time() - t))
>    with UnboundedThreadPoolExecutor() as executor:
>      run_perf(executor)
>    with futures.ThreadPoolExecutor(max_workers=1) as executor:
>      run_perf(executor)
>    with futures.ThreadPoolExecutor(max_workers=12) as executor:
>      run_perf(executor)
> {code}
> Results:
> UnboundedThreadPoolExecutor (object at 0x7fab400dbe50) uses 268.160675049
> ThreadPoolExecutor(max_workers=1) uses 79.904583931
> ThreadPoolExecutor(max_workers=12) uses 191.179054976
> Profiling:
> UnboundedThreadPoolExecutor:
>  !profiling.png! 
> 1 Thread ThreadPoolExecutor:
>  !profiling_one_thread.png! 
> 12 Threads ThreadPoolExecutor:
>  !profiling_twelve_threads.png! 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-9649) beam_python_mongoio_load_test started failing due to mismatched results

2020-03-31 Thread Yichi Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-9649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071971#comment-17071971
 ] 

Yichi Zhang commented on BEAM-9649:
---

cc: +[~chamikara] fyi

Not particularly sure when it started, or whether it is a test setup issue or a 
bug; when the test was first set up it always passed. I'll try to take a look when 
I have free cycles.

> beam_python_mongoio_load_test started failing due to mismatched results
> ---
>
> Key: BEAM-9649
> URL: https://issues.apache.org/jira/browse/BEAM-9649
> Project: Beam
>  Issue Type: Task
>  Components: io-py-mongodb
>Reporter: Yichi Zhang
>Assignee: Yichi Zhang
>Priority: Critical
> Attachments: j5vwSDNmTBK.png, mHP2wb3rdTG.png
>
>
> The load tests fail sometimes with a mismatched sum result for example
> [https://builds.apache.org/job/beam_python_mongoio_load_test/438/console]
> Seems sometimes the Read operation is not able to fetch all the data.
> !j5vwSDNmTBK.png|width=1005,height=752!
> !mHP2wb3rdTG.png|width=994,height=780!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-9649) beam_python_mongoio_load_test started failing due to mismatched results

2020-03-31 Thread Yichi Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yichi Zhang updated BEAM-9649:
--
Priority: Critical  (was: Minor)

> beam_python_mongoio_load_test started failing due to mismatched results
> ---
>
> Key: BEAM-9649
> URL: https://issues.apache.org/jira/browse/BEAM-9649
> Project: Beam
>  Issue Type: Task
>  Components: io-py-mongodb
>Reporter: Yichi Zhang
>Assignee: Yichi Zhang
>Priority: Critical
> Attachments: j5vwSDNmTBK.png, mHP2wb3rdTG.png
>
>
> The load tests fail sometimes with a mismatched sum result for example
> [https://builds.apache.org/job/beam_python_mongoio_load_test/438/console]
> Seems sometimes the Read operation is not able to fetch all the data.
> !j5vwSDNmTBK.png|width=1005,height=752!
> !mHP2wb3rdTG.png|width=994,height=780!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-9649) beam_python_mongoio_load_test started failing due to mismatched results

2020-03-31 Thread Yichi Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yichi Zhang updated BEAM-9649:
--
Attachment: mHP2wb3rdTG.png

> beam_python_mongoio_load_test started failing due to mismatched results
> ---
>
> Key: BEAM-9649
> URL: https://issues.apache.org/jira/browse/BEAM-9649
> Project: Beam
>  Issue Type: Task
>  Components: io-py-mongodb
>Reporter: Yichi Zhang
>Assignee: Yichi Zhang
>Priority: Minor
> Attachments: j5vwSDNmTBK.png, mHP2wb3rdTG.png
>
>
> The load tests fail sometimes with a mismatched sum result for example
> [https://builds.apache.org/job/beam_python_mongoio_load_test/438/console]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-9649) beam_python_mongoio_load_test started failing due to mismatched results

2020-03-31 Thread Yichi Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yichi Zhang updated BEAM-9649:
--
Description: 
The load tests fail sometimes with a mismatched sum result for example

[https://builds.apache.org/job/beam_python_mongoio_load_test/438/console]

Seems sometimes the Read operation is not able to fetch all the data.



!j5vwSDNmTBK.png|width=1005,height=752!

!mHP2wb3rdTG.png|width=994,height=780!

  was:
The load tests fail sometimes with a mismatched sum result for example

[https://builds.apache.org/job/beam_python_mongoio_load_test/438/console]


> beam_python_mongoio_load_test started failing due to mismatched results
> ---
>
> Key: BEAM-9649
> URL: https://issues.apache.org/jira/browse/BEAM-9649
> Project: Beam
>  Issue Type: Task
>  Components: io-py-mongodb
>Reporter: Yichi Zhang
>Assignee: Yichi Zhang
>Priority: Minor
> Attachments: j5vwSDNmTBK.png, mHP2wb3rdTG.png
>
>
> The load tests fail sometimes with a mismatched sum result for example
> [https://builds.apache.org/job/beam_python_mongoio_load_test/438/console]
> Seems sometimes the Read operation is not able to fetch all the data.
> !j5vwSDNmTBK.png|width=1005,height=752!
> !mHP2wb3rdTG.png|width=994,height=780!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-9649) beam_python_mongoio_load_test started failing due to mismatched results

2020-03-31 Thread Yichi Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yichi Zhang updated BEAM-9649:
--
Attachment: j5vwSDNmTBK.png

> beam_python_mongoio_load_test started failing due to mismatched results
> ---
>
> Key: BEAM-9649
> URL: https://issues.apache.org/jira/browse/BEAM-9649
> Project: Beam
>  Issue Type: Task
>  Components: io-py-mongodb
>Reporter: Yichi Zhang
>Assignee: Yichi Zhang
>Priority: Minor
> Attachments: j5vwSDNmTBK.png, mHP2wb3rdTG.png
>
>
> The load tests fail sometimes with a mismatched sum result for example
> [https://builds.apache.org/job/beam_python_mongoio_load_test/438/console]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-9649) beam_python_mongoio_load_test started failing due to mismatched results

2020-03-31 Thread Yichi Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yichi Zhang updated BEAM-9649:
--
Summary: beam_python_mongoio_load_test started failing due to mismatched 
results  (was: beam_python_mongoio_load_test started failing due to one 
mismatched results)

> beam_python_mongoio_load_test started failing due to mismatched results
> ---
>
> Key: BEAM-9649
> URL: https://issues.apache.org/jira/browse/BEAM-9649
> Project: Beam
>  Issue Type: Task
>  Components: io-py-mongodb
>Reporter: Yichi Zhang
>Assignee: Yichi Zhang
>Priority: Minor
>
> The load tests fail sometimes with a mismatched sum result for example
> [https://builds.apache.org/job/beam_python_mongoio_load_test/438/console]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-9649) beam_python_mongoio_load_test started failing due to one mismatched results

2020-03-31 Thread Yichi Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yichi Zhang updated BEAM-9649:
--
Description: 
The load tests fail sometimes with a mismatched sum result for example

[https://builds.apache.org/job/beam_python_mongoio_load_test/438/console]

  was:
The load tests fail sometimes with a mismatched result for example

[https://builds.apache.org/job/beam_python_mongoio_load_test/438/console]

in failed runs it always fails with
missing elements [4950]
and the extra unexpected random element


> beam_python_mongoio_load_test started failing due to one mismatched results
> ---
>
> Key: BEAM-9649
> URL: https://issues.apache.org/jira/browse/BEAM-9649
> Project: Beam
>  Issue Type: Task
>  Components: io-py-mongodb
>Reporter: Yichi Zhang
>Assignee: Yichi Zhang
>Priority: Minor
>
> The load tests fail sometimes with a mismatched sum result for example
> [https://builds.apache.org/job/beam_python_mongoio_load_test/438/console]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (BEAM-9649) beam_python_mongoio_load_test started failing due to one mismatched results

2020-03-31 Thread Yichi Zhang (Jira)
Yichi Zhang created BEAM-9649:
-

 Summary: beam_python_mongoio_load_test started failing due to one 
mismatched results
 Key: BEAM-9649
 URL: https://issues.apache.org/jira/browse/BEAM-9649
 Project: Beam
  Issue Type: Task
  Components: io-py-mongodb
Reporter: Yichi Zhang
Assignee: Yichi Zhang


The load tests fail sometimes with a mismatched result for example

[https://builds.apache.org/job/beam_python_mongoio_load_test/438/console]

In failed runs it always fails with missing elements [4950] and an extra 
unexpected random element.
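
For context, a minimal sketch of the kind of check the load test presumably performs (an assumption, based on 4950 being sum(range(100))): write documents with values 0..99, read them back, and assert on the global sum.

{code:python}
import apache_beam as beam
from apache_beam.testing.util import assert_that, equal_to


def check_read_back(values):
  # values is the PCollection of integers read back from MongoDB.
  total = values | 'SumValues' >> beam.CombineGlobally(sum)
  assert_that(total, equal_to([sum(range(100))]))  # sum(range(100)) == 4950
{code}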



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (BEAM-9607) _SDFBoundedSourceWrapper should expose underlying source display_data

2020-03-25 Thread Yichi Zhang (Jira)
Yichi Zhang created BEAM-9607:
-

 Summary: _SDFBoundedSourceWrapper should expose underlying source 
display_data
 Key: BEAM-9607
 URL: https://issues.apache.org/jira/browse/BEAM-9607
 Project: Beam
  Issue Type: Task
  Components: io-py-gcp, sdk-py-core
Reporter: Yichi Zhang
Assignee: Boyuan Zhang


It seems that the _SDFBoundedSourceWrapper will hide the display data added to 
the underlying source. We should try to expose that data if it exists.
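
A minimal sketch (not the actual fix) of how the wrapper could forward the wrapped source's display data; the internal attribute name is an assumption:

{code:python}
from apache_beam.transforms.display import DisplayDataItem


class SourceWrapperDisplayDataSketch(object):
  # Hypothetical wrapper illustrating the forwarding; the real
  # _SDFBoundedSourceWrapper internals may differ.

  def __init__(self, source):
    self._source = source

  def display_data(self):
    items = {'source': DisplayDataItem(type(self._source).__name__, label='Source')}
    # Surface whatever the underlying source already exposes.
    items.update(self._source.display_data())
    return items
{code}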



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (BEAM-9336) beam_PostCommit_Py_ValCont tests timeout

2020-02-21 Thread Yichi Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yichi Zhang closed BEAM-9336.
-
Fix Version/s: Not applicable
   Resolution: Duplicate

>  beam_PostCommit_Py_ValCont tests timeout
> -
>
> Key: BEAM-9336
> URL: https://issues.apache.org/jira/browse/BEAM-9336
> Project: Beam
>  Issue Type: Bug
>  Components: test-failures
>Reporter: Yichi Zhang
>Priority: Minor
>  Labels: currently-failing
> Fix For: Not applicable
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
>  
>  * [[https://builds.apache.org/job/beam_PostCommit_Py_ValCont/]]
> Initial investigation:
> The tests seem to fail due to the pytest global timeout.
> 
> _After you've filled out the above details, please [assign the issue to an 
> individual|https://beam.apache.org/contribute/postcommits-guides/index.html#find_specialist].
>  Assignee should [treat test failures as 
> high-priority|https://beam.apache.org/contribute/postcommits-policies/#assigned-failing-test],
>  helping to fix the issue or find a more appropriate owner. See [Apache Beam 
> Post-Commit 
> Policies|https://beam.apache.org/contribute/postcommits-policies]._



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (BEAM-9336) beam_PostCommit_Py_ValCont tests timeout

2020-02-19 Thread Yichi Zhang (Jira)
Yichi Zhang created BEAM-9336:
-

 Summary:  beam_PostCommit_Py_ValCont tests timeout
 Key: BEAM-9336
 URL: https://issues.apache.org/jira/browse/BEAM-9336
 Project: Beam
  Issue Type: Bug
  Components: test-failures
Reporter: Yichi Zhang


_Use this form to file an issue for test failure:_
 * [[https://builds.apache.org/job/beam_PostCommit_Py_ValCont/]]

Initial investigation:

(Add any investigation notes so far)

_After you've filled out the above details, please [assign the issue to an 
individual|https://beam.apache.org/contribute/postcommits-guides/index.html#find_specialist].
 Assignee should [treat test failures as 
high-priority|https://beam.apache.org/contribute/postcommits-policies/#assigned-failing-test],
 helping to fix the issue or find a more appropriate owner. See [Apache Beam 
Post-Commit Policies|https://beam.apache.org/contribute/postcommits-policies]._



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-9336) beam_PostCommit_Py_ValCont tests timeout

2020-02-19 Thread Yichi Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yichi Zhang updated BEAM-9336:
--
Description: 
 
 * [[https://builds.apache.org/job/beam_PostCommit_Py_ValCont/]]

Initial investigation:

The tests seem to fail due to the pytest global timeout.

_After you've filled out the above details, please [assign the issue to an 
individual|https://beam.apache.org/contribute/postcommits-guides/index.html#find_specialist].
 Assignee should [treat test failures as 
high-priority|https://beam.apache.org/contribute/postcommits-policies/#assigned-failing-test],
 helping to fix the issue or find a more appropriate owner. See [Apache Beam 
Post-Commit Policies|https://beam.apache.org/contribute/postcommits-policies]._

  was:
_Use this form to file an issue for test failure:_
 * [[https://builds.apache.org/job/beam_PostCommit_Py_ValCont/]]

Initial investigation:

(Add any investigation notes so far)

_After you've filled out the above details, please [assign the issue to an 
individual|https://beam.apache.org/contribute/postcommits-guides/index.html#find_specialist].
 Assignee should [treat test failures as 
high-priority|https://beam.apache.org/contribute/postcommits-policies/#assigned-failing-test],
 helping to fix the issue or find a more appropriate owner. See [Apache Beam 
Post-Commit Policies|https://beam.apache.org/contribute/postcommits-policies]._


>  beam_PostCommit_Py_ValCont tests timeout
> -
>
> Key: BEAM-9336
> URL: https://issues.apache.org/jira/browse/BEAM-9336
> Project: Beam
>  Issue Type: Bug
>  Components: test-failures
>Reporter: Yichi Zhang
>Priority: Minor
>  Labels: currently-failing
>
>  
>  * [[https://builds.apache.org/job/beam_PostCommit_Py_ValCont/]]
> Initial investigation:
> The tests seem to fail due to the pytest global timeout.
> 
> _After you've filled out the above details, please [assign the issue to an 
> individual|https://beam.apache.org/contribute/postcommits-guides/index.html#find_specialist].
>  Assignee should [treat test failures as 
> high-priority|https://beam.apache.org/contribute/postcommits-policies/#assigned-failing-test],
>  helping to fix the issue or find a more appropriate owner. See [Apache Beam 
> Post-Commit 
> Policies|https://beam.apache.org/contribute/postcommits-policies]._



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Closed] (BEAM-9334) beam_PreCommit_Java_Cron failed on org.apache.beam.runners.samza.runtime.SamzaTimerInternalsFactoryTest.testProcessingTimeTimers

2020-02-18 Thread Yichi Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yichi Zhang closed BEAM-9334.
-
Fix Version/s: Not applicable
   Resolution: Duplicate

> beam_PreCommit_Java_Cron failed on 
> org.apache.beam.runners.samza.runtime.SamzaTimerInternalsFactoryTest.testProcessingTimeTimers
> 
>
> Key: BEAM-9334
> URL: https://issues.apache.org/jira/browse/BEAM-9334
> Project: Beam
>  Issue Type: Task
>  Components: runner-samza
>Reporter: Yichi Zhang
>Priority: Major
> Fix For: Not applicable
>
>
> h3. Error Message
> org.apache.samza.SamzaException: Error opening RocksDB store beamStore at 
> location /tmp/store3
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (BEAM-9334) beam_PreCommit_Java_Cron failed on org.apache.beam.runners.samza.runtime.SamzaTimerInternalsFactoryTest.testProcessingTimeTimers

2020-02-18 Thread Yichi Zhang (Jira)
Yichi Zhang created BEAM-9334:
-

 Summary: beam_PreCommit_Java_Cron failed on 
org.apache.beam.runners.samza.runtime.SamzaTimerInternalsFactoryTest.testProcessingTimeTimers
 Key: BEAM-9334
 URL: https://issues.apache.org/jira/browse/BEAM-9334
 Project: Beam
  Issue Type: Task
  Components: runner-samza
Reporter: Yichi Zhang


h3. Error Message

org.apache.samza.SamzaException: Error opening RocksDB store beamStore at 
location /tmp/store3

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (BEAM-9263) Bump python sdk fnapi version to enable status reporting

2020-02-06 Thread Yichi Zhang (Jira)
Yichi Zhang created BEAM-9263:
-

 Summary: Bump python sdk fnapi version to enable status reporting
 Key: BEAM-9263
 URL: https://issues.apache.org/jira/browse/BEAM-9263
 Project: Beam
  Issue Type: Task
  Components: sdk-py-core
Affects Versions: 2.20.0
Reporter: Yichi Zhang
Assignee: Yichi Zhang


Bump python sdk fn api environment version to 8 to roll out the status feature 
for sdk harness status reporting.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (BEAM-8614) Expose SDK harness status to Runner through FnApi

2020-02-06 Thread Yichi Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yichi Zhang resolved BEAM-8614.
---
Fix Version/s: 2.20.0
   Resolution: Fixed

> Expose SDK harness status to Runner through FnApi
> -
>
> Key: BEAM-8614
> URL: https://issues.apache.org/jira/browse/BEAM-8614
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-java-harness, sdk-py-harness
>Reporter: Yichi Zhang
>Assignee: Yichi Zhang
>Priority: Major
> Fix For: 2.20.0
>
>
> Expose SDK harness debug infomation to runner for better debuggability of SDK 
> harness running with beam fn api.
>  
> doc: 
> [https://docs.google.com/document/d/1W77buQtdSEIPUKd9zemAM38fb-x3CvOoaTF4P2mSxmI/edit#heading=h.mersh3vo53ar]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (BEAM-9122) Add uses_keyed_state step property to python dataflow runner

2020-02-06 Thread Yichi Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yichi Zhang resolved BEAM-9122.
---
Fix Version/s: 2.19.0
   Resolution: Fixed

> Add uses_keyed_state step property to python dataflow runner
> 
>
> Key: BEAM-9122
> URL: https://issues.apache.org/jira/browse/BEAM-9122
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-py-core
>Reporter: Yichi Zhang
>Assignee: Yichi Zhang
>Priority: Major
> Fix For: 2.19.0
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> Add an additional step property to the dataflow job properties when a DoFn is 
> stateful in the python sdk, so that the backend runner can recognize stateful steps.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (BEAM-8626) Implement status api handler in python sdk harness

2020-02-06 Thread Yichi Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yichi Zhang resolved BEAM-8626.
---
Fix Version/s: 2.20.0
   Resolution: Fixed

> Implement status api handler in python sdk harness
> --
>
> Key: BEAM-8626
> URL: https://issues.apache.org/jira/browse/BEAM-8626
> Project: Beam
>  Issue Type: Sub-task
>  Components: sdk-py-harness
>Reporter: Yichi Zhang
>Assignee: Yichi Zhang
>Priority: Major
> Fix For: 2.20.0
>
>  Time Spent: 7.5h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (BEAM-8625) Implement servlet in Dataflow runner for sdk status query endpoint

2020-02-06 Thread Yichi Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yichi Zhang resolved BEAM-8625.
---
Fix Version/s: 2.20.0
   Resolution: Fixed

> Implement servlet in Dataflow runner for sdk status query endpoint
> --
>
> Key: BEAM-8625
> URL: https://issues.apache.org/jira/browse/BEAM-8625
> Project: Beam
>  Issue Type: Sub-task
>  Components: runner-dataflow
>Reporter: Yichi Zhang
>Assignee: Yichi Zhang
>Priority: Major
> Fix For: 2.20.0
>
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (BEAM-9165) Clean up StatusServer from python sdk harness when status reporting over fnapi is enabled

2020-01-21 Thread Yichi Zhang (Jira)
Yichi Zhang created BEAM-9165:
-

 Summary: Clean up StatusServer from python sdk harness when status 
reporting over fnapi is enabled
 Key: BEAM-9165
 URL: https://issues.apache.org/jira/browse/BEAM-9165
 Project: Beam
  Issue Type: Improvement
  Components: sdk-py-harness
Reporter: Yichi Zhang


When the SDK Harness reports status over fnapi, the runner can expose individual 
SDK harness status, so we probably won't need to embed a separate http server in 
the python SDK harness anymore.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (BEAM-9122) Add uses_keyed_state step property to python dataflow runner

2020-01-14 Thread Yichi Zhang (Jira)
Yichi Zhang created BEAM-9122:
-

 Summary: Add uses_keyed_state step property to python dataflow 
runner
 Key: BEAM-9122
 URL: https://issues.apache.org/jira/browse/BEAM-9122
 Project: Beam
  Issue Type: Improvement
  Components: sdk-py-core
Reporter: Yichi Zhang


Add an additional step property to the dataflow job properties when a DoFn is 
stateful in the python sdk, so that the backend runner can recognize stateful steps.
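
For reference, a minimal sketch of the kind of stateful DoFn whose state spec should cause the property to be attached (the DoFn itself is illustrative only):

{code:python}
import apache_beam as beam
from apache_beam.coders import VarIntCoder
from apache_beam.transforms.userstate import CombiningValueStateSpec


class CountPerKeyDoFn(beam.DoFn):
  # Declaring a StateSpec is what marks this DoFn as stateful.
  COUNT_STATE = CombiningValueStateSpec('count', VarIntCoder(), sum)

  def process(self, element, count=beam.DoFn.StateParam(COUNT_STATE)):
    key, _ = element
    count.add(1)
    yield key, count.read()
{code}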



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (BEAM-9122) Add uses_keyed_state step property to python dataflow runner

2020-01-14 Thread Yichi Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-9122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yichi Zhang reassigned BEAM-9122:
-

Assignee: Yichi Zhang

> Add uses_keyed_state step property to python dataflow runner
> 
>
> Key: BEAM-9122
> URL: https://issues.apache.org/jira/browse/BEAM-9122
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-py-core
>Reporter: Yichi Zhang
>Assignee: Yichi Zhang
>Priority: Major
>
> Add an additional step property to the dataflow job properties when a DoFn is 
> stateful in the python sdk, so that the backend runner can recognize stateful steps.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (BEAM-8624) Implement FnService for status api in Dataflow runner

2020-01-14 Thread Yichi Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yichi Zhang resolved BEAM-8624.
---
Fix Version/s: 2.19.0
   Resolution: Fixed

> Implement FnService for status api in Dataflow runner
> -
>
> Key: BEAM-8624
> URL: https://issues.apache.org/jira/browse/BEAM-8624
> Project: Beam
>  Issue Type: Sub-task
>  Components: runner-dataflow
>Reporter: Yichi Zhang
>Assignee: Yichi Zhang
>Priority: Major
> Fix For: 2.19.0
>
>  Time Spent: 16h
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (BEAM-8824) Add support for allowed lateness in python sdk

2020-01-14 Thread Yichi Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yichi Zhang resolved BEAM-8824.
---
Fix Version/s: 2.18.0
   Resolution: Fixed

> Add support for allowed lateness in python sdk
> --
>
> Key: BEAM-8824
> URL: https://issues.apache.org/jira/browse/BEAM-8824
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-py-core
>Reporter: Yichi Zhang
>Assignee: Yichi Zhang
>Priority: Major
> Fix For: 2.18.0
>
>  Time Spent: 9h 40m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (BEAM-8736) Support window allowed lateness in python sdk

2020-01-14 Thread Yichi Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yichi Zhang resolved BEAM-8736.
---
Fix Version/s: 2.18.0
   Resolution: Fixed

> Support window allowed lateness in python sdk
> -
>
> Key: BEAM-8736
> URL: https://issues.apache.org/jira/browse/BEAM-8736
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-py-core
>Reporter: Yichi Zhang
>Assignee: Yichi Zhang
>Priority: Major
> Fix For: 2.18.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (BEAM-8944) Python SDK harness performance degradation with UnboundedThreadPoolExecutor

2019-12-26 Thread Yichi Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yichi Zhang resolved BEAM-8944.
---
Resolution: Fixed

Mitigation is merged; further investigation of how to improve performance will 
be tracked in BEAM-8998.

> Python SDK harness performance degradation with UnboundedThreadPoolExecutor
> ---
>
> Key: BEAM-8944
> URL: https://issues.apache.org/jira/browse/BEAM-8944
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-harness
>Affects Versions: 2.18.0
>Reporter: Yichi Zhang
>Assignee: Yichi Zhang
>Priority: Blocker
> Fix For: 2.18.0
>
> Attachments: profiling.png, profiling_one_thread.png, 
> profiling_twelve_threads.png
>
>  Time Spent: 4h 20m
>  Remaining Estimate: 0h
>
> We are seeing a performance degradation for python streaming word count load 
> tests.
>  
> After some investigation, it appears to be caused by swapping the original 
> ThreadPoolExecutor to UnboundedThreadPoolExecutor in sdk worker. Suspicion is 
> that python performance is worse with more threads on cpu-bounded tasks.
>  
> A simple test for comparing the multiple thread pool executor performance:
>  
> {code:python}
> def test_performance(self):
>    def run_perf(executor):
>      total_number = 100
>      q = queue.Queue()
>     def task(number):
>        hash(number)
>        q.put(number + 200)
>        return number
>     t = time.time()
>      count = 0
>      for i in range(200):
>        q.put(i)
>     while count < total_number:
>        executor.submit(task, q.get(block=True))
>        count += 1
>      print('%s uses %s' % (executor, time.time() - t))
>    with UnboundedThreadPoolExecutor() as executor:
>      run_perf(executor)
>    with futures.ThreadPoolExecutor(max_workers=1) as executor:
>      run_perf(executor)
>    with futures.ThreadPoolExecutor(max_workers=12) as executor:
>      run_perf(executor)
> {code}
> Results:
> UnboundedThreadPoolExecutor (object at 0x7fab400dbe50) uses 268.160675049
> ThreadPoolExecutor(max_workers=1) uses 79.904583931
> ThreadPoolExecutor(max_workers=12) uses 191.179054976
> Profiling:
> UnboundedThreadPoolExecutor:
>  !profiling.png! 
> 1 Thread ThreadPoolExecutor:
>  !profiling_one_thread.png! 
> 12 Threads ThreadPoolExecutor:
>  !profiling_twelve_threads.png! 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work started] (BEAM-8944) Python SDK harness performance degradation with UnboundedThreadPoolExecutor

2019-12-26 Thread Yichi Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on BEAM-8944 started by Yichi Zhang.
-
> Python SDK harness performance degradation with UnboundedThreadPoolExecutor
> ---
>
> Key: BEAM-8944
> URL: https://issues.apache.org/jira/browse/BEAM-8944
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-harness
>Affects Versions: 2.18.0
>Reporter: Yichi Zhang
>Assignee: Yichi Zhang
>Priority: Blocker
> Fix For: 2.18.0
>
> Attachments: profiling.png, profiling_one_thread.png, 
> profiling_twelve_threads.png
>
>  Time Spent: 4h 20m
>  Remaining Estimate: 0h
>
> We are seeing a performance degradation for python streaming word count load 
> tests.
>  
> After some investigation, it appears to be caused by swapping the original 
> ThreadPoolExecutor to UnboundedThreadPoolExecutor in sdk worker. Suspicion is 
> that python performance is worse with more threads on cpu-bounded tasks.
>  
> A simple test for comparing the multiple thread pool executor performance:
>  
> {code:python}
> def test_performance(self):
>    def run_perf(executor):
>      total_number = 100
>      q = queue.Queue()
>     def task(number):
>        hash(number)
>        q.put(number + 200)
>        return number
>     t = time.time()
>      count = 0
>      for i in range(200):
>        q.put(i)
>     while count < total_number:
>        executor.submit(task, q.get(block=True))
>        count += 1
>      print('%s uses %s' % (executor, time.time() - t))
>    with UnboundedThreadPoolExecutor() as executor:
>      run_perf(executor)
>    with futures.ThreadPoolExecutor(max_workers=1) as executor:
>      run_perf(executor)
>    with futures.ThreadPoolExecutor(max_workers=12) as executor:
>      run_perf(executor)
> {code}
> Results:
> UnboundedThreadPoolExecutor (object at 0x7fab400dbe50) uses 268.160675049
> ThreadPoolExecutor(max_workers=1) uses 79.904583931
> ThreadPoolExecutor(max_workers=12) uses 191.179054976
> Profiling:
> UnboundedThreadPoolExecutor:
>  !profiling.png! 
> 1 Thread ThreadPoolExecutor:
>  !profiling_one_thread.png! 
> 12 Threads ThreadPoolExecutor:
>  !profiling_twelve_threads.png! 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Work started] (BEAM-8623) Add additional message field to Provision API response for passing status endpoint

2019-12-23 Thread Yichi Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on BEAM-8623 started by Yichi Zhang.
-
> Add additional message field to Provision API response for passing status 
> endpoint
> --
>
> Key: BEAM-8623
> URL: https://issues.apache.org/jira/browse/BEAM-8623
> Project: Beam
>  Issue Type: Sub-task
>  Components: beam-model
>Reporter: Yichi Zhang
>Assignee: Yichi Zhang
>Priority: Minor
>  Time Spent: 4h 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (BEAM-8623) Add additional message field to Provision API response for passing status endpoint

2019-12-23 Thread Yichi Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yichi Zhang resolved BEAM-8623.
---
Fix Version/s: 2.19.0
   Resolution: Fixed

> Add additional message field to Provision API response for passing status 
> endpoint
> --
>
> Key: BEAM-8623
> URL: https://issues.apache.org/jira/browse/BEAM-8623
> Project: Beam
>  Issue Type: Sub-task
>  Components: beam-model
>Reporter: Yichi Zhang
>Assignee: Yichi Zhang
>Priority: Minor
> Fix For: 2.19.0
>
>  Time Spent: 4h 20m
>  Remaining Estimate: 0h
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-8944) Python SDK harness performance degradation with UnboundedThreadPoolExecutor

2019-12-21 Thread Yichi Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-8944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17001802#comment-17001802
 ] 

Yichi Zhang commented on BEAM-8944:
---

I think so

> Python SDK harness performance degradation with UnboundedThreadPoolExecutor
> ---
>
> Key: BEAM-8944
> URL: https://issues.apache.org/jira/browse/BEAM-8944
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-harness
>Affects Versions: 2.18.0
>Reporter: Yichi Zhang
>Assignee: Yichi Zhang
>Priority: Blocker
> Fix For: 2.18.0
>
> Attachments: profiling.png, profiling_one_thread.png, 
> profiling_twelve_threads.png
>
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> We are seeing a performance degradation for python streaming word count load 
> tests.
>  
> After some investigation, it appears to be caused by swapping the original 
> ThreadPoolExecutor to UnboundedThreadPoolExecutor in sdk worker. Suspicion is 
> that python performance is worse with more threads on cpu-bounded tasks.
>  
> A simple test for comparing the multiple thread pool executor performance:
>  
> {code:python}
> def test_performance(self):
>    def run_perf(executor):
>      total_number = 100
>      q = queue.Queue()
>     def task(number):
>        hash(number)
>        q.put(number + 200)
>        return number
>     t = time.time()
>      count = 0
>      for i in range(200):
>        q.put(i)
>     while count < total_number:
>        executor.submit(task, q.get(block=True))
>        count += 1
>      print('%s uses %s' % (executor, time.time() - t))
>    with UnboundedThreadPoolExecutor() as executor:
>      run_perf(executor)
>    with futures.ThreadPoolExecutor(max_workers=1) as executor:
>      run_perf(executor)
>    with futures.ThreadPoolExecutor(max_workers=12) as executor:
>      run_perf(executor)
> {code}
> Results:
> UnboundedThreadPoolExecutor (object at 0x7fab400dbe50) uses 268.160675049
> ThreadPoolExecutor(max_workers=1) uses 79.904583931
> ThreadPoolExecutor(max_workers=12) uses 191.179054976
> Profiling:
> UnboundedThreadPoolExecutor:
>  !profiling.png! 
> 1 Thread ThreadPoolExecutor:
>  !profiling_one_thread.png! 
> 12 Threads ThreadPoolExecutor:
>  !profiling_twelve_threads.png! 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-8998) Avoid excessive bundle progress polling in Dataflow Runner

2019-12-20 Thread Yichi Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-8998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17001079#comment-17001079
 ] 

Yichi Zhang commented on BEAM-8998:
---

Throttling was introduced to avoid the expensive scheduling problem mentioned in 
BEAM-5791.

> Avoid excessive bundle progress polling in Dataflow Runner
> --
>
> Key: BEAM-8998
> URL: https://issues.apache.org/jira/browse/BEAM-8998
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-dataflow
>Reporter: Yichi Zhang
>Priority: Major
>
> Dataflow Java runner uses 0.1 secs interval for polling bundle progress from 
> SDK Harness, and use the result to decide whether data transfer should be 
> throttled. This can potentially overload SDK Harness. 
> We should try to come up with a way to avoid the throttling and lower the 
> bundle progress request frequency significantly.
>  
> Code reference:
> frequency setting: 
> [https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/worker/src/main/java/org/apache/beam/runners/dataflow/worker/fn/control/BeamFnMapTaskExecutor.java#L296]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-8944) Python SDK harness performance degradation with UnboundedThreadPoolExecutor

2019-12-19 Thread Yichi Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-8944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17000448#comment-17000448
 ] 

Yichi Zhang commented on BEAM-8944:
---

then yeah, it'll affect python streaming jobs (which is only on portable runner 
with fnapi).

> Python SDK harness performance degradation with UnboundedThreadPoolExecutor
> ---
>
> Key: BEAM-8944
> URL: https://issues.apache.org/jira/browse/BEAM-8944
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-harness
>Affects Versions: 2.18.0
>Reporter: Yichi Zhang
>Assignee: Yichi Zhang
>Priority: Blocker
> Fix For: 2.18.0
>
> Attachments: profiling.png, profiling_one_thread.png, 
> profiling_twelve_threads.png
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> We are seeing a performance degradation for python streaming word count load 
> tests.
>  
> After some investigation, it appears to be caused by swapping the original 
> ThreadPoolExecutor to UnboundedThreadPoolExecutor in sdk worker. Suspicion is 
> that python performance is worse with more threads on cpu-bounded tasks.
>  
> A simple test for comparing the multiple thread pool executor performance:
>  
> {code:python}
> def test_performance(self):
>    def run_perf(executor):
>      total_number = 100
>      q = queue.Queue()
>     def task(number):
>        hash(number)
>        q.put(number + 200)
>        return number
>     t = time.time()
>      count = 0
>      for i in range(200):
>        q.put(i)
>     while count < total_number:
>        executor.submit(task, q.get(block=True))
>        count += 1
>      print('%s uses %s' % (executor, time.time() - t))
>    with UnboundedThreadPoolExecutor() as executor:
>      run_perf(executor)
>    with futures.ThreadPoolExecutor(max_workers=1) as executor:
>      run_perf(executor)
>    with futures.ThreadPoolExecutor(max_workers=12) as executor:
>      run_perf(executor)
> {code}
> Results:
> UnboundedThreadPoolExecutor (object at 0x7fab400dbe50) uses 268.160675049
> ThreadPoolExecutor(max_workers=1) uses 79.904583931
> ThreadPoolExecutor(max_workers=12) uses 191.179054976
> Profiling:
> UnboundedThreadPoolExecutor:
>  !profiling.png! 
> 1 Thread ThreadPoolExecutor:
>  !profiling_one_thread.png! 
> 12 Threads ThreadPoolExecutor:
>  !profiling_twelve_threads.png! 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (BEAM-8944) Python SDK harness performance degradation with UnboundedThreadPoolExecutor

2019-12-19 Thread Yichi Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-8944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17000341#comment-17000341
 ] 

Yichi Zhang edited comment on BEAM-8944 at 12/19/19 8:14 PM:
-

This bug doesn't affect current production runners since the change of using 
more threads in SDK Harness doesn't exist in current released beam versions 
(the Dataflow runner issue mentioned in #10387 TODO affect current production 
runners but has limited impact with this fix, and will be investigated later).


was (Author: yichi):
This bug doesn't affect current production runners since the change of using 
more threads in SDK Harness doesn't exist in current released beam versions 
(the Dataflow runner issue mentioned in #10387 TODO affect current production 
runners but has limited impact, and will be investigated later).

> Python SDK harness performance degradation with UnboundedThreadPoolExecutor
> ---
>
> Key: BEAM-8944
> URL: https://issues.apache.org/jira/browse/BEAM-8944
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-harness
>Affects Versions: 2.18.0
>Reporter: Yichi Zhang
>Assignee: Yichi Zhang
>Priority: Blocker
> Fix For: 2.18.0
>
> Attachments: profiling.png, profiling_one_thread.png, 
> profiling_twelve_threads.png
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> We are seeing a performance degradation for python streaming word count load 
> tests.
>  
> After some investigation, it appears to be caused by swapping the original 
> ThreadPoolExecutor to UnboundedThreadPoolExecutor in sdk worker. Suspicion is 
> that python performance is worse with more threads on cpu-bounded tasks.
>  
> A simple test for comparing the multiple thread pool executor performance:
>  
> {code:python}
> def test_performance(self):
>    def run_perf(executor):
>      total_number = 100
>      q = queue.Queue()
>     def task(number):
>        hash(number)
>        q.put(number + 200)
>        return number
>     t = time.time()
>      count = 0
>      for i in range(200):
>        q.put(i)
>     while count < total_number:
>        executor.submit(task, q.get(block=True))
>        count += 1
>      print('%s uses %s' % (executor, time.time() - t))
>    with UnboundedThreadPoolExecutor() as executor:
>      run_perf(executor)
>    with futures.ThreadPoolExecutor(max_workers=1) as executor:
>      run_perf(executor)
>    with futures.ThreadPoolExecutor(max_workers=12) as executor:
>      run_perf(executor)
> {code}
> Results:
> UnboundedThreadPoolExecutor (object at 0x7fab400dbe50) uses 268.160675049
> ThreadPoolExecutor(max_workers=1) uses 79.904583931
> ThreadPoolExecutor(max_workers=12) uses 191.179054976
> Profiling:
> UnboundedThreadPoolExecutor:
>  !profiling.png! 
> 1 Thread ThreadPoolExecutor:
>  !profiling_one_thread.png! 
> 12 Threads ThreadPoolExecutor:
>  !profiling_twelve_threads.png! 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-8998) Avoid excessive bundle progress polling in Dataflow Runner

2019-12-19 Thread Yichi Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yichi Zhang updated BEAM-8998:
--
Description: 
Dataflow Java runner uses 0.1 secs interval for polling bundle progress from 
SDK Harness, and use the result to decide whether data transfer should be 
throttled. This can potentially overload SDK Harness. 

We should try to come up with a way to avoid the throttling and lower the 
bundle progress request frequency significantly.

 

Code reference:

frequency setting: 
[https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/worker/src/main/java/org/apache/beam/runners/dataflow/worker/fn/control/BeamFnMapTaskExecutor.java#L296]

  was:
Dataflow Java runner uses 0.1 secs interval for polling bundle progress from 
SDK Harness, and use the result to decide whether data delivery should be 
throttled. This can potentially overload SDK Harness. 

We should try to come up with a way to avoid the throttling and lower the 
bundle progress request frequency significantly.

 

Code reference:

frequency setting: 
[https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/worker/src/main/java/org/apache/beam/runners/dataflow/worker/fn/control/BeamFnMapTaskExecutor.java#L296]


> Avoid excessive bundle progress polling in Dataflow Runner
> --
>
> Key: BEAM-8998
> URL: https://issues.apache.org/jira/browse/BEAM-8998
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-dataflow
>Reporter: Yichi Zhang
>Priority: Major
>
> Dataflow Java runner uses 0.1 secs interval for polling bundle progress from 
> SDK Harness, and use the result to decide whether data transfer should be 
> throttled. This can potentially overload SDK Harness. 
> We should try to come up with a way to avoid the throttling and lower the 
> bundle progress request frequency significantly.
>  
> Code reference:
> frequency setting: 
> [https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/worker/src/main/java/org/apache/beam/runners/dataflow/worker/fn/control/BeamFnMapTaskExecutor.java#L296]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-8998) Avoid excessive bundle progress polling in Dataflow Runner

2019-12-19 Thread Yichi Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yichi Zhang updated BEAM-8998:
--
Description: 
Dataflow Java runner uses 0.1 secs interval for polling bundle progress from 
SDK Harness, and use the result to decide whether data delivery should be 
throttled. This can potentially overload SDK Harness. 

We should try to come up with a way to avoid the throttling and lower the 
bundle progress request frequency significantly.

 

Code reference:

frequency setting: 
[https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/worker/src/main/java/org/apache/beam/runners/dataflow/worker/fn/control/BeamFnMapTaskExecutor.java#L296]

  was:
Dataflow Java runner uses 0.1 secs interval for polling bundle progress from 
SDK Harness, and use the result to decide whether data delivery should be 
throttled. This can potentially overload SDK Harness. 

We should try to come up with a way to avoid the throttling and lower the 
bundle progress request frequency significantly.


> Avoid excessive bundle progress polling in Dataflow Runner
> --
>
> Key: BEAM-8998
> URL: https://issues.apache.org/jira/browse/BEAM-8998
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-dataflow
>Reporter: Yichi Zhang
>Priority: Major
>
> Dataflow Java runner uses 0.1 secs interval for polling bundle progress from 
> SDK Harness, and use the result to decide whether data delivery should be 
> throttled. This can potentially overload SDK Harness. 
> We should try to come up with a way to avoid the throttling and lower the 
> bundle progress request frequency significantly.
>  
> Code reference:
> frequency setting: 
> [https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/worker/src/main/java/org/apache/beam/runners/dataflow/worker/fn/control/BeamFnMapTaskExecutor.java#L296]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-8998) Avoid excessive bundle progress polling in Dataflow Runner

2019-12-19 Thread Yichi Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yichi Zhang updated BEAM-8998:
--
Summary: Avoid excessive bundle progress polling in Dataflow Runner  (was: 
Avoid excessive bundle progress polling in JRH)

> Avoid excessive bundle progress polling in Dataflow Runner
> --
>
> Key: BEAM-8998
> URL: https://issues.apache.org/jira/browse/BEAM-8998
> Project: Beam
>  Issue Type: Improvement
>  Components: runner-dataflow
>Reporter: Yichi Zhang
>Priority: Major
>
> Dataflow Java runner uses 0.1 secs interval for polling bundle progress from 
> SDK Harness, and use the result to decide whether data delivery should be 
> throttled. This can potentially overload SDK Harness. 
> We should try to come up with a way to avoid the throttling and lower the 
> bundle progress request frequency significantly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (BEAM-8998) Avoid excessive bundle progress polling in JRH

2019-12-19 Thread Yichi Zhang (Jira)
Yichi Zhang created BEAM-8998:
-

 Summary: Avoid excessive bundle progress polling in JRH
 Key: BEAM-8998
 URL: https://issues.apache.org/jira/browse/BEAM-8998
 Project: Beam
  Issue Type: Improvement
  Components: runner-dataflow
Reporter: Yichi Zhang


Dataflow Java runner uses 0.1 secs interval for polling bundle progress from 
SDK Harness, and uses the result to decide whether data delivery should be 
throttled. This can potentially overload SDK Harness. 

We should try to come up with a way to avoid the throttling and lower the 
bundle progress request frequency significantly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-8944) Python SDK harness performance degradation with UnboundedThreadPoolExecutor

2019-12-16 Thread Yichi Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yichi Zhang updated BEAM-8944:
--
Priority: Blocker  (was: Major)

> Python SDK harness performance degradation with UnboundedThreadPoolExecutor
> ---
>
> Key: BEAM-8944
> URL: https://issues.apache.org/jira/browse/BEAM-8944
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-harness
>Affects Versions: 2.18.0
>Reporter: Yichi Zhang
>Priority: Blocker
> Attachments: profiling.png, profiling_one_thread.png, 
> profiling_twelve_threads.png
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> We are seeing a performance degradation for python streaming word count load 
> tests.
>  
> After some investigation, it appears to be caused by swapping the original 
> ThreadPoolExecutor to UnboundedThreadPoolExecutor in sdk worker. Suspicion is 
> that python performance is worse with more threads on cpu-bounded tasks.
>  
> A simple test for comparing the multiple thread pool executor performance:
>  
> {code:python}
> def test_performance(self):
>    def run_perf(executor):
>      total_number = 100
>      q = queue.Queue()
>     def task(number):
>        hash(number)
>        q.put(number + 200)
>        return number
>     t = time.time()
>      count = 0
>      for i in range(200):
>        q.put(i)
>     while count < total_number:
>        executor.submit(task, q.get(block=True))
>        count += 1
>      print('%s uses %s' % (executor, time.time() - t))
>    with UnboundedThreadPoolExecutor() as executor:
>      run_perf(executor)
>    with futures.ThreadPoolExecutor(max_workers=1) as executor:
>      run_perf(executor)
>    with futures.ThreadPoolExecutor(max_workers=12) as executor:
>      run_perf(executor)
> {code}
> Results:
> UnboundedThreadPoolExecutor (object at 0x7fab400dbe50) uses 268.160675049
> ThreadPoolExecutor(max_workers=1) uses 79.904583931
> ThreadPoolExecutor(max_workers=12) uses 191.179054976
> Profiling:
> UnboundedThreadPoolExecutor:
>  !profiling.png! 
> 1 Thread ThreadPoolExecutor:
>  !profiling_one_thread.png! 
> 12 Threads ThreadPoolExecutor:
>  !profiling_twelve_threads.png! 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-8944) Python SDK harness performance degradation with UnboundedThreadPoolExecutor

2019-12-16 Thread Yichi Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yichi Zhang updated BEAM-8944:
--
Issue Type: Bug  (was: Test)

> Python SDK harness performance degradation with UnboundedThreadPoolExecutor
> ---
>
> Key: BEAM-8944
> URL: https://issues.apache.org/jira/browse/BEAM-8944
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-harness
>Affects Versions: 2.18.0
>Reporter: Yichi Zhang
>Priority: Major
> Attachments: profiling.png, profiling_one_thread.png, 
> profiling_twelve_threads.png
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> We are seeing a performance degradation for python streaming word count load 
> tests.
>  
> After some investigation, it appears to be caused by swapping the original 
> ThreadPoolExecutor to UnboundedThreadPoolExecutor in sdk worker. Suspicion is 
> that python performance is worse with more threads on cpu-bounded tasks.
>  
> A simple test for comparing the multiple thread pool executor performance:
>  
> {code:python}
> def test_performance(self):
>    def run_perf(executor):
>      total_number = 100
>      q = queue.Queue()
>     def task(number):
>        hash(number)
>        q.put(number + 200)
>        return number
>     t = time.time()
>      count = 0
>      for i in range(200):
>        q.put(i)
>     while count < total_number:
>        executor.submit(task, q.get(block=True))
>        count += 1
>      print('%s uses %s' % (executor, time.time() - t))
>    with UnboundedThreadPoolExecutor() as executor:
>      run_perf(executor)
>    with futures.ThreadPoolExecutor(max_workers=1) as executor:
>      run_perf(executor)
>    with futures.ThreadPoolExecutor(max_workers=12) as executor:
>      run_perf(executor)
> {code}
> Results:
> UnboundedThreadPoolExecutor (object at 0x7fab400dbe50) uses 268.160675049
> ThreadPoolExecutor(max_workers=1) uses 79.904583931
> ThreadPoolExecutor(max_workers=12) uses 191.179054976
> Profiling:
> UnboundedThreadPoolExecutor:
>  !profiling.png! 
> 1 Thread ThreadPoolExecutor:
>  !profiling_one_thread.png! 
> 12 Threads ThreadPoolExecutor:
>  !profiling_twelve_threads.png! 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (BEAM-8886) Add a python mongodbio integration test that triggers load split

2019-12-12 Thread Yichi Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yichi Zhang resolved BEAM-8886.
---
Fix Version/s: 2.19.0
   Resolution: Fixed

> Add a python mongodbio integration test that triggers load split
> 
>
> Key: BEAM-8886
> URL: https://issues.apache.org/jira/browse/BEAM-8886
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-py-core
>Reporter: Yichi Zhang
>Assignee: Yichi Zhang
>Priority: Minor
> Fix For: 2.19.0
>
>  Time Spent: 7h 40m
>  Remaining Estimate: 0h
>
> Current integration test doesn't seem to trigger liquid sharding at all, we 
> should change integration test that has more load and potentially use the 
> mongodb k8s cluster.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (BEAM-8884) Python MongoDBIO TypeError when splitting

2019-12-12 Thread Yichi Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yichi Zhang resolved BEAM-8884.
---
Resolution: Fixed

> Python MongoDBIO TypeError when splitting
> -
>
> Key: BEAM-8884
> URL: https://issues.apache.org/jira/browse/BEAM-8884
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Brian Hulette
>Assignee: Yichi Zhang
>Priority: Major
> Fix For: 2.18.0
>
>  Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> From [slack|https://the-asf.slack.com/archives/CBDNLQZM1/p1575350991134000]:
> I am trying to run a pipeline (defined with the Python SDK) on Dataflow that 
> uses beam.io.ReadFromMongoDB. When dealing with very small datasets (<10mb) 
> it runs fine, when trying to run it with slightly larger datasets (70mb), I 
> always get this error:
> {code:}
> TypeError: '<' not supported between instances of 'dict' and 'ObjectId'
> {code}
> Stack trace see below. Running it on a local machine works just fine. I would 
> highly appreciate any pointers what this could be.
> I hope this is the right channel do address this.
> {code:}
> Traceback (most recent call last):
>   File 
> "/usr/local/lib/python3.7/site-packages/dataflow_worker/batchworker.py", line 
> 649, in do_work
> work_executor.execute()
>   File "/usr/local/lib/python3.7/site-packages/dataflow_worker/executor.py", 
> line 218, in execute
> self._split_task)
>   File "/usr/local/lib/python3.7/site-packages/dataflow_worker/executor.py", 
> line 226, in _perform_source_split_considering_api_limits
> desired_bundle_size)
>   File "/usr/local/lib/python3.7/site-packages/dataflow_worker/executor.py", 
> line 263, in _perform_source_split
> for split in source.split(desired_bundle_size):
>   File "/usr/local/lib/python3.7/site-packages/apache_beam/io/mongodbio.py", 
> line 174, in split
> bundle_end = min(stop_position, split_key_id)
> TypeError: '<' not supported between instances of 'dict' and 'ObjectId'
> {code}
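A minimal reproduction of the comparison failure outside Beam is below; it assumes the bson package that ships with pymongo, and the values are illustrative. MongoDB's splitVector command returns split points wrapped in documents like {'_id': ObjectId(...)}, so comparing such a document directly against a plain ObjectId bound triggers exactly this TypeError; unwrapping the '_id' first is one way to make the comparison well defined.

{code:python}
# Minimal reproduction of the TypeError outside Beam (assumes pymongo's bson).
from bson.objectid import ObjectId

stop_position = ObjectId()        # plain ObjectId bound (illustrative)
split_key = {'_id': ObjectId()}   # split point still wrapped in a document

try:
  min(stop_position, split_key)
except TypeError as e:
  print(e)  # '<' not supported between instances of 'dict' and 'ObjectId'

# Unwrapping the document first makes the comparison well defined:
bundle_end = min(stop_position, split_key['_id'])
print(bundle_end)
{code}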



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-8944) Python SDK harness performance degradation with UnboundedThreadPoolExecutor

2019-12-12 Thread Yichi Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yichi Zhang updated BEAM-8944:
--
Issue Type: Test  (was: Bug)

> Python SDK harness performance degradation with UnboundedThreadPoolExecutor
> ---
>
> Key: BEAM-8944
> URL: https://issues.apache.org/jira/browse/BEAM-8944
> Project: Beam
>  Issue Type: Test
>  Components: sdk-py-harness
>Affects Versions: 2.18.0
>Reporter: Yichi Zhang
>Priority: Major
> Attachments: profiling.png, profiling_one_thread.png, 
> profiling_twelve_threads.png
>
>
> We are seeing a performance degradation for python streaming word count load 
> tests.
>  
> After some investigation, it appears to be caused by swapping the original 
> ThreadPoolExecutor to UnboundedThreadPoolExecutor in sdk worker. Suspicion is 
> that python performance is worse with more threads on cpu-bounded tasks.
>  
> A simple test for comparing the multiple thread pool executor performance:
>  
> {code:python}
> def test_performance(self):
>    def run_perf(executor):
>      total_number = 100
>      q = queue.Queue()
>     def task(number):
>        hash(number)
>        q.put(number + 200)
>        return number
>     t = time.time()
>      count = 0
>      for i in range(200):
>        q.put(i)
>     while count < total_number:
>        executor.submit(task, q.get(block=True))
>        count += 1
>      print('%s uses %s' % (executor, time.time() - t))
>    with UnboundedThreadPoolExecutor() as executor:
>      run_perf(executor)
>    with futures.ThreadPoolExecutor(max_workers=1) as executor:
>      run_perf(executor)
>    with futures.ThreadPoolExecutor(max_workers=12) as executor:
>      run_perf(executor)
> {code}
> Results:
> UnboundedThreadPoolExecutor (object at 0x7fab400dbe50) uses 268.160675049
> ThreadPoolExecutor (max_workers=1) uses 79.904583931
> ThreadPoolExecutor (max_workers=12) uses 191.179054976
> Profiling:
> UnboundedThreadPoolExecutor:
>  !profiling.png! 
> 1 Thread ThreadPoolExecutor:
>  !profiling_one_thread.png! 
> 12 Threads ThreadPoolExecutor:
>  !profiling_twelve_threads.png! 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-8944) Python SDK harness performance degradation with UnboundedThreadPoolExecutor

2019-12-11 Thread Yichi Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-8944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16994104#comment-16994104
 ] 

Yichi Zhang commented on BEAM-8944:
---

It seems the original change was made to solve a deadlock/stuckness issue.

While the use of UnboundedThreadPoolExecutor does noticeably impact pipelines 
that saturate CPU usage (~90%), it has less effect on under-loaded pipelines.
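For readers unfamiliar with the stuckness issue referenced above, the sketch below is a generic illustration (not the actual Beam scenario) of why a bounded pool can deadlock when a task blocks on work submitted to the same pool; the unbounded pool trades that risk for the CPU contention measured in this ticket. Note the script hangs by design.

{code:python}
# Generic illustration of bounded-pool deadlock (not Beam code).
from concurrent.futures import ThreadPoolExecutor


def outer(executor):
  inner = executor.submit(lambda: 'done')
  # With max_workers=1 the pool's only thread is busy running outer(), so
  # the inner task is never scheduled and result() blocks forever.
  return inner.result()


if __name__ == '__main__':
  with ThreadPoolExecutor(max_workers=1) as executor:
    print(executor.submit(outer, executor).result())  # never returns
{code}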

> Python SDK harness performance degradation with UnboundedThreadPoolExecutor
> ---
>
> Key: BEAM-8944
> URL: https://issues.apache.org/jira/browse/BEAM-8944
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-harness
>Affects Versions: 2.18.0
>Reporter: Yichi Zhang
>Priority: Major
> Attachments: profiling.png, profiling_one_thread.png, 
> profiling_twelve_threads.png
>
>
> We are seeing a performance degradation for python streaming word count load 
> tests.
>  
> After some investigation, it appears to be caused by swapping the original 
> ThreadPoolExecutor to UnboundedThreadPoolExecutor in sdk worker. Suspicion is 
> that python performance is worse with more threads on cpu-bounded tasks.
>  
> A simple test for comparing the multiple thread pool executor performance:
>  
> {code:python}
> def test_performance(self):
>    def run_perf(executor):
>      total_number = 100
>      q = queue.Queue()
>     def task(number):
>        hash(number)
>        q.put(number + 200)
>        return number
>     t = time.time()
>      count = 0
>      for i in range(200):
>        q.put(i)
>     while count < total_number:
>        executor.submit(task, q.get(block=True))
>        count += 1
>      print('%s uses %s' % (executor, time.time() - t))
>    with UnboundedThreadPoolExecutor() as executor:
>      run_perf(executor)
>    with futures.ThreadPoolExecutor(max_workers=1) as executor:
>      run_perf(executor)
>    with futures.ThreadPoolExecutor(max_workers=12) as executor:
>      run_perf(executor)
> {code}
> Results:
> UnboundedThreadPoolExecutor (object at 0x7fab400dbe50) uses 268.160675049
> ThreadPoolExecutor (max_workers=1) uses 79.904583931
> ThreadPoolExecutor (max_workers=12) uses 191.179054976
> Profiling:
> UnboundedThreadPoolExecutor:
>  !profiling.png! 
> 1 Thread ThreadPoolExecutor:
>  !profiling_one_thread.png! 
> 12 Threads ThreadPoolExecutor:
>  !profiling_twelve_threads.png! 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-8944) Python SDK harness performance degradation with UnboundedThreadPoolExecutor

2019-12-11 Thread Yichi Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yichi Zhang updated BEAM-8944:
--
Description: 
We are seeing a performance degradation for python streaming word count load 
tests.

 

After some investigation, it appears to be caused by swapping the original 
ThreadPoolExecutor to UnboundedThreadPoolExecutor in sdk worker. Suspicion is 
that python performance is worse with more threads on cpu-bounded tasks.

 

A simple test for comparing the multiple thread pool executor performance:

 

{code:python}
def test_performance(self):
  def run_perf(executor):
    total_number = 100
    q = queue.Queue()

    def task(number):
      hash(number)
      q.put(number + 200)
      return number

    t = time.time()
    count = 0
    for i in range(200):
      q.put(i)

    while count < total_number:
      executor.submit(task, q.get(block=True))
      count += 1
    print('%s uses %s' % (executor, time.time() - t))

  with UnboundedThreadPoolExecutor() as executor:
    run_perf(executor)
  with futures.ThreadPoolExecutor(max_workers=1) as executor:
    run_perf(executor)
  with futures.ThreadPoolExecutor(max_workers=12) as executor:
    run_perf(executor)
{code}



Results:


UnboundedThreadPoolExecutor uses 268.160675049
ThreadPoolExecutor (max_workers=1) uses 79.904583931
ThreadPoolExecutor (max_workers=12) uses 191.179054976

Profiling:
UnboundedThreadPoolExecutor:
 !profiling.png! 

1 Thread ThreadPoolExecutor:
 !profiling_one_thread.png! 

12 Threads ThreadPoolExecutor:
 !profiling_twelve_threads.png! 





  was:
We are seeing a performance degradation for python streaming word count load 
tests.

 

After some investigation, it appears to be caused by swapping the original 
ThreadPoolExecutor to UnboundedThreadPoolExecutor in sdk worker. Suspicion is 
that python performance is worse with more threads on cpu-bounded tasks.

 

A simple test for comparing the multiple thread pool executor performance:

 
def test_performance(self):
   def run_perf(executor):
     total_number = 100
     q = queue.Queue()

    def task(number):
       hash(number)
       q.put(number + 200)
       return number

    t = time.time()
     count = 0
     for i in range(200):
       q.put\(i\)

    while count < total_number:
       executor.submit(task, q.get(block=True))
       count += 1
     print('%s uses %s' % (executor, time.time() - t))

  with UnboundedThreadPoolExecutor() as executor:
     run_perf(executor)
   with futures.ThreadPoolExecutor(max_workers=1) as executor:
     run_perf(executor)
   with futures.ThreadPoolExecutor(max_workers=12) as executor:
     run_perf(executor)


Results:


UnboundedThreadPoolExecutor uses 268.160675049
ThreadPoolExecutor (max_workers=1) uses 79.904583931
ThreadPoolExecutor (max_workers=12) uses 191.179054976

Profiling:
UnboundedThreadPoolExecutor:
 !profiling.png! 

1 Thread ThreadPoolExecutor:
 !profiling_one_thread.png! 

12 Threads ThreadPoolExecutor:
 !profiling_twelve_threads.png! 






> Python SDK harness performance degradation with UnboundedThreadPoolExecutor
> ---
>
> Key: BEAM-8944
> URL: https://issues.apache.org/jira/browse/BEAM-8944
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-harness
>Affects Versions: 2.18.0
>Reporter: Yichi Zhang
>Priority: Major
> Attachments: profiling.png, profiling_one_thread.png, 
> profiling_twelve_threads.png
>
>
> We are seeing a performance degradation for python streaming word count load 
> tests.
>  
> After some investigation, it appears to be caused by swapping the original 
> ThreadPoolExecutor to UnboundedThreadPoolExecutor in sdk worker. Suspicion is 
> that python performance is worse with more threads on cpu-bounded tasks.
>  
> A simple test for comparing the multiple thread pool executor performance:
>  
> {code:python}
> def test_performance(self):
>    def run_perf(executor):
>      total_number = 100
>      q = queue.Queue()
>     def task(number):
>        hash(number)
>        q.put(number + 200)
>        return number
>     t = time.time()
>      count = 0
>      for i in range(200):
>        q.put(i)
>     while count < total_number:
>        executor.submit(task, q.get(block=True))
>        count += 1
>      print('%s uses %s' % (executor, time.time() - t))
>    with UnboundedThreadPoolExecutor() as executor:
>      run_perf(executor)
>    with futures.ThreadPoolExecutor(max_workers=1) as executor:
>      run_perf(executor)
>    with futures.ThreadPoolExecutor(max_workers=12) as executor:
>      run_perf(executor)
> {code}
> Results:
> UnboundedThreadPoolExecutor (object at 0x7fab400dbe50) uses 268.160675049
> ThreadPoolExecutor (max_workers=1) uses 79.904583931
> ThreadPoolExecutor (max_workers=12) uses 191.179054976
> Profiling:
> UnboundedThreadPoolExecutor:
>  !profiling.png! 
> 1 Thread ThreadPoolExecutor:
>  !profiling_one_thread.png! 
> 12 Threads ThreadPoolExecutor:
>  !profiling_twelve_threads.png! 



--
This message was sent by Atlassian Jira

[jira] [Updated] (BEAM-8944) Python SDK harness performance degradation with UnboundedThreadPoolExecutor

2019-12-11 Thread Yichi Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yichi Zhang updated BEAM-8944:
--
Attachment: profiling_twelve_threads.png

> Python SDK harness performance degradation with UnboundedThreadPoolExecutor
> ---
>
> Key: BEAM-8944
> URL: https://issues.apache.org/jira/browse/BEAM-8944
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-harness
>Affects Versions: 2.18.0
>Reporter: Yichi Zhang
>Priority: Major
> Attachments: profiling.png, profiling_one_thread.png, 
> profiling_twelve_threads.png
>
>
> We are seeing a performance degradation for python streaming word count load 
> tests.
>  
> After some investigation, it appears to be caused by swapping the original 
> ThreadPoolExecutor to UnboundedThreadPoolExecutor in sdk worker. Suspicion is 
> that python performance is worse with more threads on cpu-bounded tasks.
>  
> A simple test for comparing the multiple thread pool executor performance:
>  
> def test_performance(self):
>    def run_perf(executor):
>      total_number = 100
>      q = queue.Queue()
>     def task(number):
>        hash(number)
>        q.put(number + 200)
>        return number
>     t = time.time()
>      count = 0
>      for i in range(200):
>        q.put\(i\)
>     while count < total_number:
>        executor.submit(task, q.get(block=True))
>        count += 1
>      print('%s uses %s' % (executor, time.time() - t))
>   with UnboundedThreadPoolExecutor() as executor:
>      run_perf(executor)
>    with futures.ThreadPoolExecutor(max_workers=1) as executor:
>      run_perf(executor)
>    with futures.ThreadPoolExecutor(max_workers=12) as executor:
>      run_perf(executor)
> Results:
> UnboundedThreadPoolExecutor (object at 0x7fab400dbe50) uses 268.160675049
> ThreadPoolExecutor (max_workers=1) uses 79.904583931
> ThreadPoolExecutor (max_workers=12) uses 191.179054976
> Profiling:
> UnboundedThreadPoolExecutor:
>  !profiling.png! 
> 1 Thread ThreadPoolExecutor:
>  !profiling_one_thread.png! 
> 12 Threads ThreadPoolExecutor:



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-8944) Python SDK harness performance degradation with UnboundedThreadPoolExecutor

2019-12-11 Thread Yichi Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yichi Zhang updated BEAM-8944:
--
Description: 
We are seeing a performance degradation for python streaming word count load 
tests.

 

After some investigation, it appears to be caused by swapping the original 
ThreadPoolExecutor to UnboundedThreadPoolExecutor in sdk worker. Suspicion is 
that python performance is worse with more threads on cpu-bounded tasks.

 

A simple test for comparing the multiple thread pool executor performance:

 
def test_performance(self):
   def run_perf(executor):
     total_number = 100
     q = queue.Queue()

    def task(number):
       hash(number)
       q.put(number + 200)
       return number

    t = time.time()
     count = 0
     for i in range(200):
       q.put\(i\)

    while count < total_number:
       executor.submit(task, q.get(block=True))
       count += 1
     print('%s uses %s' % (executor, time.time() - t))

  with UnboundedThreadPoolExecutor() as executor:
     run_perf(executor)
   with futures.ThreadPoolExecutor(max_workers=1) as executor:
     run_perf(executor)
   with futures.ThreadPoolExecutor(max_workers=12) as executor:
     run_perf(executor)


Results:


UnboundedThreadPoolExecutor uses 268.160675049
ThreadPoolExecutor (max_workers=1) uses 79.904583931
ThreadPoolExecutor (max_workers=12) uses 191.179054976

Profiling:
UnboundedThreadPoolExecutor:
 !profiling.png! 

1 Thread ThreadPoolExecutor:
 !profiling_one_thread.png! 

12 Threads ThreadPoolExecutor:
 !profiling_twelve_threads.png! 





  was:
We are seeing a performance degradation for python streaming word count load 
tests.

 

After some investigation, it appears to be caused by swapping the original 
ThreadPoolExecutor to UnboundedThreadPoolExecutor in sdk worker. Suspicion is 
that python performance is worse with more threads on cpu-bounded tasks.

 

A simple test for comparing the multiple thread pool executor performance:

 
def test_performance(self):
   def run_perf(executor):
     total_number = 100
     q = queue.Queue()

    def task(number):
       hash(number)
       q.put(number + 200)
       return number

    t = time.time()
     count = 0
     for i in range(200):
       q.put\(i\)

    while count < total_number:
       executor.submit(task, q.get(block=True))
       count += 1
     print('%s uses %s' % (executor, time.time() - t))

  with UnboundedThreadPoolExecutor() as executor:
     run_perf(executor)
   with futures.ThreadPoolExecutor(max_workers=1) as executor:
     run_perf(executor)
   with futures.ThreadPoolExecutor(max_workers=12) as executor:
     run_perf(executor)


Results:


UnboundedThreadPoolExecutor uses 268.160675049
ThreadPoolExecutor (max_workers=1) uses 79.904583931
ThreadPoolExecutor (max_workers=12) uses 191.179054976

Profiling:
UnboundedThreadPoolExecutor:
 !profiling.png! 

1 Thread ThreadPoolExecutor:
 !profiling_one_thread.png! 

12 Threads ThreadPoolExecutor:







> Python SDK harness performance degradation with UnboundedThreadPoolExecutor
> ---
>
> Key: BEAM-8944
> URL: https://issues.apache.org/jira/browse/BEAM-8944
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-harness
>Affects Versions: 2.18.0
>Reporter: Yichi Zhang
>Priority: Major
> Attachments: profiling.png, profiling_one_thread.png, 
> profiling_twelve_threads.png
>
>
> We are seeing a performance degradation for python streaming word count load 
> tests.
>  
> After some investigation, it appears to be caused by swapping the original 
> ThreadPoolExecutor to UnboundedThreadPoolExecutor in sdk worker. Suspicion is 
> that python performance is worse with more threads on cpu-bounded tasks.
>  
> A simple test for comparing the multiple thread pool executor performance:
>  
> def test_performance(self):
>    def run_perf(executor):
>      total_number = 100
>      q = queue.Queue()
>     def task(number):
>        hash(number)
>        q.put(number + 200)
>        return number
>     t = time.time()
>      count = 0
>      for i in range(200):
>        q.put\(i\)
>     while count < total_number:
>        executor.submit(task, q.get(block=True))
>        count += 1
>      print('%s uses %s' % (executor, time.time() - t))
>   with UnboundedThreadPoolExecutor() as executor:
>      run_perf(executor)
>    with futures.ThreadPoolExecutor(max_workers=1) as executor:
>      run_perf(executor)
>    with futures.ThreadPoolExecutor(max_workers=12) as executor:
>      run_perf(executor)
> Results:
> UnboundedThreadPoolExecutor (object at 0x7fab400dbe50) uses 268.160675049
> ThreadPoolExecutor (max_workers=1) uses 79.904583931
> ThreadPoolExecutor (max_workers=12) uses 191.179054976
> Profiling:
> UnboundedThreadPoolExecutor:
>  !profiling.png! 
> 1 Thread ThreadPoolExecutor:
>  !profiling_one_thread.png! 
> 12 Threads ThreadPoolExecutor:
>  !profiling_twelve_threads.png! 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-8944) Python SDK harness performance degradation with UnboundedThreadPoolExecutor

2019-12-11 Thread Yichi Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yichi Zhang updated BEAM-8944:
--
Description: 
We are seeing a performance degradation for python streaming word count load 
tests.

 

After some investigation, it appears to be caused by swapping the original 
ThreadPoolExecutor to UnboundedThreadPoolExecutor in sdk worker. Suspicion is 
that python performance is worse with more threads on cpu-bounded tasks.

 

A simple test for comparing the multiple thread pool executor performance:

 
def test_performance(self):
   def run_perf(executor):
     total_number = 100
     q = queue.Queue()

    def task(number):
       hash(number)
       q.put(number + 200)
       return number

    t = time.time()
     count = 0
     for i in range(200):
       q.put\(i\)

    while count < total_number:
       executor.submit(task, q.get(block=True))
       count += 1
     print('%s uses %s' % (executor, time.time() - t))

  with UnboundedThreadPoolExecutor() as executor:
     run_perf(executor)
   with futures.ThreadPoolExecutor(max_workers=1) as executor:
     run_perf(executor)
   with futures.ThreadPoolExecutor(max_workers=12) as executor:
     run_perf(executor)


Results:


UnboundedThreadPoolExecutor uses 268.160675049
ThreadPoolExecutor (max_workers=1) uses 79.904583931
ThreadPoolExecutor (max_workers=12) uses 191.179054976

Profiling:
UnboundedThreadPoolExecutor:
 !profiling.png! 

1 Thread ThreadPoolExecutor:
 !profiling_one_thread.png! 

12 Threads ThreadPoolExecutor:






  was:
We are seeing a performance degradation for python streaming word count load 
tests.

 

After some investigation, it appears to be caused by swapping the original 
ThreadPoolExecutor to UnboundedThreadPoolExecutor in sdk worker. Suspicion is 
that python performance is worse with more threads on cpu-bounded tasks.

 

A simple test for comparing the multiple thread pool executor performance:

 
def test_performance(self):
   def run_perf(executor):
     total_number = 100
     q = queue.Queue()

    def task(number):
       hash(number)
       q.put(number + 200)
       return number

    t = time.time()
     count = 0
     for i in range(200):
       q.put\(i\)

    while count < total_number:
       executor.submit(task, q.get(block=True))
       count += 1
     print('%s uses %s' % (executor, time.time() - t))

  with UnboundedThreadPoolExecutor() as executor:
     run_perf(executor)
   with futures.ThreadPoolExecutor(max_workers=1) as executor:
     run_perf(executor)
   with futures.ThreadPoolExecutor(max_workers=12) as executor:
     run_perf(executor)


Results:


UnboundedThreadPoolExecutor uses 268.160675049
ThreadPoolExecutor (max_workers=1) uses 79.904583931
ThreadPoolExecutor (max_workers=12) uses 191.179054976

Profiling:
UnboundedThreadPoolExecutor:
 !profiling.png! 

1 Thread ThreadPoolExecutor:
 !profiling_one_thread.png! 

12 Threads ThreadPoolExecutor:






> Python SDK harness performance degradation with UnboundedThreadPoolExecutor
> ---
>
> Key: BEAM-8944
> URL: https://issues.apache.org/jira/browse/BEAM-8944
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-harness
>Affects Versions: 2.18.0
>Reporter: Yichi Zhang
>Priority: Major
> Attachments: profiling.png, profiling_one_thread.png
>
>
> We are seeing a performance degradation for python streaming word count load 
> tests.
>  
> After some investigation, it appears to be caused by swapping the original 
> ThreadPoolExecutor to UnboundedThreadPoolExecutor in sdk worker. Suspicion is 
> that python performance is worse with more threads on cpu-bounded tasks.
>  
> A simple test for comparing the multiple thread pool executor performance:
>  
> def test_performance(self):
>    def run_perf(executor):
>      total_number = 100
>      q = queue.Queue()
>     def task(number):
>        hash(number)
>        q.put(number + 200)
>        return number
>     t = time.time()
>      count = 0
>      for i in range(200):
>        q.put\(i\)
>     while count < total_number:
>        executor.submit(task, q.get(block=True))
>        count += 1
>      print('%s uses %s' % (executor, time.time() - t))
>   with UnboundedThreadPoolExecutor() as executor:
>      run_perf(executor)
>    with futures.ThreadPoolExecutor(max_workers=1) as executor:
>      run_perf(executor)
>    with futures.ThreadPoolExecutor(max_workers=12) as executor:
>      run_perf(executor)
> Results:
> UnboundedThreadPoolExecutor (object at 0x7fab400dbe50) uses 268.160675049
> ThreadPoolExecutor (max_workers=1) uses 79.904583931
> ThreadPoolExecutor (max_workers=12) uses 191.179054976
> Profiling:
> UnboundedThreadPoolExecutor:
>  !profiling.png! 
> 1 Thread ThreadPoolExecutor:
>  !profiling_one_thread.png! 
> 12 Threads ThreadPoolExecutor:



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-8944) Python SDK harness performance degradation with UnboundedThreadPoolExecutor

2019-12-11 Thread Yichi Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yichi Zhang updated BEAM-8944:
--
Description: 
We are seeing a performance degradation for python streaming word count load 
tests.

 

After some investigation, it appears to be caused by swapping the original 
ThreadPoolExecutor to UnboundedThreadPoolExecutor in sdk worker. Suspicion is 
that python performance is worse with more threads on cpu-bounded tasks.

 

A simple test for comparing the multiple thread pool executor performance:

 
def test_performance(self):
   def run_perf(executor):
     total_number = 100
     q = queue.Queue()

    def task(number):
       hash(number)
       q.put(number + 200)
       return number

    t = time.time()
     count = 0
     for i in range(200):
       q.put\(i\)

    while count < total_number:
       executor.submit(task, q.get(block=True))
       count += 1
     print('%s uses %s' % (executor, time.time() - t))

  with UnboundedThreadPoolExecutor() as executor:
     run_perf(executor)
   with futures.ThreadPoolExecutor(max_workers=1) as executor:
     run_perf(executor)
   with futures.ThreadPoolExecutor(max_workers=12) as executor:
     run_perf(executor)


Results:


UnboundedThreadPoolExecutor uses 268.160675049
ThreadPoolExecutor (max_workers=1) uses 79.904583931
ThreadPoolExecutor (max_workers=12) uses 191.179054976

Profiling:
UnboundedThreadPoolExecutor:
 !profiling.png! 

1 Thread ThreadPoolExecutor:
 !profiling_one_thread.png! 

12 Threads ThreadPoolExecutor:





  was:
We are seeing a performance degradation for python streaming word count load 
tests.

 

After some investigation, it appears to be caused by swapping the original 
ThreadPoolExecutor to UnboundedThreadPoolExecutor in sdk worker. Suspicion is 
that python performance is worse with more threads on cpu-bounded tasks.

 

A simple test for comparing the multiple thread pool executor performance:

 
def test_performance(self):
   def run_perf(executor):
     total_number = 100
     q = queue.Queue()

    def task(number):
       hash(number)
       q.put(number + 200)
       return number

    t = time.time()
     count = 0
     for i in range(200):
       q.put\(i\)

    while count < total_number:
       executor.submit(task, q.get(block=True))
       count += 1
     print('%s uses %s' % (executor, time.time() - t))

  with UnboundedThreadPoolExecutor() as executor:
     run_perf(executor)
   with futures.ThreadPoolExecutor(max_workers=1) as executor:
     run_perf(executor)
   with futures.ThreadPoolExecutor(max_workers=12) as executor:
     run_perf(executor)


Results:


UnboundedThreadPoolExecutor uses 268.160675049
ThreadPoolExecutor (max_workers=1) uses 79.904583931
ThreadPoolExecutor (max_workers=12) uses 191.179054976

Profiling:
UnboundedThreadPoolExecutor:
 !profiling.png! 
1 Thread ThreadPoolExecutor:





> Python SDK harness performance degradation with UnboundedThreadPoolExecutor
> ---
>
> Key: BEAM-8944
> URL: https://issues.apache.org/jira/browse/BEAM-8944
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-harness
>Affects Versions: 2.18.0
>Reporter: Yichi Zhang
>Priority: Major
> Attachments: profiling.png, profiling_one_thread.png
>
>
> We are seeing a performance degradation for python streaming word count load 
> tests.
>  
> After some investigation, it appears to be caused by swapping the original 
> ThreadPoolExecutor to UnboundedThreadPoolExecutor in sdk worker. Suspicion is 
> that python performance is worse with more threads on cpu-bounded tasks.
>  
> A simple test for comparing the multiple thread pool executor performance:
>  
> def test_performance(self):
>    def run_perf(executor):
>      total_number = 100
>      q = queue.Queue()
>     def task(number):
>        hash(number)
>        q.put(number + 200)
>        return number
>     t = time.time()
>      count = 0
>      for i in range(200):
>        q.put\(i\)
>     while count < total_number:
>        executor.submit(task, q.get(block=True))
>        count += 1
>      print('%s uses %s' % (executor, time.time() - t))
>   with UnboundedThreadPoolExecutor() as executor:
>      run_perf(executor)
>    with futures.ThreadPoolExecutor(max_workers=1) as executor:
>      run_perf(executor)
>    with futures.ThreadPoolExecutor(max_workers=12) as executor:
>      run_perf(executor)
> Results:
> UnboundedThreadPoolExecutor (object at 0x7fab400dbe50) uses 268.160675049
> ThreadPoolExecutor (max_workers=1) uses 79.904583931
> ThreadPoolExecutor (max_workers=12) uses 191.179054976
> Profiling:
> UnboundedThreadPoolExecutor:
>  !profiling.png! 
> 1 Thread ThreadPoolExecutor:
>  !profiling_one_thread.png! 
> 12 Threads ThreadPoolExecutor:



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-8944) Python SDK harness performance degradation with UnboundedThreadPoolExecutor

2019-12-11 Thread Yichi Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yichi Zhang updated BEAM-8944:
--
Attachment: profiling_one_thread.png

> Python SDK harness performance degradation with UnboundedThreadPoolExecutor
> ---
>
> Key: BEAM-8944
> URL: https://issues.apache.org/jira/browse/BEAM-8944
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-harness
>Affects Versions: 2.18.0
>Reporter: Yichi Zhang
>Priority: Major
> Attachments: profiling.png, profiling_one_thread.png
>
>
> We are seeing a performance degradation for python streaming word count load 
> tests.
>  
> After some investigation, it appears to be caused by swapping the original 
> ThreadPoolExecutor to UnboundedThreadPoolExecutor in sdk worker. Suspicion is 
> that python performance is worse with more threads on cpu-bounded tasks.
>  
> A simple test for comparing the multiple thread pool executor performance:
>  
> def test_performance(self):
>    def run_perf(executor):
>      total_number = 100
>      q = queue.Queue()
>     def task(number):
>        hash(number)
>        q.put(number + 200)
>        return number
>     t = time.time()
>      count = 0
>      for i in range(200):
>        q.put\(i\)
>     while count < total_number:
>        executor.submit(task, q.get(block=True))
>        count += 1
>      print('%s uses %s' % (executor, time.time() - t))
>   with UnboundedThreadPoolExecutor() as executor:
>      run_perf(executor)
>    with futures.ThreadPoolExecutor(max_workers=1) as executor:
>      run_perf(executor)
>    with futures.ThreadPoolExecutor(max_workers=12) as executor:
>      run_perf(executor)
> Results:
> UnboundedThreadPoolExecutor (object at 0x7fab400dbe50) uses 268.160675049
> ThreadPoolExecutor (max_workers=1) uses 79.904583931
> ThreadPoolExecutor (max_workers=12) uses 191.179054976
> Profiling:
> UnboundedThreadPoolExecutor:
>  !profiling.png! 
> 1 Thread ThreadPoolExecutor:



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-8944) Python SDK harness performance degradation with UnboundedThreadPoolExecutor

2019-12-11 Thread Yichi Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yichi Zhang updated BEAM-8944:
--
Description: 
We are seeing a performance degradation for python streaming word count load 
tests.

 

After some investigation, it appears to be caused by swapping the original 
ThreadPoolExecutor to UnboundedThreadPoolExecutor in sdk worker. Suspicion is 
that python performance is worse with more threads on cpu-bounded tasks.

 

A simple test for comparing the multiple thread pool executor performance:

 
def test_performance(self):
   def run_perf(executor):
     total_number = 100
     q = queue.Queue()

    def task(number):
       hash(number)
       q.put(number + 200)
       return number

    t = time.time()
     count = 0
     for i in range(200):
       q.put\(i\)

    while count < total_number:
       executor.submit(task, q.get(block=True))
       count += 1
     print('%s uses %s' % (executor, time.time() - t))

  with UnboundedThreadPoolExecutor() as executor:
     run_perf(executor)
   with futures.ThreadPoolExecutor(max_workers=1) as executor:
     run_perf(executor)
   with futures.ThreadPoolExecutor(max_workers=12) as executor:
     run_perf(executor)


Results:


UnboundedThreadPoolExecutor uses 268.160675049
ThreadPoolExecutor (max_workers=1) uses 79.904583931
ThreadPoolExecutor (max_workers=12) uses 191.179054976

Profiling:
UnboundedThreadPoolExecutor:
 !profiling.png! 
1 Thread ThreadPoolExecutor:




  was:
We are seeing a performance degradation for python streaming word count load 
tests.

 

After some investigation, it appears to be caused by swapping the original 
ThreadPoolExecutor to UnboundedThreadPoolExecutor in sdk worker. Suspicion is 
that python performance is worse with more threads on cpu-bounded tasks.

 

A simple test for comparing the multiple thread pool executor performance:

 
def test_performance(self):
   def run_perf(executor):
     total_number = 100
     q = queue.Queue()

    def task(number):
       hash(number)
       q.put(number + 200)
       return number

    t = time.time()
     count = 0
     for i in range(200):
       q.put\(i\)

    while count < total_number:
       executor.submit(task, q.get(block=True))
       count += 1
     print('%s uses %s' % (executor, time.time() - t))

  with UnboundedThreadPoolExecutor() as executor:
     run_perf(executor)
   with futures.ThreadPoolExecutor(max_workers=1) as executor:
     run_perf(executor)
   with futures.ThreadPoolExecutor(max_workers=12) as executor:
     run_perf(executor)


Results:


UnboundedThreadPoolExecutor uses 268.160675049
ThreadPoolExecutor (max_workers=1) uses 79.904583931
ThreadPoolExecutor (max_workers=12) uses 191.179054976

Profiling:
 !profiling.png! 




> Python SDK harness performance degradation with UnboundedThreadPoolExecutor
> ---
>
> Key: BEAM-8944
> URL: https://issues.apache.org/jira/browse/BEAM-8944
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-harness
>Affects Versions: 2.18.0
>Reporter: Yichi Zhang
>Priority: Major
> Attachments: profiling.png
>
>
> We are seeing a performance degradation for python streaming word count load 
> tests.
>  
> After some investigation, it appears to be caused by swapping the original 
> ThreadPoolExecutor to UnboundedThreadPoolExecutor in sdk worker. Suspicion is 
> that python performance is worse with more threads on cpu-bounded tasks.
>  
> A simple test for comparing the multiple thread pool executor performance:
>  
> def test_performance(self):
>    def run_perf(executor):
>      total_number = 100
>      q = queue.Queue()
>     def task(number):
>        hash(number)
>        q.put(number + 200)
>        return number
>     t = time.time()
>      count = 0
>      for i in range(200):
>        q.put\(i\)
>     while count < total_number:
>        executor.submit(task, q.get(block=True))
>        count += 1
>      print('%s uses %s' % (executor, time.time() - t))
>   with UnboundedThreadPoolExecutor() as executor:
>      run_perf(executor)
>    with futures.ThreadPoolExecutor(max_workers=1) as executor:
>      run_perf(executor)
>    with futures.ThreadPoolExecutor(max_workers=12) as executor:
>      run_perf(executor)
> Results:
> UnboundedThreadPoolExecutor (object at 0x7fab400dbe50) uses 268.160675049
> ThreadPoolExecutor (max_workers=1) uses 79.904583931
> ThreadPoolExecutor (max_workers=12) uses 191.179054976
> Profiling:
> UnboundedThreadPoolExecutor:
>  !profiling.png! 
> 1 Thread ThreadPoolExecutor:



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-8944) Python SDK harness performance degradation with UnboundedThreadPoolExecutor

2019-12-11 Thread Yichi Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yichi Zhang updated BEAM-8944:
--
Description: 
We are seeing a performance degradation for python streaming word count load 
tests.

 

After some investigation, it appears to be caused by swapping the original 
ThreadPoolExecutor to UnboundedThreadPoolExecutor in sdk worker. Suspicion is 
that python performance is worse with more threads on cpu-bounded tasks.

 

A simple test for comparing the multiple thread pool executor performance:

 
def test_performance(self):
   def run_perf(executor):
     total_number = 100
     q = queue.Queue()

    def task(number):
       hash(number)
       q.put(number + 200)
       return number

    t = time.time()
     count = 0
     for i in range(200):
       q.put\(i\)

    while count < total_number:
       executor.submit(task, q.get(block=True))
       count += 1
     print('%s uses %s' % (executor, time.time() - t))

  with UnboundedThreadPoolExecutor() as executor:
     run_perf(executor)
   with futures.ThreadPoolExecutor(max_workers=1) as executor:
     run_perf(executor)
   with futures.ThreadPoolExecutor(max_workers=12) as executor:
     run_perf(executor)


Results:


UnboundedThreadPoolExecutor uses 268.160675049
ThreadPoolExecutor (max_workers=1) uses 79.904583931
ThreadPoolExecutor (max_workers=12) uses 191.179054976

Profiling:
 !profiling.png! 



  was:
We are seeing a performance degradation for python streaming word count load 
tests.

 

After some investigation, it appears to be caused by swapping the original 
ThreadPoolExecutor to UnboundedThreadPoolExecutor in sdk worker. Suspicion is 
that python performance is worse with more threads on cpu-bounded tasks.

 

A simple test for comparing the multiple thread pool executor performance:

 
def test_performance(self):
   def run_perf(executor):
     total_number = 100
     q = queue.Queue()

    def task(number):
       hash(number)
       q.put(number + 200)
       return number

    t = time.time()
     count = 0
     for i in range(200):
       q.put\(i\)

    while count < total_number:
       executor.submit(task, q.get(block=True))
       count += 1
     print('%s uses %s' % (executor, time.time() - t))

  with UnboundedThreadPoolExecutor() as executor:
     run_perf(executor)
   with futures.ThreadPoolExecutor(max_workers=1) as executor:
     run_perf(executor)
   with futures.ThreadPoolExecutor(max_workers=12) as executor:
     run_perf(executor)


Results:


UnboundedThreadPoolExecutor uses 268.160675049
ThreadPoolExecutor (max_workers=1) uses 79.904583931
ThreadPoolExecutor (max_workers=12) uses 191.179054976

Profiling:




> Python SDK harness performance degradation with UnboundedThreadPoolExecutor
> ---
>
> Key: BEAM-8944
> URL: https://issues.apache.org/jira/browse/BEAM-8944
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-harness
>Affects Versions: 2.18.0
>Reporter: Yichi Zhang
>Priority: Major
> Attachments: profiling.png
>
>
> We are seeing a performance degradation for python streaming word count load 
> tests.
>  
> After some investigation, it appears to be caused by swapping the original 
> ThreadPoolExecutor to UnboundedThreadPoolExecutor in sdk worker. Suspicion is 
> that python performance is worse with more threads on cpu-bounded tasks.
>  
> A simple test for comparing the multiple thread pool executor performance:
>  
> def test_performance(self):
>    def run_perf(executor):
>      total_number = 100
>      q = queue.Queue()
>     def task(number):
>        hash(number)
>        q.put(number + 200)
>        return number
>     t = time.time()
>      count = 0
>      for i in range(200):
>        q.put\(i\)
>     while count < total_number:
>        executor.submit(task, q.get(block=True))
>        count += 1
>      print('%s uses %s' % (executor, time.time() - t))
>   with UnboundedThreadPoolExecutor() as executor:
>      run_perf(executor)
>    with futures.ThreadPoolExecutor(max_workers=1) as executor:
>      run_perf(executor)
>    with futures.ThreadPoolExecutor(max_workers=12) as executor:
>      run_perf(executor)
> Results:
> UnboundedThreadPoolExecutor (object at 0x7fab400dbe50) uses 268.160675049
> ThreadPoolExecutor (max_workers=1) uses 79.904583931
> ThreadPoolExecutor (max_workers=12) uses 191.179054976
> Profiling:
>  !profiling.png! 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-8944) Python SDK harness performance degradation with UnboundedThreadPoolExecutor

2019-12-11 Thread Yichi Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yichi Zhang updated BEAM-8944:
--
Attachment: profiling.png

> Python SDK harness performance degradation with UnboundedThreadPoolExecutor
> ---
>
> Key: BEAM-8944
> URL: https://issues.apache.org/jira/browse/BEAM-8944
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-harness
>Affects Versions: 2.18.0
>Reporter: Yichi Zhang
>Priority: Major
> Attachments: profiling.png
>
>
> We are seeing a performance degradation for python streaming word count load 
> tests.
>  
> After some investigation, it appears to be caused by swapping the original 
> ThreadPoolExecutor to UnboundedThreadPoolExecutor in sdk worker. Suspicion is 
> that python performance is worse with more threads on cpu-bounded tasks.
>  
> A simple test for comparing the multiple thread pool executor performance:
>  
> def test_performance(self):
>    def run_perf(executor):
>      total_number = 100
>      q = queue.Queue()
>     def task(number):
>        hash(number)
>        q.put(number + 200)
>        return number
>     t = time.time()
>      count = 0
>      for i in range(200):
>        q.put\(i\)
>     while count < total_number:
>        executor.submit(task, q.get(block=True))
>        count += 1
>      print('%s uses %s' % (executor, time.time() - t))
>   with UnboundedThreadPoolExecutor() as executor:
>      run_perf(executor)
>    with futures.ThreadPoolExecutor(max_workers=1) as executor:
>      run_perf(executor)
>    with futures.ThreadPoolExecutor(max_workers=12) as executor:
>      run_perf(executor)
> Results:
> UnboundedThreadPoolExecutor (object at 0x7fab400dbe50) uses 268.160675049
> ThreadPoolExecutor (max_workers=1) uses 79.904583931
> ThreadPoolExecutor (max_workers=12) uses 191.179054976
> Profiling:



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-8944) Python SDK harness performance degradation with UnboundedThreadPoolExecutor

2019-12-11 Thread Yichi Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yichi Zhang updated BEAM-8944:
--
Description: 
We are seeing a performance degradation for python streaming word count load 
tests.

 

After some investigation, it appears to be caused by swapping the original 
ThreadPoolExecutor to UnboundedThreadPoolExecutor in sdk worker. Suspicion is 
that python performance is worse with more threads on cpu-bounded tasks.

 

A simple test for comparing the multiple thread pool executor performance:

 
def test_performance(self):
   def run_perf(executor):
     total_number = 100
     q = queue.Queue()

    def task(number):
       hash(number)
       q.put(number + 200)
       return number

    t = time.time()
     count = 0
     for i in range(200):
       q.put\(i\)

    while count < total_number:
       executor.submit(task, q.get(block=True))
       count += 1
     print('%s uses %s' % (executor, time.time() - t))

  with UnboundedThreadPoolExecutor() as executor:
     run_perf(executor)
   with futures.ThreadPoolExecutor(max_workers=1) as executor:
     run_perf(executor)
   with futures.ThreadPoolExecutor(max_workers=12) as executor:
     run_perf(executor)


Results:


UnboundedThreadPoolExecutor uses 268.160675049
ThreadPoolExecutor (max_workers=1) uses 79.904583931
ThreadPoolExecutor (max_workers=12) uses 191.179054976

Profiling:



  was:
We are seeing a performance degradation for python streaming word count load 
tests.

 

After some investigation, it appears to be caused by swapping the original 
ThreadPoolExecutor to UnboundedThreadPoolExecutor in sdk worker. Suspicion is 
that python performance is worse with more threads on cpu-bounded tasks.

 

A simple test for comparing the multiple thread pool executor performance:

 
def test_performance(self):
   def run_perf(executor):
     total_number = 100
     q = queue.Queue()

    def task(number):
       hash(number)
       q.put(number + 200)
       return number

    t = time.time()
     count = 0
     for i in range(200):
       q.put\(i\)

    while count < total_number:
       executor.submit(task, q.get(block=True))
       count += 1
     print('%s uses %s' % (executor, time.time() - t))

  with UnboundedThreadPoolExecutor() as executor:
     run_perf(executor)
   with futures.ThreadPoolExecutor(max_workers=1) as executor:
     run_perf(executor)
   with futures.ThreadPoolExecutor(max_workers=12) as executor:
     run_perf(executor)


Results:


UnboundedThreadPoolExecutor uses 268.160675049
ThreadPoolExecutor (max_workers=1) uses 79.904583931
ThreadPoolExecutor (max_workers=12) uses 191.179054976


> Python SDK harness performance degradation with UnboundedThreadPoolExecutor
> ---
>
> Key: BEAM-8944
> URL: https://issues.apache.org/jira/browse/BEAM-8944
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-harness
>Affects Versions: 2.18.0
>Reporter: Yichi Zhang
>Priority: Major
> Attachments: profiling.png
>
>
> We are seeing a performance degradation for python streaming word count load 
> tests.
>  
> After some investigation, it appears to be caused by swapping the original 
> ThreadPoolExecutor to UnboundedThreadPoolExecutor in sdk worker. Suspicion is 
> that python performance is worse with more threads on cpu-bounded tasks.
>  
> A simple test for comparing the multiple thread pool executor performance:
>  
> def test_performance(self):
>    def run_perf(executor):
>      total_number = 100
>      q = queue.Queue()
>     def task(number):
>        hash(number)
>        q.put(number + 200)
>        return number
>     t = time.time()
>      count = 0
>      for i in range(200):
>        q.put\(i\)
>     while count < total_number:
>        executor.submit(task, q.get(block=True))
>        count += 1
>      print('%s uses %s' % (executor, time.time() - t))
>   with UnboundedThreadPoolExecutor() as executor:
>      run_perf(executor)
>    with futures.ThreadPoolExecutor(max_workers=1) as executor:
>      run_perf(executor)
>    with futures.ThreadPoolExecutor(max_workers=12) as executor:
>      run_perf(executor)
> Results:
> UnboundedThreadPoolExecutor (object at 0x7fab400dbe50) uses 268.160675049
> ThreadPoolExecutor (max_workers=1) uses 79.904583931
> ThreadPoolExecutor (max_workers=12) uses 191.179054976
> Profiling:



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-8944) Python SDK harness performance degradation with UnboundedThreadPoolExecutor

2019-12-11 Thread Yichi Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yichi Zhang updated BEAM-8944:
--
Affects Version/s: (was: 2.17.0)

> Python SDK harness performance degradation with UnboundedThreadPoolExecutor
> ---
>
> Key: BEAM-8944
> URL: https://issues.apache.org/jira/browse/BEAM-8944
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-harness
>Affects Versions: 2.18.0
>Reporter: Yichi Zhang
>Priority: Major
>
> We are seeing a performance degradation for python streaming word count load 
> tests.
>  
> After some investigation, it appears to be caused by swapping the original 
> ThreadPoolExecutor to UnboundedThreadPoolExecutor in sdk worker. Suspicion is 
> that python performance is worse with more threads on cpu-bounded tasks.
>  
> A simple test for comparing the multiple thread pool executor performance:
>  
> def test_performance(self):
>    def run_perf(executor):
>      total_number = 100
>      q = queue.Queue()
>     def task(number):
>        hash(number)
>        q.put(number + 200)
>        return number
>     t = time.time()
>      count = 0
>      for i in range(200):
>        q.put\(i\)
>     while count < total_number:
>        executor.submit(task, q.get(block=True))
>        count += 1
>      print('%s uses %s' % (executor, time.time() - t))
>   with UnboundedThreadPoolExecutor() as executor:
>      run_perf(executor)
>    with futures.ThreadPoolExecutor(max_workers=1) as executor:
>      run_perf(executor)
>    with futures.ThreadPoolExecutor(max_workers=12) as executor:
>      run_perf(executor)
> Results:
> UnboundedThreadPoolExecutor (object at 0x7fab400dbe50) uses 268.160675049
> ThreadPoolExecutor (max_workers=1) uses 79.904583931
> ThreadPoolExecutor (max_workers=12) uses 191.179054976



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (BEAM-8944) Python SDK harness performance degradation with UnboundedThreadPoolExecutor

2019-12-10 Thread Yichi Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-8944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16993141#comment-16993141
 ] 

Yichi Zhang edited comment on BEAM-8944 at 12/11/19 2:28 AM:
-

CC: [~angoenka] [~lcwik]


was (Author: yichi):
CC: [~angoenka]

> Python SDK harness performance degradation with UnboundedThreadPoolExecutor
> ---
>
> Key: BEAM-8944
> URL: https://issues.apache.org/jira/browse/BEAM-8944
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-harness
>Affects Versions: 2.17.0, 2.18.0
>Reporter: Yichi Zhang
>Priority: Major
>
> We are seeing a performance degradation for python streaming word count load 
> tests.
>  
> After some investigation, it appears to be caused by swapping the original 
> ThreadPoolExecutor to UnboundedThreadPoolExecutor in sdk worker. Suspicion is 
> that python performance is worse with more threads on cpu-bounded tasks.
>  
> A simple test for comparing the multiple thread pool executor performance:
>  
> def test_performance(self):
>    def run_perf(executor):
>      total_number = 100
>      q = queue.Queue()
>     def task(number):
>        hash(number)
>        q.put(number + 200)
>        return number
>     t = time.time()
>      count = 0
>      for i in range(200):
>        q.put\(i\)
>     while count < total_number:
>        executor.submit(task, q.get(block=True))
>        count += 1
>      print('%s uses %s' % (executor, time.time() - t))
>   with UnboundedThreadPoolExecutor() as executor:
>      run_perf(executor)
>    with futures.ThreadPoolExecutor(max_workers=1) as executor:
>      run_perf(executor)
>    with futures.ThreadPoolExecutor(max_workers=12) as executor:
>      run_perf(executor)
> Results:
> UnboundedThreadPoolExecutor (object at 0x7fab400dbe50) uses 268.160675049
> ThreadPoolExecutor (max_workers=1) uses 79.904583931
> ThreadPoolExecutor (max_workers=12) uses 191.179054976



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (BEAM-8944) Python SDK harness performance degradation with UnboundedThreadPoolExecutor

2019-12-10 Thread Yichi Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-8944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16993141#comment-16993141
 ] 

Yichi Zhang commented on BEAM-8944:
---

CC: [~angoenka]

> Python SDK harness performance degradation with UnboundedThreadPoolExecutor
> ---
>
> Key: BEAM-8944
> URL: https://issues.apache.org/jira/browse/BEAM-8944
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-harness
>Affects Versions: 2.17.0, 2.18.0
>Reporter: Yichi Zhang
>Priority: Major
>
> We are seeing a performance degradation for python streaming word count load 
> tests.
>  
> After some investigation, it appears to be caused by swapping the original 
> ThreadPoolExecutor to UnboundedThreadPoolExecutor in sdk worker. Suspicion is 
> that python performance is worse with more threads on cpu-bounded tasks.
>  
> A simple test for comparing the multiple thread pool executor performance:
>  
> def test_performance(self):
>    def run_perf(executor):
>      total_number = 100
>      q = queue.Queue()
>     def task(number):
>        hash(number)
>        q.put(number + 200)
>        return number
>     t = time.time()
>      count = 0
>      for i in range(200):
>        q.put\(i\)
>     while count < total_number:
>        executor.submit(task, q.get(block=True))
>        count += 1
>      print('%s uses %s' % (executor, time.time() - t))
>   with UnboundedThreadPoolExecutor() as executor:
>      run_perf(executor)
>    with futures.ThreadPoolExecutor(max_workers=1) as executor:
>      run_perf(executor)
>    with futures.ThreadPoolExecutor(max_workers=12) as executor:
>      run_perf(executor)
> Results:
> UnboundedThreadPoolExecutor (object at 0x7fab400dbe50) uses 268.160675049
> ThreadPoolExecutor (max_workers=1) uses 79.904583931
> ThreadPoolExecutor (max_workers=12) uses 191.179054976



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-8944) Python SDK harness performance degradation with UnboundedThreadPoolExecutor

2019-12-10 Thread Yichi Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yichi Zhang updated BEAM-8944:
--
Description: 
We are seeing a performance degradation for python streaming word count load 
tests.

 

After some investigation, it appears to be caused by swapping the original 
ThreadPoolExecutor to UnboundedThreadPoolExecutor in sdk worker. Suspicion is 
that python performance is worse with more threads on cpu-bounded tasks.

 

A simple test for comparing the multiple thread pool executor performance:

 
def test_performance(self):
   def run_perf(executor):
     total_number = 100
     q = queue.Queue()

    def task(number):
       hash(number)
       q.put(number + 200)
       return number

    t = time.time()
     count = 0
     for i in range(200):
       q.put\(i\)

    while count < total_number:
       executor.submit(task, q.get(block=True))
       count += 1
     print('%s uses %s' % (executor, time.time() - t))

  with UnboundedThreadPoolExecutor() as executor:
     run_perf(executor)
   with futures.ThreadPoolExecutor(max_workers=1) as executor:
     run_perf(executor)
   with futures.ThreadPoolExecutor(max_workers=12) as executor:
     run_perf(executor)


Results:


UnboundedThreadPoolExecutor uses 268.160675049
ThreadPoolExecutor (max_workers=1) uses 79.904583931
ThreadPoolExecutor (max_workers=12) uses 191.179054976

  was:
We are seeing a performance degradation for python streaming word count load 
tests.

 

After some investigation, it appears to be caused by swapping the original 
ThreadPoolExecutor to UnboundedThreadPoolExecutor in sdk worker. Suspicion is 
that python performance is worse with more threads on cpu-bounded tasks.

 

A simple test for comparing the multiple thread pool executor performance:

 

```

def test_performance(self):
   def run_perf(executor):
     total_number = 100
     q = queue.Queue()

    def task(number):
       hash(number)
       q.put(number + 200)
       return number

    t = time.time()
     count = 0
     for i in range(200):
       q.put\(i\)

    while count < total_number:
       executor.submit(task, q.get(block=True))
       count += 1
     print('%s uses %s' % (executor, time.time() - t))

  with UnboundedThreadPoolExecutor() as executor:
     run_perf(executor)
   with futures.ThreadPoolExecutor(max_workers=1) as executor:
     run_perf(executor)
   with futures.ThreadPoolExecutor(max_workers=12) as executor:
     run_perf(executor)

``` 

Results:

```

UnboundedThreadPoolExecutor uses 268.160675049
ThreadPoolExecutor (max_workers=1) uses 79.904583931
ThreadPoolExecutor (max_workers=12) uses 191.179054976
 ```


> Python SDK harness performance degradation with UnboundedThreadPoolExecutor
> ---
>
> Key: BEAM-8944
> URL: https://issues.apache.org/jira/browse/BEAM-8944
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-harness
>Affects Versions: 2.17.0, 2.18.0
>Reporter: Yichi Zhang
>Priority: Major
>
> We are seeing a performance degradation for python streaming word count load 
> tests.
>  
> After some investigation, it appears to be caused by swapping the original 
> ThreadPoolExecutor to UnboundedThreadPoolExecutor in sdk worker. Suspicion is 
> that python performance is worse with more threads on cpu-bounded tasks.
>  
> A simple test for comparing the multiple thread pool executor performance:
>  
> def test_performance(self):
>    def run_perf(executor):
>      total_number = 100
>      q = queue.Queue()
>     def task(number):
>        hash(number)
>        q.put(number + 200)
>        return number
>     t = time.time()
>      count = 0
>      for i in range(200):
>        q.put\(i\)
>     while count < total_number:
>        executor.submit(task, q.get(block=True))
>        count += 1
>      print('%s uses %s' % (executor, time.time() - t))
>   with UnboundedThreadPoolExecutor() as executor:
>      run_perf(executor)
>    with futures.ThreadPoolExecutor(max_workers=1) as executor:
>      run_perf(executor)
>    with futures.ThreadPoolExecutor(max_workers=12) as executor:
>      run_perf(executor)
> Results:
> UnboundedThreadPoolExecutor (object at 0x7fab400dbe50) uses 268.160675049
> ThreadPoolExecutor (max_workers=1) uses 79.904583931
> ThreadPoolExecutor (max_workers=12) uses 191.179054976



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-8944) Python SDK harness performance degradation with UnboundedThreadPoolExecutor

2019-12-10 Thread Yichi Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yichi Zhang updated BEAM-8944:
--
Description: 
We are seeing a performance degradation for python streaming word count load 
tests.

 

After some investigation, it appears to be caused by swapping the original 
ThreadPoolExecutor to UnboundedThreadPoolExecutor in sdk worker. Suspicion is 
that python performance is worse with more threads on cpu-bounded tasks.

 

A simple test for comparing the multiple thread pool executor performance:

 

```

def test_performance(self):
   def run_perf(executor):
     total_number = 100
     q = queue.Queue()

    def task(number):
       hash(number)
       q.put(number + 200)
       return number

    t = time.time()
     count = 0
     for i in range(200):
       q.put\(i\)

    while count < total_number:
       executor.submit(task, q.get(block=True))
       count += 1
     print('%s uses %s' % (executor, time.time() - t))

  with UnboundedThreadPoolExecutor() as executor:
     run_perf(executor)
   with futures.ThreadPoolExecutor(max_workers=1) as executor:
     run_perf(executor)
   with futures.ThreadPoolExecutor(max_workers=12) as executor:
     run_perf(executor)

``` 

Results:

```

UnboundedThreadPoolExecutor uses 268.160675049
ThreadPoolExecutor (max_workers=1) uses 79.904583931
ThreadPoolExecutor (max_workers=12) uses 191.179054976
 ```

  was:
We are seeing a performance degradation for python streaming word count load 
tests.

 

After some investigation, it appears to be caused by swapping the original 
ThreadPoolExecutor to UnboundedThreadPoolExecutor in sdk worker. Suspicion is 
that python performance is worse with more threads on cpu-bounded tasks.

 

A simple test for comparing the multiple thread pool executor performance:

 

```

def test_performance(self):
   def run_perf(executor):
     total_number = 100
     q = queue.Queue()

    def task(number):
       hash(number)
       q.put(number + 200)
       return number

    t = time.time()
     count = 0
     for i in range(200):
       q.put(i)

    while count < total_number:
       executor.submit(task, q.get(block=True))
       count += 1
     print('%s uses %s' % (executor, time.time() - t))

  with UnboundedThreadPoolExecutor() as executor:
     run_perf(executor)
   with futures.ThreadPoolExecutor(max_workers=1) as executor:
     run_perf(executor)
   with futures.ThreadPoolExecutor(max_workers=12) as executor:
     run_perf(executor)

``` 

Results:

```

UnboundedThreadPoolExecutor uses 268.160675049
ThreadPoolExecutor (max_workers=1) uses 79.904583931
ThreadPoolExecutor (max_workers=12) uses 191.179054976
```


> Python SDK harness performance degradation with UnboundedThreadPoolExecutor
> ---
>
> Key: BEAM-8944
> URL: https://issues.apache.org/jira/browse/BEAM-8944
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-harness
>Affects Versions: 2.17.0, 2.18.0
>Reporter: Yichi Zhang
>Priority: Major
>
> We are seeing a performance degradation for python streaming word count load 
> tests.
>  
> After some investigation, it appears to be caused by swapping the original 
> ThreadPoolExecutor to UnboundedThreadPoolExecutor in sdk worker. Suspicion is 
> that python performance is worse with more threads on cpu-bounded tasks.
>  
> A simple test for comparing the multiple thread pool executor performance:
>  
> ```
> def test_performance(self):
>   def run_perf(executor):
>     total_number = 100
>     q = queue.Queue()
>     def task(number):
>       hash(number)
>       q.put(number + 200)
>       return number
>     t = time.time()
>     count = 0
>     for i in range(200):
>       q.put(i)
>     while count < total_number:
>       executor.submit(task, q.get(block=True))
>       count += 1
>     print('%s uses %s' % (executor, time.time() - t))
>   with UnboundedThreadPoolExecutor() as executor:
>     run_perf(executor)
>   with futures.ThreadPoolExecutor(max_workers=1) as executor:
>     run_perf(executor)
>   with futures.ThreadPoolExecutor(max_workers=12) as executor:
>     run_perf(executor)
> ``` 
> Results:
> ```
> UnboundedThreadPoolExecutor uses 268.160675049
> ThreadPoolExecutor (max_workers=1) uses 79.904583931
> ThreadPoolExecutor (max_workers=12) uses 191.179054976
> ```



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (BEAM-8944) Python SDK harness performance degradation with UnboundedThreadPoolExecutor

2019-12-10 Thread Yichi Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yichi Zhang updated BEAM-8944:
--
Description: 
We are seeing a performance degradation for python streaming word count load 
tests.

 

After some investigation, it appears to be caused by swapping the original 
ThreadPoolExecutor to UnboundedThreadPoolExecutor in sdk worker. Suspicion is 
that python performance is worse with more threads on cpu-bounded tasks.

 

A simple test for comparing the multiple thread pool executor performance:

 

```

def test_performance(self):
  def run_perf(executor):
    total_number = 100
    q = queue.Queue()

    def task(number):
      hash(number)
      q.put(number + 200)
      return number

    t = time.time()
    count = 0
    for i in range(200):
      q.put(i)

    while count < total_number:
      executor.submit(task, q.get(block=True))
      count += 1
    print('%s uses %s' % (executor, time.time() - t))

  with UnboundedThreadPoolExecutor() as executor:
    run_perf(executor)
  with futures.ThreadPoolExecutor(max_workers=1) as executor:
    run_perf(executor)
  with futures.ThreadPoolExecutor(max_workers=12) as executor:
    run_perf(executor)

``` 

Results:

```

UnboundedThreadPoolExecutor uses 268.160675049
ThreadPoolExecutor (max_workers=1) uses 79.904583931
ThreadPoolExecutor (max_workers=12) uses 191.179054976
```

  was:
We are seeing a performance degradation for python streaming word count load 
tests.

 

After some investigation, it appears to be caused by swapping the original 
ThreadPoolExecutor to UnboundedThreadPoolExecutor in sdk worker. Suspicion is 
that python performance is worse with more threads on cpu-bounded tasks.

 

A simple test for comparing the multiple thread pool executor performance:

 

```

def test_performance(self):
  def run_perf(executor):
    total_number = 100
    q = queue.Queue()

    def task(number):
      hash(number)
      q.put(number + 200)
      return number

    t = time.time()
    count = 0
    for i in range(200):
      q.put(i)

    while count < total_number:
      executor.submit(task, q.get(block=True))
      count += 1
    print('%s uses %s' % (executor, time.time() - t))


  with UnboundedThreadPoolExecutor() as executor:
    run_perf(executor)
  with futures.ThreadPoolExecutor(max_workers=1) as executor:
    run_perf(executor)
  with futures.ThreadPoolExecutor(max_workers=12) as executor:
    run_perf(executor)


``` 

Results:

```

UnboundedThreadPoolExecutor uses 268.160675049
ThreadPoolExecutor (max_workers=1) uses 79.904583931
ThreadPoolExecutor (max_workers=12) uses 191.179054976
```


> Python SDK harness performance degradation with UnboundedThreadPoolExecutor
> ---
>
> Key: BEAM-8944
> URL: https://issues.apache.org/jira/browse/BEAM-8944
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-harness
>Affects Versions: 2.17.0, 2.18.0
>Reporter: Yichi Zhang
>Priority: Major
>
> We are seeing a performance degradation for python streaming word count load 
> tests.
>  
> After some investigation, it appears to be caused by swapping the original 
> ThreadPoolExecutor to UnboundedThreadPoolExecutor in sdk worker. Suspicion is 
> that python performance is worse with more threads on cpu-bounded tasks.
>  
> A simple test for comparing the multiple thread pool executor performance:
>  
> ```
> def test_performance(self):
>   def run_perf(executor):
>     total_number = 100
>     q = queue.Queue()
>     def task(number):
>       hash(number)
>       q.put(number + 200)
>       return number
>     t = time.time()
>     count = 0
>     for i in range(200):
>       q.put(i)
>     while count < total_number:
>       executor.submit(task, q.get(block=True))
>       count += 1
>     print('%s uses %s' % (executor, time.time() - t))
>   with UnboundedThreadPoolExecutor() as executor:
>     run_perf(executor)
>   with futures.ThreadPoolExecutor(max_workers=1) as executor:
>     run_perf(executor)
>   with futures.ThreadPoolExecutor(max_workers=12) as executor:
>     run_perf(executor)
> ``` 
> Results:
> ```
> UnboundedThreadPoolExecutor uses 268.160675049
> ThreadPoolExecutor (max_workers=1) uses 79.904583931
> ThreadPoolExecutor (max_workers=12) uses 191.179054976
> ```



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (BEAM-8944) Python SDK harness performance degradation with UnboundedThreadPoolExecutor

2019-12-10 Thread Yichi Zhang (Jira)
Yichi Zhang created BEAM-8944:
-

 Summary: Python SDK harness performance degradation with 
UnboundedThreadPoolExecutor
 Key: BEAM-8944
 URL: https://issues.apache.org/jira/browse/BEAM-8944
 Project: Beam
  Issue Type: Bug
  Components: sdk-py-harness
Affects Versions: 2.17.0, 2.18.0
Reporter: Yichi Zhang


We are seeing a performance degradation for python streaming word count load 
tests.

 

After some investigation, it appears to be caused by swapping the original 
ThreadPoolExecutor to UnboundedThreadPoolExecutor in sdk worker. Suspicion is 
that python performance is worse with more threads on cpu-bounded tasks.

 

A simple test for comparing the multiple thread pool executor performance:

 

```

def test_performance(self):
  def run_perf(executor):
    total_number = 100
    q = queue.Queue()

    def task(number):
      hash(number)
      q.put(number + 200)
      return number

    t = time.time()
    count = 0
    for i in range(200):
      q.put(i)

    while count < total_number:
      executor.submit(task, q.get(block=True))
      count += 1
    print('%s uses %s' % (executor, time.time() - t))


  with UnboundedThreadPoolExecutor() as executor:
    run_perf(executor)
  with futures.ThreadPoolExecutor(max_workers=1) as executor:
    run_perf(executor)
  with futures.ThreadPoolExecutor(max_workers=12) as executor:
    run_perf(executor)


``` 

Results:

```

UnboundedThreadPoolExecutor uses 268.160675049
ThreadPoolExecutor (max_workers=1) uses 79.904583931
ThreadPoolExecutor (max_workers=12) uses 191.179054976
```



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (BEAM-8580) Request Python API to support windows ClosingBehavior

2019-12-09 Thread Yichi Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/BEAM-8580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yichi Zhang reassigned BEAM-8580:
-

Assignee: Yichi Zhang

> Request Python API to support windows ClosingBehavior
> -
>
> Key: BEAM-8580
> URL: https://issues.apache.org/jira/browse/BEAM-8580
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-py-core
>Reporter: wendy liu
>Assignee: Yichi Zhang
>Priority: Major
>
> Beam Python should have an API to support windows ClosingBehavior.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (BEAM-8884) Python MongoDBIO TypeError when splitting

2019-12-03 Thread Yichi Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/BEAM-8884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16987433#comment-16987433
 ] 

Yichi Zhang edited comment on BEAM-8884 at 12/4/19 2:23 AM:


Created another Jira  https://issues.apache.org/jira/browse/BEAM-8886 for a 
follow up to cover the failure scenario in test.


was (Author: yichi):
Created another Jira  https://issues.apache.org/jira/browse/BEAM-8886 for a 
follow up to cover the failure scenario.

> Python MongoDBIO TypeError when splitting
> -
>
> Key: BEAM-8884
> URL: https://issues.apache.org/jira/browse/BEAM-8884
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Brian Hulette
>Assignee: Yichi Zhang
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> From [slack|https://the-asf.slack.com/archives/CBDNLQZM1/p1575350991134000]:
> I am trying to run a pipeline (defined with the Python SDK) on Dataflow that 
> uses beam.io.ReadFromMongoDB. When dealing with very small datasets (<10mb) 
> it runs fine, when trying to run it with slightly larger datasets (70mb), I 
> always get this error:
> {code:}
> TypeError: '<' not supported between instances of 'dict' and 'ObjectId'
> {code}
> Stack trace see below. Running it on a local machine works just fine. I would 
> highly appreciate any pointers what this could be.
> I hope this is the right channel do address this.
> {code:}
> Traceback (most recent call last):
>   File 
> "/usr/local/lib/python3.7/site-packages/dataflow_worker/batchworker.py", line 
> 649, in do_work
> work_executor.execute()
>   File "/usr/local/lib/python3.7/site-packages/dataflow_worker/executor.py", 
> line 218, in execute
> self._split_task)
>   File "/usr/local/lib/python3.7/site-packages/dataflow_worker/executor.py", 
> line 226, in _perform_source_split_considering_api_limits
> desired_bundle_size)
>   File "/usr/local/lib/python3.7/site-packages/dataflow_worker/executor.py", 
> line 263, in _perform_source_split
> for split in source.split(desired_bundle_size):
>   File "/usr/local/lib/python3.7/site-packages/apache_beam/io/mongodbio.py", 
> line 174, in split
> bundle_end = min(stop_position, split_key_id)
> TypeError: '<' not supported between instances of 'dict' and 'ObjectId'
> {code}
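The failing comparison can be reproduced outside Beam with a small sketch like the one below (it assumes pymongo, which provides bson, is installed; the dict value is a hypothetical stand-in for whatever reaches mongodbio.py as stop_position, not the real splitting input):

{code:}
# Minimal reproduction of the TypeError seen during ReadFromMongoDB splitting.
from bson.objectid import ObjectId

stop_position = {'$maxKey': 1}   # hypothetical dict boundary passed as stop_position
split_key_id = ObjectId()

try:
    # min(stop_position, split_key_id) in mongodbio.py boils down to an ordering
    # comparison between a dict and an ObjectId, which Python 3 rejects:
    stop_position < split_key_id
except TypeError as err:
    print(err)  # '<' not supported between instances of 'dict' and 'ObjectId'
{code}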



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

