[jira] [Created] (BEAM-10189) Add ValueState to python sdk
Yichi Zhang created BEAM-10189: -- Summary: Add ValueState to python sdk Key: BEAM-10189 URL: https://issues.apache.org/jira/browse/BEAM-10189 Project: Beam Issue Type: Improvement Components: sdk-py-core Reporter: Yichi Zhang Assignee: Yichi Zhang -- This message was sent by Atlassian Jira (v8.3.4#803005)
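The ValueState being proposed here is a per-key state cell holding a single value with read/write/clear semantics (the Python SDK already exposes bag and combining state). The sketch below is a plain-Python model of those semantics only, not the actual apache_beam API; the class and method names are illustrative assumptions.

```python
# Illustrative sketch: models the read/modify/write semantics a ValueState
# cell would expose per key. NOT the apache_beam userstate API.
class ValueState:
    """A per-key state cell holding a single value."""
    _UNSET = object()

    def __init__(self):
        self._value = self._UNSET

    def read(self, default=None):
        # Reading an unset cell yields the default, mirroring "empty" state.
        return default if self._value is self._UNSET else self._value

    def write(self, value):
        self._value = value

    def clear(self):
        self._value = self._UNSET


# Example: a per-key running maximum, one state cell per key.
cells = {}
for key, value in [("a", 3), ("b", 5), ("a", 7), ("b", 2)]:
    cell = cells.setdefault(key, ValueState())
    cell.write(max(cell.read(default=value), value))

assert cells["a"].read() == 7
assert cells["b"].read() == 5
```

In a real stateful DoFn the runner would supply one such cell per key and window; the dict keyed by `key` above stands in for that scoping.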
[jira] [Resolved] (BEAM-9603) Support Dynamic Timer in Java SDK over FnApi
[ https://issues.apache.org/jira/browse/BEAM-9603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yichi Zhang resolved BEAM-9603. --- Fix Version/s: 2.22.0 Resolution: Fixed > Support Dynamic Timer in Java SDK over FnApi > > > Key: BEAM-9603 > URL: https://issues.apache.org/jira/browse/BEAM-9603 > Project: Beam > Issue Type: New Feature > Components: sdk-java-harness >Reporter: Boyuan Zhang >Assignee: Yichi Zhang >Priority: P2 > Labels: stale-assigned > Fix For: 2.22.0 > > Time Spent: 8.5h > Remaining Estimate: 0h >
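Dynamic timers (timer families) let one timer declaration address many timers at runtime, each under a string tag chosen while processing. The following is a plain-Python model of that behavior, assuming event-time semantics and a watermark-driven firing loop; it is not the Beam FnApi implementation.

```python
# Illustrative sketch: a "timer family" lets a DoFn set many timers under a
# single declaration, each addressed by a dynamic string tag. Plain-Python
# model only; names and semantics are simplified assumptions.
class TimerFamily:
    def __init__(self):
        self._timers = {}  # tag -> firing timestamp

    def set(self, tag, timestamp):
        # Setting the same tag again overwrites its previous firing time.
        self._timers[tag] = timestamp

    def clear(self, tag):
        self._timers.pop(tag, None)

    def fire_up_to(self, watermark):
        # Fire (and remove) all timers whose time has been reached, in
        # timestamp order, returning (tag, timestamp) pairs.
        due = sorted((t, tag) for tag, t in self._timers.items() if t <= watermark)
        for t, tag in due:
            del self._timers[tag]
        return [(tag, t) for t, tag in due]


family = TimerFamily()
family.set("user-a", 10)
family.set("user-b", 5)
family.set("user-a", 3)   # dynamic overwrite of the "user-a" timer
assert family.fire_up_to(4) == [("user-a", 3)]
assert family.fire_up_to(20) == [("user-b", 5)]
```

The key property the sketch demonstrates is that tags are data, not declarations: a DoFn can set one timer per user, session, or key fragment without declaring each in advance.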
[jira] [Resolved] (BEAM-9263) Bump python sdk fnapi version to enable status reporting
[ https://issues.apache.org/jira/browse/BEAM-9263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yichi Zhang resolved BEAM-9263. --- Fix Version/s: 2.20.0 Resolution: Fixed > Bump python sdk fnapi version to enable status reporting > > > Key: BEAM-9263 > URL: https://issues.apache.org/jira/browse/BEAM-9263 > Project: Beam > Issue Type: Task > Components: sdk-py-core >Affects Versions: 2.20.0 >Reporter: Yichi Zhang >Assignee: Yichi Zhang >Priority: P3 > Labels: stale-assigned > Fix For: 2.20.0 > > Time Spent: 1h 50m > Remaining Estimate: 0h > > Bump python sdk fn api environment version to 8 to roll out the status > feature for sdk harness status reporting.
[jira] [Updated] (BEAM-10153) Java SDK Harness throws NPE for VoidCoder as key coder
[ https://issues.apache.org/jira/browse/BEAM-10153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yichi Zhang updated BEAM-10153: --- Description: When using VoidCoder as the key coder, NullPointerException will be thrown for processing element. {code:java} java.lang.NullPointerException at org.apache.beam.model.fnexecution.v1.BeamFnApi$StateKey$BagUserState$Builder.setKey(BeamFnApi.java:39390) at org.apache.beam.fn.harness.state.FnApiStateAccessor.createBagUserStateKey(FnApiStateAccessor.java:444) at org.apache.beam.fn.harness.state.FnApiStateAccessor.bindValue(FnApiStateAccessor.java:195) at org.apache.beam.sdk.state.StateSpecs$ValueStateSpec.bind(StateSpecs.java:327) at org.apache.beam.sdk.state.StateSpecs$ValueStateSpec.bind(StateSpecs.java:317) at org.apache.beam.fn.harness.FnApiDoFnRunner$ProcessBundleContext.state(FnApiDoFnRunner.java:1432) at org.apache.beam.sdk.transforms.ParDoTest$TimerTests$TwoTimerTest$1$DoFnInvoker.invokeProcessElement(Unknown Source) at org.apache.beam.fn.harness.FnApiDoFnRunner.processElementForParDo(FnApiDoFnRunner.java:740) at org.apache.beam.fn.harness.FnApiDoFnRunner.access$700(FnApiDoFnRunner.java:132) at org.apache.beam.fn.harness.FnApiDoFnRunner$Factory.lambda$createRunnerForPTransform$1(FnApiDoFnRunner.java:203) at org.apache.beam.fn.harness.data.PCollectionConsumerRegistry$MetricTrackingFnDataReceiver.accept(PCollectionConsumerRegistry.java:216) at org.apache.beam.fn.harness.data.PCollectionConsumerRegistry$MetricTrackingFnDataReceiver.accept(PCollectionConsumerRegistry.java:179) at org.apache.beam.fn.harness.BeamFnDataReadRunner.forwardElementToConsumer(BeamFnDataReadRunner.java:204) at org.apache.beam.fn.harness.data.QueueingBeamFnDataClient.drainAndBlock(QueueingBeamFnDataClient.java:106) at org.apache.beam.fn.harness.control.ProcessBundleHandler.processBundle(ProcessBundleHandler.java:295) at 
org.apache.beam.fn.harness.control.BeamFnControlClient.delegateOnInstructionRequestType(BeamFnControlClient.java:173) at org.apache.beam.fn.harness.control.BeamFnControlClient.lambda$processInstructionRequests$0(BeamFnControlClient.java:157) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) {code} This fails validate runner tests such as https://github.com/apache/beam/blob/03d99dfa359f44a29a772fcc8ec8b0a237cab113/sdks/java/core/src/test/java/org/apache/beam/sdk/transforms/ParDoTest.java#L4120 on dataflow runner with beam_fn_api. was: When using VoidCoder as the key coder, NullPointerException will be thrown for accessing state. {code:java} java.lang.NullPointerException at org.apache.beam.model.fnexecution.v1.BeamFnApi$StateKey$BagUserState$Builder.setKey(BeamFnApi.java:39390) at org.apache.beam.fn.harness.state.FnApiStateAccessor.createBagUserStateKey(FnApiStateAccessor.java:444) at org.apache.beam.fn.harness.state.FnApiStateAccessor.bindValue(FnApiStateAccessor.java:195) at org.apache.beam.sdk.state.StateSpecs$ValueStateSpec.bind(StateSpecs.java:327) at org.apache.beam.sdk.state.StateSpecs$ValueStateSpec.bind(StateSpecs.java:317) at org.apache.beam.fn.harness.FnApiDoFnRunner$ProcessBundleContext.state(FnApiDoFnRunner.java:1432) at org.apache.beam.sdk.transforms.ParDoTest$TimerTests$TwoTimerTest$1$DoFnInvoker.invokeProcessElement(Unknown Source) at org.apache.beam.fn.harness.FnApiDoFnRunner.processElementForParDo(FnApiDoFnRunner.java:740) at org.apache.beam.fn.harness.FnApiDoFnRunner.access$700(FnApiDoFnRunner.java:132) at org.apache.beam.fn.harness.FnApiDoFnRunner$Factory.lambda$createRunnerForPTransform$1(FnApiDoFnRunner.java:203) at org.apache.beam.fn.harness.data.PCollectionConsumerRegistry$MetricTrackingFnDataReceiver.accept(PCollectionConsumerRegistry.java:216) at 
org.apache.beam.fn.harness.data.PCollectionConsumerRegistry$MetricTrackingFnDataReceiver.accept(PCollectionConsumerRegistry.java:179) at org.apache.beam.fn.harness.BeamFnDataReadRunner.forwardElementToConsumer(BeamFnDataReadRunner.java:204) at org.apache.beam.fn.harness.data.QueueingBeamFnDataClient.drainAndBlock(QueueingBeamFnDataClient.java:106) at org.apache.beam.fn.harness.control.ProcessBundleHandler.processBundle(ProcessBundleHandler.java:295) at org.apache.beam.fn.harness.control.BeamFnControlClient.delegateOnInstructionRequestType(BeamFnControlClient.java:173) at org.apache.beam.fn.harness.control.BeamFnControlClient.lambda$processInstructionRequests$0(BeamFnControlClient.java:157) at
[jira] [Created] (BEAM-10153) Java SDK Harness throws NPE for VoidCoder for key coder
Yichi Zhang created BEAM-10153: -- Summary: Java SDK Harness throws NPE for VoidCoder for key coder Key: BEAM-10153 URL: https://issues.apache.org/jira/browse/BEAM-10153 Project: Beam Issue Type: Bug Components: sdk-java-harness Reporter: Yichi Zhang When using VoidCoder as the key coder, NullPointerException will be thrown for accessing state. {code:java} java.lang.NullPointerException at org.apache.beam.model.fnexecution.v1.BeamFnApi$StateKey$BagUserState$Builder.setKey(BeamFnApi.java:39390) at org.apache.beam.fn.harness.state.FnApiStateAccessor.createBagUserStateKey(FnApiStateAccessor.java:444) at org.apache.beam.fn.harness.state.FnApiStateAccessor.bindValue(FnApiStateAccessor.java:195) at org.apache.beam.sdk.state.StateSpecs$ValueStateSpec.bind(StateSpecs.java:327) at org.apache.beam.sdk.state.StateSpecs$ValueStateSpec.bind(StateSpecs.java:317) at org.apache.beam.fn.harness.FnApiDoFnRunner$ProcessBundleContext.state(FnApiDoFnRunner.java:1432) at org.apache.beam.sdk.transforms.ParDoTest$TimerTests$TwoTimerTest$1$DoFnInvoker.invokeProcessElement(Unknown Source) at org.apache.beam.fn.harness.FnApiDoFnRunner.processElementForParDo(FnApiDoFnRunner.java:740) at org.apache.beam.fn.harness.FnApiDoFnRunner.access$700(FnApiDoFnRunner.java:132) at org.apache.beam.fn.harness.FnApiDoFnRunner$Factory.lambda$createRunnerForPTransform$1(FnApiDoFnRunner.java:203) at org.apache.beam.fn.harness.data.PCollectionConsumerRegistry$MetricTrackingFnDataReceiver.accept(PCollectionConsumerRegistry.java:216) at org.apache.beam.fn.harness.data.PCollectionConsumerRegistry$MetricTrackingFnDataReceiver.accept(PCollectionConsumerRegistry.java:179) at org.apache.beam.fn.harness.BeamFnDataReadRunner.forwardElementToConsumer(BeamFnDataReadRunner.java:204) at org.apache.beam.fn.harness.data.QueueingBeamFnDataClient.drainAndBlock(QueueingBeamFnDataClient.java:106) at org.apache.beam.fn.harness.control.ProcessBundleHandler.processBundle(ProcessBundleHandler.java:295) at 
org.apache.beam.fn.harness.control.BeamFnControlClient.delegateOnInstructionRequestType(BeamFnControlClient.java:173) at org.apache.beam.fn.harness.control.BeamFnControlClient.lambda$processInstructionRequests$0(BeamFnControlClient.java:157) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) {code}
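The stack trace shows the failure happening when the harness builds a user-state key from the encoded element key: a void key encodes to nothing, so `setKey` is handed a null. The sketch below illustrates the underlying contract (in Python for brevity, not the Java harness code): a coder for a "void" type should still yield a definite byte string, and the state-key builder should reject a missing key explicitly. The function names are hypothetical.

```python
# Illustrative sketch of the failure mode. A coder for a "void" type must
# still return a concrete byte string (here b""); otherwise the state key
# built from the encoded element key is undefined and fails later, deep
# inside the state request, as in the NullPointerException above.
def encode_void(_value):
    # VoidCoder-style behavior: every value (always None) encodes to zero bytes.
    return b""

def bag_user_state_key(user_state_id, encoded_key):
    # Reject a missing key up front with a clear message instead of letting
    # it surface as a null-reference error inside the protobuf builder.
    if encoded_key is None:
        raise ValueError("encoded key must not be None; use b'' for void keys")
    return (user_state_id, encoded_key)


assert bag_user_state_key("my-state", encode_void(None)) == ("my-state", b"")
```

The point of the sketch: an empty byte string is a valid, deterministic key, while an absent key is not, so the fix belongs in how the void key is encoded (or validated) before the state key is assembled.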
[jira] [Updated] (BEAM-10153) Java SDK Harness throws NPE for VoidCoder as key coder
[ https://issues.apache.org/jira/browse/BEAM-10153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yichi Zhang updated BEAM-10153: --- Summary: Java SDK Harness throws NPE for VoidCoder as key coder (was: Java SDK Harness throws NPE for VoidCoder for key coder) > Java SDK Harness throws NPE for VoidCoder as key coder > -- > > Key: BEAM-10153 > URL: https://issues.apache.org/jira/browse/BEAM-10153 > Project: Beam > Issue Type: Bug > Components: sdk-java-harness >Reporter: Yichi Zhang >Priority: P2 > > When using VoidCoder as the key coder, NullPointerException will be thrown > for accessing state. > {code:java} > java.lang.NullPointerException > at > org.apache.beam.model.fnexecution.v1.BeamFnApi$StateKey$BagUserState$Builder.setKey(BeamFnApi.java:39390) > at > org.apache.beam.fn.harness.state.FnApiStateAccessor.createBagUserStateKey(FnApiStateAccessor.java:444) > at > org.apache.beam.fn.harness.state.FnApiStateAccessor.bindValue(FnApiStateAccessor.java:195) > at > org.apache.beam.sdk.state.StateSpecs$ValueStateSpec.bind(StateSpecs.java:327) > at > org.apache.beam.sdk.state.StateSpecs$ValueStateSpec.bind(StateSpecs.java:317) > at > org.apache.beam.fn.harness.FnApiDoFnRunner$ProcessBundleContext.state(FnApiDoFnRunner.java:1432) > at > org.apache.beam.sdk.transforms.ParDoTest$TimerTests$TwoTimerTest$1$DoFnInvoker.invokeProcessElement(Unknown > Source) > at > org.apache.beam.fn.harness.FnApiDoFnRunner.processElementForParDo(FnApiDoFnRunner.java:740) > at > org.apache.beam.fn.harness.FnApiDoFnRunner.access$700(FnApiDoFnRunner.java:132) > at > org.apache.beam.fn.harness.FnApiDoFnRunner$Factory.lambda$createRunnerForPTransform$1(FnApiDoFnRunner.java:203) > at > org.apache.beam.fn.harness.data.PCollectionConsumerRegistry$MetricTrackingFnDataReceiver.accept(PCollectionConsumerRegistry.java:216) > at > org.apache.beam.fn.harness.data.PCollectionConsumerRegistry$MetricTrackingFnDataReceiver.accept(PCollectionConsumerRegistry.java:179) > at > 
org.apache.beam.fn.harness.BeamFnDataReadRunner.forwardElementToConsumer(BeamFnDataReadRunner.java:204) > at > org.apache.beam.fn.harness.data.QueueingBeamFnDataClient.drainAndBlock(QueueingBeamFnDataClient.java:106) > at > org.apache.beam.fn.harness.control.ProcessBundleHandler.processBundle(ProcessBundleHandler.java:295) > at > org.apache.beam.fn.harness.control.BeamFnControlClient.delegateOnInstructionRequestType(BeamFnControlClient.java:173) > at > org.apache.beam.fn.harness.control.BeamFnControlClient.lambda$processInstructionRequests$0(BeamFnControlClient.java:157) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (BEAM-10112) Add python sdk state and timer examples to website
[ https://issues.apache.org/jira/browse/BEAM-10112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yichi Zhang reassigned BEAM-10112: -- Assignee: Yichi Zhang > Add python sdk state and timer examples to website > -- > > Key: BEAM-10112 > URL: https://issues.apache.org/jira/browse/BEAM-10112 > Project: Beam > Issue Type: Improvement > Components: website >Reporter: Yichi Zhang >Assignee: Yichi Zhang >Priority: P2 >
[jira] [Created] (BEAM-10112) Add python sdk state and timer examples to website
Yichi Zhang created BEAM-10112: -- Summary: Add python sdk state and timer examples to website Key: BEAM-10112 URL: https://issues.apache.org/jira/browse/BEAM-10112 Project: Beam Issue Type: Improvement Components: website Reporter: Yichi Zhang
[jira] [Assigned] (BEAM-9602) Support Dynamic Timer in Python SDK over FnApi
[ https://issues.apache.org/jira/browse/BEAM-9602?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yichi Zhang reassigned BEAM-9602: - Assignee: Yichi Zhang > Support Dynamic Timer in Python SDK over FnApi > -- > > Key: BEAM-9602 > URL: https://issues.apache.org/jira/browse/BEAM-9602 > Project: Beam > Issue Type: New Feature > Components: sdk-py-harness >Reporter: Boyuan Zhang >Assignee: Yichi Zhang >Priority: P2 >
[jira] [Commented] (BEAM-10002) Mongo cursor timeout leads to CursorNotFound error
[ https://issues.apache.org/jira/browse/BEAM-10002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17108429#comment-17108429 ] Yichi Zhang commented on BEAM-10002: Thanks for the issue report [~corvin]. Would you also like to contribute a PR to address it? I'm not sure when I'll have time to work on this at the moment. > Mongo cursor timeout leads to CursorNotFound error > -- > > Key: BEAM-10002 > URL: https://issues.apache.org/jira/browse/BEAM-10002 > Project: Beam > Issue Type: Bug > Components: io-py-mongodb >Affects Versions: 2.20.0 >Reporter: Corvin Deboeser >Assignee: Yichi Zhang >Priority: Major > > If some work items take a lot of processing time and the cursor of a bundle > is not queried for too long, then mongodb will time out the cursor, which > results in > {code:java} > pymongo.errors.CursorNotFound: cursor id ... not found > {code} >
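One common mitigation for server-side cursor timeouts like the one reported here is to stop holding a single cursor open across long per-document work and instead re-issue the query from the last `_id` seen. The sketch below simulates that resume pattern over an in-memory, `_id`-sorted collection; the function name, batch size, and query shape are illustrative assumptions, not the actual Beam MongoDB IO fix.

```python
# Hedged sketch of a resume-from-last-_id read pattern. Simulated over an
# in-memory list sorted by _id; in real code each batch would come from a
# fresh query like find({"_id": {"$gt": last_id}}).sort("_id").limit(n),
# so no single server-side cursor outlives the per-document processing.
def read_resumable(docs, batch_size=2):
    """Yield documents in _id order, re-"querying" after each batch."""
    last_id = None
    while True:
        batch = [d for d in docs if last_id is None or d["_id"] > last_id][:batch_size]
        if not batch:
            return
        for doc in batch:
            yield doc  # arbitrarily slow processing here cannot expire a cursor
        last_id = batch[-1]["_id"]


docs = [{"_id": i} for i in range(5)]
assert [d["_id"] for d in read_resumable(docs)] == [0, 1, 2, 3, 4]
```

The trade-off is extra round trips (one query per batch) in exchange for never depending on a cursor's server-side idle timeout.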
[jira] [Created] (BEAM-9940) Dataflow runner not setting timer family specs for TimerDeclaration annotation
Yichi Zhang created BEAM-9940: - Summary: Dataflow runner not setting timer family specs for TimerDeclaration annotation Key: BEAM-9940 URL: https://issues.apache.org/jira/browse/BEAM-9940 Project: Beam Issue Type: Bug Components: runner-dataflow Reporter: Yichi Zhang
[jira] [Closed] (BEAM-9904) python nexmark query pipelines
[ https://issues.apache.org/jira/browse/BEAM-9904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yichi Zhang closed BEAM-9904. - Fix Version/s: Not applicable Resolution: Abandoned > python nexmark query pipelines > -- > > Key: BEAM-9904 > URL: https://issues.apache.org/jira/browse/BEAM-9904 > Project: Beam > Issue Type: Sub-task > Components: benchmarking-py, testing-nexmark >Reporter: Yichi Zhang >Priority: Major > Fix For: Not applicable > > > Create python beam pipelines for the data analytic queries in NexMark suite: > > ||query||corresponding java code|| > |*Query 0 (not part of original NexMark): > Pass-through.*|[query0.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query0.java]| > |*Query 1: What are the bid values in Euro's? (Currency > Conversion)*|[query1.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query1.java]| > |*Query 2: Find bids with specific auction ids and show their bid > price.*|[query2.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query2.java]| > |*Query 3: Who is selling in particular US > states?*|[query3.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query3.java]| > |*Query 4: What is the average selling price for each auction > category?*|[query4.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query4.java]| > |*Query 5: Which auctions have seen the most bids in the last > period?*|[query5.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query5.java]| > | | | > -- This message was sent by Atlassian Jira (v8.3.4#803005)
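As a concrete illustration of the query table above, NexMark Query 2 (find bids with specific auction ids and show their bid price) reduces to a filter plus a projection. The sketch below is a plain-Python counterpart using a simplified, hypothetical bid schema; in a Beam Python pipeline it would be a `Filter` followed by a `Map` over the bid PCollection.

```python
# Illustrative Python counterpart of NexMark Query 2: filter bids whose
# auction id is in a given set and project out (auction, price). The dict
# schema is a hypothetical simplification of the NexMark Bid model.
def query2(bids, auction_ids):
    return [
        {"auction": b["auction"], "price": b["price"]}
        for b in bids
        if b["auction"] in auction_ids
    ]


bids = [
    {"auction": 1, "bidder": 7, "price": 100},
    {"auction": 2, "bidder": 8, "price": 250},
    {"auction": 1, "bidder": 9, "price": 300},
]
assert query2(bids, {1}) == [
    {"auction": 1, "price": 100},
    {"auction": 1, "price": 300},
]
```

Query 2 is stateless and windowless, which is why it (along with Query 0 and Query 1) is usually the first target when porting the suite to a new SDK.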
[jira] [Closed] (BEAM-9905) python nexmark benchmark suite metrics
[ https://issues.apache.org/jira/browse/BEAM-9905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yichi Zhang closed BEAM-9905. - Fix Version/s: Not applicable Resolution: Abandoned > python nexmark benchmark suite metrics > -- > > Key: BEAM-9905 > URL: https://issues.apache.org/jira/browse/BEAM-9905 > Project: Beam > Issue Type: Sub-task > Components: benchmarking-py, testing-nexmark >Reporter: Yichi Zhang >Priority: Major > Fix For: Not applicable > > > Ensure we can collect output metrics from the query pipelines with the > jenkins test infra such as: > execution time, processing event rate, number of results, also invalid > auctions/bids, …
[jira] [Closed] (BEAM-9902) python nexmark benchmark suite event generator
[ https://issues.apache.org/jira/browse/BEAM-9902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yichi Zhang closed BEAM-9902. - Fix Version/s: Not applicable Resolution: Abandoned > python nexmark benchmark suite event generator > -- > > Key: BEAM-9902 > URL: https://issues.apache.org/jira/browse/BEAM-9902 > Project: Beam > Issue Type: Sub-task > Components: benchmarking-py, testing-nexmark >Reporter: Yichi Zhang >Priority: Major > Fix For: Not applicable > > > Implement the event generator in python to create the source which the query > pipeline will read from. > For reference, the java source generators can be found in > [https://github.com/apache/beam/tree/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/sources]
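The generator described above derives every event deterministically from a sequence number, interleaving person, auction, and bid events at fixed proportions so runs are reproducible. The sketch below shows only that interleaving step; the 1:3:46 ratio matches the proportions used by the Java generator as far as I can tell, but treat the exact numbers (and the function name) as assumptions.

```python
# Hedged sketch of a deterministic NexMark-style event interleaving: the
# kind of each event is a pure function of its sequence number, using fixed
# person:auction:bid proportions (1:3:46 here; an assumption).
PERSON, AUCTION, BID = "person", "auction", "bid"
PROPORTIONS = [(PERSON, 1), (AUCTION, 3), (BID, 46)]
TOTAL = sum(n for _, n in PROPORTIONS)  # 50 events per full cycle

def event_kind(seq):
    # Position within the current 50-event cycle decides the event kind.
    r = seq % TOTAL
    for kind, n in PROPORTIONS:
        if r < n:
            return kind
        r -= n


assert event_kind(0) == PERSON
assert event_kind(1) == AUCTION
assert event_kind(4) == BID
# One full cycle contains exactly 1 person, 3 auctions, and 46 bids.
assert sum(event_kind(i) == BID for i in range(50)) == 46
```

Because the kind (and, in the full generator, the event payload) is derived from `seq` alone, any worker can regenerate any slice of the stream independently, which is what makes the source splittable.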
[jira] [Closed] (BEAM-9903) python nexmark benchmark suite launcher
[ https://issues.apache.org/jira/browse/BEAM-9903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yichi Zhang closed BEAM-9903. - Fix Version/s: Not applicable Resolution: Abandoned > python nexmark benchmark suite launcher > --- > > Key: BEAM-9903 > URL: https://issues.apache.org/jira/browse/BEAM-9903 > Project: Beam > Issue Type: Sub-task > Components: benchmarking-py, testing-nexmark >Reporter: Yichi Zhang >Priority: Major > Fix For: Not applicable > > > Create the launcher as well as the configurations for the benchmark entry program. > For reference: > the java launcher: > [https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/NexmarkLauncher.java] > and its configuration: > [https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/NexmarkOptions.java]
[jira] [Closed] (BEAM-9901) Beam python nexmark benchmark suite
[ https://issues.apache.org/jira/browse/BEAM-9901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yichi Zhang closed BEAM-9901. - Fix Version/s: Not applicable Resolution: Resolved > Beam python nexmark benchmark suite > --- > > Key: BEAM-9901 > URL: https://issues.apache.org/jira/browse/BEAM-9901 > Project: Beam > Issue Type: Task > Components: benchmarking-py, testing-nexmark >Reporter: Yichi Zhang >Priority: Major > Fix For: Not applicable > > > Nexmark is a suite of queries (pipelines) used to measure performance and > non-regression in Beam. Currently it exists in java sdk: > [https://github.com/apache/beam/tree/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark] > In this project we would like to create the nexmark benchmark suite in python > sdk equivalent to what BEAM has for java. This allows us to determine > performance impact on pull requests for python pipelines.
[jira] [Closed] (BEAM-9906) Beam python nexmark benchmark starter task
[ https://issues.apache.org/jira/browse/BEAM-9906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yichi Zhang closed BEAM-9906. - Fix Version/s: Not applicable Resolution: Abandoned > Beam python nexmark benchmark starter task > -- > > Key: BEAM-9906 > URL: https://issues.apache.org/jira/browse/BEAM-9906 > Project: Beam > Issue Type: Sub-task > Components: testing-nexmark >Reporter: Yichi Zhang >Priority: Major > Fix For: Not applicable > > > For this starter task, the assignee is expected to > * get familiar with the beam model by browsing through the beam website and > programming guide: https://beam.apache.org/ > * set up the development environment: > https://cwiki.apache.org/confluence/display/BEAM/Developer+Guides > * also get some familiarity with the NexMark suite: > http://datalab.cs.pdx.edu/niagara/pstream/nexmark.pdf > * it is also recommended to spend some time to get familiar with the git > workflow
[jira] [Commented] (BEAM-9901) Beam python nexmark benchmark suite
[ https://issues.apache.org/jira/browse/BEAM-9901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101856#comment-17101856 ] Yichi Zhang commented on BEAM-9901: --- Ok, I guess we can use BEAM-8258 for the summer intern as well. I'll add corresponding dataflow components. > Beam python nexmark benchmark suite > --- > > Key: BEAM-9901 > URL: https://issues.apache.org/jira/browse/BEAM-9901 > Project: Beam > Issue Type: Task > Components: benchmarking-py, testing-nexmark >Reporter: Yichi Zhang >Priority: Major > > Nexmark is a suite of queries (pipelines) used to measure performance and > non-regression in Beam. Currently it exists in java sdk: > [https://github.com/apache/beam/tree/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark] > In this project we would like to create the nexmark benchmark suite in python > sdk equivalent to what BEAM has for java. This allows us to determine > performance impact on pull requests for python pipelines.
[jira] [Commented] (BEAM-9905) python nexmark benchmark suite metrics
[ https://issues.apache.org/jira/browse/BEAM-9905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101848#comment-17101848 ] Yichi Zhang commented on BEAM-9905: --- [~iemejia] It is more about being able to run the queries continuously with the test infra; I rephrased the description a little bit. > python nexmark benchmark suite metrics > -- > > Key: BEAM-9905 > URL: https://issues.apache.org/jira/browse/BEAM-9905 > Project: Beam > Issue Type: Sub-task > Components: benchmarking-py, testing-nexmark >Reporter: Yichi Zhang >Priority: Major > > Ensure we can collect output metrics from the query pipelines with the > jenkins test infra such as: > execution time, processing event rate, number of results, also invalid > auctions/bids, …
[jira] [Updated] (BEAM-9905) python nexmark benchmark suite metrics
[ https://issues.apache.org/jira/browse/BEAM-9905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yichi Zhang updated BEAM-9905: -- Description: Ensure we can collect output metrics from the query pipelines with the jenkins test infra such as: execution time, processing event rate, number of results, also invalid auctions/bids, … was: Ensure we can collect output metrics from the query pipelines such as: execution time, processing event rate, number of results, also invalid auctions/bids, … > python nexmark benchmark suite metrics > -- > > Key: BEAM-9905 > URL: https://issues.apache.org/jira/browse/BEAM-9905 > Project: Beam > Issue Type: Sub-task > Components: benchmarking-py, testing-nexmark >Reporter: Yichi Zhang >Priority: Major > > Ensure we can collect output metrics from the query pipelines with the > jenkins test infra such as: > execution time, processing event rate, number of results, also invalid > auctions/bids, … -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-9901) Beam python nexmark benchmark suite
[ https://issues.apache.org/jira/browse/BEAM-9901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101845#comment-17101845 ] Yichi Zhang commented on BEAM-9901: --- [~iemejia] We are creating these tickets for the Google summer intern program. I think it might be helpful to limit the scope and create these tickets under one umbrella ticket to avoid distraction. We can deduplicate the tickets later, depending on how much progress has been made. WDYT? > Beam python nexmark benchmark suite > --- > > Key: BEAM-9901 > URL: https://issues.apache.org/jira/browse/BEAM-9901 > Project: Beam > Issue Type: Task > Components: benchmarking-py, testing-nexmark >Reporter: Yichi Zhang >Priority: Major > > Nexmark is a suite of queries (pipelines) used to measure performance and > non-regression in Beam. Currently it exists in java sdk: > [https://github.com/apache/beam/tree/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark] > In this project we would like to create the nexmark benchmark suite in python > sdk equivalent to what BEAM has for java. This allows us to determine > performance impact on pull requests for python pipelines.
[jira] [Updated] (BEAM-9906) Beam python nexmark benchmark starter task
[ https://issues.apache.org/jira/browse/BEAM-9906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yichi Zhang updated BEAM-9906: -- Description: For this starter task, the assignee is expected to * get familiar with the beam model by browsing through the beam website and programming guide: https://beam.apache.org/ * set up the development environment: https://cwiki.apache.org/confluence/display/BEAM/Developer+Guides * also get some familiarity with the NexMark suite: http://datalab.cs.pdx.edu/niagara/pstream/nexmark.pdf * it is also recommended to spend some time to get familiar with the git workflow was: For this starter task, the assignee is expected to * get familiar with the beam model by browsing through the beam website and programming guide: https://beam.apache.org/ * set up the development environment: https://cwiki.apache.org/confluence/display/BEAM/Developer+Guides * also get some familiarity with the NexMark suite: http://datalab.cs.pdx.edu/niagara/pstream/nexmark.pdf * if not familiar with GitHub and git, it is also recommended to spend some time to get familiar with the git workflow > Beam python nexmark benchmark starter task > --- > > Key: BEAM-9906 > URL: https://issues.apache.org/jira/browse/BEAM-9906 > Project: Beam > Issue Type: Sub-task > Components: testing-nexmark >Reporter: Yichi Zhang >Priority: Major > > For this starter task, the assignee is expected to > * get familiar with the beam model by browsing through the beam website and > programming guide: https://beam.apache.org/ > * set up the development environment: > https://cwiki.apache.org/confluence/display/BEAM/Developer+Guides > * also get some familiarity with the NexMark suite: > http://datalab.cs.pdx.edu/niagara/pstream/nexmark.pdf > * it is also recommended to spend some time to get familiar with the git > workflow
[jira] [Created] (BEAM-9906) Beam python nexmark benchmark starter task
Yichi Zhang created BEAM-9906: - Summary: Beam python nexmark benchmark starter task Key: BEAM-9906 URL: https://issues.apache.org/jira/browse/BEAM-9906 Project: Beam Issue Type: Sub-task Components: testing-nexmark Reporter: Yichi Zhang For this starter task, the assignee is expected to * get familiar with the beam model by browsing through the beam website and programming guide: https://beam.apache.org/ * set up the development environment: https://cwiki.apache.org/confluence/display/BEAM/Developer+Guides * also get some familiarity with the NexMark suite: http://datalab.cs.pdx.edu/niagara/pstream/nexmark.pdf * if not familiar with GitHub and git, it is also recommended to spend some time to get familiar with the git workflow
[jira] [Created] (BEAM-9905) python nexmark benchmark suite metrics
Yichi Zhang created BEAM-9905: - Summary: python nexmark benchmark suite metrics Key: BEAM-9905 URL: https://issues.apache.org/jira/browse/BEAM-9905 Project: Beam Issue Type: Sub-task Components: benchmarking-py, testing-nexmark Reporter: Yichi Zhang Ensure we can collect output metrics from the query pipelines such as: execution time, processing event rate, number of results, also invalid auctions/bids, …
[jira] [Updated] (BEAM-9904) python nexmark query pipelines
[ https://issues.apache.org/jira/browse/BEAM-9904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yichi Zhang updated BEAM-9904: -- Description: Create python beam pipelines for the data analytic queries in NexMark suite: ||query||corresponding java code|| |*Query 0 (not part of original NexMark): Pass-through.*|[query0.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query0.java]| |*Query 1: What are the bid values in Euro's? (Currency Conversion)*|[query1.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query01.java]| |*Query 2: Find bids with specific auction ids and show their bid price.*|[query2.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query02.java]| |*Query 3: Who is selling in particular US states?*|[query3.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query03.java]| |*Query 4: What is the average selling price for each auction category?*|[query4.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query04.java]| |*Query 5: Which auctions have seen the most bids in the last period?*|[query5.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query05.java]| | | | was: ||query||java code|| |*Query 0 (not part of original NexMark): Pass-through.*|[query0.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query0.java]| |*Query 1: What are the bid values in Euro's? 
(Currency Conversion)*|[query1.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query01.java]| |*Query 2: Find bids with specific auction ids and show their bid price.*|[query2.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query02.java]| |*Query 3: Who is selling in particular US states?*|[query3.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query03.java]| |*Query 4: What is the average selling price for each auction category?*|[query4.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query04.java]| |*Query 5: Which auctions have seen the most bids in the last period?*|[query5.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query05.java]| | | | > python nexmark query pipelines > -- > > Key: BEAM-9904 > URL: https://issues.apache.org/jira/browse/BEAM-9904 > Project: Beam > Issue Type: Sub-task > Components: benchmarking-py, testing-nexmark >Reporter: Yichi Zhang >Priority: Major > > Create python beam pipelines for the data analytic queries in NexMark suite: > > ||query||corresponding java code|| > |*Query 0 (not part of original NexMark): > Pass-through.*|[query0.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query0.java]| > |*Query 1: What are the bid values in Euro's? 
(Currency > Conversion)*|[query1.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query01.java]| > |*Query 2: Find bids with specific auction ids and show their bid > price.*|[query2.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query02.java]| > |*Query 3: Who is selling in particular US > states?*|[query3.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query03.java]| > |*Query 4: What is the average selling price for each auction > category?*|[query4.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query04.java]| > |*Query 5: Which auctions have seen the most bids in the last > period?*|[query5.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query05.java]| > | | | > -- This message was sent by Atlassian Jira (v8.3.4#803005)
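The query descriptions in the table above are simple enough to express as plain Python over dictionaries. A minimal, hypothetical sketch of the Query 1 (currency conversion) and Query 2 (auction filter) logic, independent of the Beam API — the field names and the `USD_TO_EUR` rate are illustrative assumptions, not taken from the Java suite:

```python
# Hypothetical sketch of NexMark Query 1 and Query 2 logic as plain Python.
# Field names ("auction", "bidder", "price") and the conversion rate are
# assumptions for illustration, not the real Java suite constants.
USD_TO_EUR = 0.89  # assumed fixed exchange rate

def query1(bids):
    """Query 1: convert each bid's price from USD to EUR."""
    return [dict(bid, price=bid["price"] * USD_TO_EUR) for bid in bids]

def query2(bids, auction_ids):
    """Query 2: keep bids for specific auction ids, report (auction, price)."""
    wanted = set(auction_ids)
    return [(b["auction"], b["price"]) for b in bids if b["auction"] in wanted]

bids = [
    {"auction": 1, "bidder": 7, "price": 100.0},
    {"auction": 2, "bidder": 8, "price": 50.0},
]
print(query1(bids))        # prices scaled by USD_TO_EUR
print(query2(bids, [2]))   # [(2, 50.0)]
```

In a real Beam pipeline each function body would become a `Map`/`Filter` transform over a PCollection of bid events, but the per-element logic stays this small.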
[jira] [Updated] (BEAM-9904) python nexmark query pipelines
[ https://issues.apache.org/jira/browse/BEAM-9904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yichi Zhang updated BEAM-9904: -- Description: Create python beam pipelines for the data analytic queries in NexMark suite: ||query||corresponding java code|| |*Query 0 (not part of original NexMark): Pass-through.*|[query0.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query0.java]| |*Query 1: What are the bid values in Euro's? (Currency Conversion)*|[query1.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query1.java]| |*Query 2: Find bids with specific auction ids and show their bid price.*|[query2.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query2.java]| |*Query 3: Who is selling in particular US states?*|[query3.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query3.java]| |*Query 4: What is the average selling price for each auction category?*|[query4.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query4.java]| |*Query 5: Which auctions have seen the most bids in the last period?*|[query5.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query5.java]| | | | was: Create python beam pipelines for the data analytic queries in NexMark suite: ||query||corresponding java code|| |*Query 0 (not part of original NexMark): Pass-through.*|[query0.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query0.java]| |*Query 1: What are the bid values in Euro's? 
(Currency Conversion)*|[query1.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query01.java]| |*Query 2: Find bids with specific auction ids and show their bid price.*|[query2.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query02.java]| |*Query 3: Who is selling in particular US states?*|[query3.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query03.java]| |*Query 4: What is the average selling price for each auction category?*|[query4.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query04.java]| |*Query 5: Which auctions have seen the most bids in the last period?*|[query5.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query05.java]| | | | > python nexmark query pipelines > -- > > Key: BEAM-9904 > URL: https://issues.apache.org/jira/browse/BEAM-9904 > Project: Beam > Issue Type: Sub-task > Components: benchmarking-py, testing-nexmark >Reporter: Yichi Zhang >Priority: Major > > Create python beam pipelines for the data analytic queries in NexMark suite: > > ||query||corresponding java code|| > |*Query 0 (not part of original NexMark): > Pass-through.*|[query0.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query0.java]| > |*Query 1: What are the bid values in Euro's? 
(Currency > Conversion)*|[query1.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query1.java]| > |*Query 2: Find bids with specific auction ids and show their bid > price.*|[query2.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query2.java]| > |*Query 3: Who is selling in particular US > states?*|[query3.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query3.java]| > |*Query 4: What is the average selling price for each auction > category?*|[query4.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query4.java]| > |*Query 5: Which auctions have seen the most bids in the last > period?*|[query5.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query5.java]| > | | | > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-9904) python nexmark query pipelines
[ https://issues.apache.org/jira/browse/BEAM-9904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yichi Zhang updated BEAM-9904: -- Description: ||query||java code|| |*Query 0 (not part of original NexMark): Pass-through.*|[query0.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query0.java]| |*Query 1: What are the bid values in Euro's? (Currency Conversion)*|[query1.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query01.java]| |*Query 2: Find bids with specific auction ids and show their bid price.*|[query2.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query02.java]| |*Query 3: Who is selling in particular US states?*|[query3.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query03.java]| |*Query 4: What is the average selling price for each auction category?*|[query4.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query04.java]| |*Query 5: Which auctions have seen the most bids in the last period?*|[query5.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query05.java]| | | | was: ||query||java code|| |*Query 0 (not part of original NexMark): Pass-through.*|[query0.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query0.java]| |*Query 1: What are the bid values in Euro's? 
(Currency Conversion)*| [query1.java\|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query01.java]| |*Query 2: Find bids with specific auction ids and show their bid price.*| [query2.java\|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query02.java]| |*Query 3: Who is selling in particular US states?*| [query3.java\|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query03.java]| |*Query 4: What is the average selling price for each auction category?*| [query4.java\|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query04.java]| |*Query 5: Which auctions have seen the most bids in the last period?*| [query5.java\|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query05.java]| | | | > python nexmark query pipelines > -- > > Key: BEAM-9904 > URL: https://issues.apache.org/jira/browse/BEAM-9904 > Project: Beam > Issue Type: Sub-task > Components: benchmarking-py, testing-nexmark >Reporter: Yichi Zhang >Priority: Major > > > ||query||java code|| > |*Query 0 (not part of original NexMark): > Pass-through.*|[query0.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query0.java]| > |*Query 1: What are the bid values in Euro's? 
(Currency > Conversion)*|[query1.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query01.java]| > |*Query 2: Find bids with specific auction ids and show their bid > price.*|[query2.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query02.java]| > |*Query 3: Who is selling in particular US > states?*|[query3.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query03.java]| > |*Query 4: What is the average selling price for each auction > category?*|[query4.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query04.java]| > |*Query 5: Which auctions have seen the most bids in the last > period?*|[query5.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query05.java]| > | | | > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-9904) python nexmark query pipelines
[ https://issues.apache.org/jira/browse/BEAM-9904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yichi Zhang updated BEAM-9904: -- Description: ||query||java code|| |*Query 0 (not part of original NexMark): Pass-through.*|[query0.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query0.java]| |*Query 1: What are the bid values in Euro's? (Currency Conversion)*| [query1.java\|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query01.java]| |*Query 2: Find bids with specific auction ids and show their bid price.*| [query2.java\|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query02.java]| |*Query 3: Who is selling in particular US states?*| [query3.java\|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query03.java]| |*Query 4: What is the average selling price for each auction category?*| [query4.java\|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query04.java]| |*Query 5: Which auctions have seen the most bids in the last period?*| [query5.java\|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query05.java]| | | | was: ||query||java code|| |*Query 0 (not part of original NexMark): Pass-through.*|[query0.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query0.java]| |*Query 1: What are the bid values in Euro's? 
(Currency Conversion)*| | |*Query 2: Find bids with specific auction ids and show their bid price.*| | |*Query 3: Who is selling in particular US states?*| | |*Query 4: What is the average selling price for each auction category?*| | |*Query 5: Which auctions have seen the most bids in the last period?*| | | | | > python nexmark query pipelines > -- > > Key: BEAM-9904 > URL: https://issues.apache.org/jira/browse/BEAM-9904 > Project: Beam > Issue Type: Sub-task > Components: benchmarking-py, testing-nexmark >Reporter: Yichi Zhang >Priority: Major > > > ||query||java code|| > |*Query 0 (not part of original NexMark): > Pass-through.*|[query0.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query0.java]| > |*Query 1: What are the bid values in Euro's? (Currency Conversion)*| > [query1.java\|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query01.java]| > |*Query 2: Find bids with specific auction ids and show their bid price.*| > [query2.java\|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query02.java]| > |*Query 3: Who is selling in particular US states?*| > [query3.java\|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query03.java]| > |*Query 4: What is the average selling price for each auction category?*| > [query4.java\|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query04.java]| > |*Query 5: Which auctions have seen the most bids in the last period?*| > [query5.java\|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query05.java]| > | | | > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-9904) python nexmark query pipelines
[ https://issues.apache.org/jira/browse/BEAM-9904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yichi Zhang updated BEAM-9904: -- Description: ||query||java code|| |*Query 0 (not part of original NexMark): Pass-through.*|[#query0.java\|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query0.java]| |*Query 1: What are the bid values in Euro's? (Currency Conversion)*| | |*Query 2: Find bids with specific auction ids and show their bid price.*| | |*Query 3: Who is selling in particular US states?*| | |*Query 4: What is the average selling price for each auction category?*| | |*Query 5: Which auctions have seen the most bids in the last period?*| | | | | was: ||query||java code|| |*Query 0 (not part of original NexMark): Pass-through.*|[query0.java\|[https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query0.java]]| |*Query 1: What are the bid values in Euro's? (Currency Conversion)*| | |*Query 2: Find bids with specific auction ids and show their bid price.*| | |*Query 3: Who is selling in particular US states?*| | |*Query 4: What is the average selling price for each auction category?*| | |*Query 5: Which auctions have seen the most bids in the last period?*| | | | | > python nexmark query pipelines > -- > > Key: BEAM-9904 > URL: https://issues.apache.org/jira/browse/BEAM-9904 > Project: Beam > Issue Type: Sub-task > Components: benchmarking-py, testing-nexmark >Reporter: Yichi Zhang >Priority: Major > > > ||query||java code|| > |*Query 0 (not part of original NexMark): > Pass-through.*|[#query0.java\|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query0.java]| > |*Query 1: What are the bid values in Euro's? 
(Currency Conversion)*| | > |*Query 2: Find bids with specific auction ids and show their bid price.*| | > |*Query 3: Who is selling in particular US states?*| | > |*Query 4: What is the average selling price for each auction category?*| | > |*Query 5: Which auctions have seen the most bids in the last period?*| | > | | | > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-9904) python nexmark query pipelines
[ https://issues.apache.org/jira/browse/BEAM-9904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yichi Zhang updated BEAM-9904: -- Description: ||query||java code|| |*Query 0 (not part of original NexMark): Pass-through.*|[query0.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query0.java]| |*Query 1: What are the bid values in Euro's? (Currency Conversion)*| | |*Query 2: Find bids with specific auction ids and show their bid price.*| | |*Query 3: Who is selling in particular US states?*| | |*Query 4: What is the average selling price for each auction category?*| | |*Query 5: Which auctions have seen the most bids in the last period?*| | | | | was: ||query||java code|| |*Query 0 (not part of original NexMark): Pass-through.*|[#query0.java\|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query0.java]| |*Query 1: What are the bid values in Euro's? (Currency Conversion)*| | |*Query 2: Find bids with specific auction ids and show their bid price.*| | |*Query 3: Who is selling in particular US states?*| | |*Query 4: What is the average selling price for each auction category?*| | |*Query 5: Which auctions have seen the most bids in the last period?*| | | | | > python nexmark query pipelines > -- > > Key: BEAM-9904 > URL: https://issues.apache.org/jira/browse/BEAM-9904 > Project: Beam > Issue Type: Sub-task > Components: benchmarking-py, testing-nexmark >Reporter: Yichi Zhang >Priority: Major > > > ||query||java code|| > |*Query 0 (not part of original NexMark): > Pass-through.*|[query0.java|https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query0.java]| > |*Query 1: What are the bid values in Euro's? 
(Currency Conversion)*| | > |*Query 2: Find bids with specific auction ids and show their bid price.*| | > |*Query 3: Who is selling in particular US states?*| | > |*Query 4: What is the average selling price for each auction category?*| | > |*Query 5: Which auctions have seen the most bids in the last period?*| | > | | | > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (BEAM-9904) python nexmark query pipelines
Yichi Zhang created BEAM-9904: - Summary: python nexmark query pipelines Key: BEAM-9904 URL: https://issues.apache.org/jira/browse/BEAM-9904 Project: Beam Issue Type: Sub-task Components: benchmarking-py, testing-nexmark Reporter: Yichi Zhang ||query||java code|| |*Query 0 (not part of original NexMark): Pass-through.*|[query0.java\|[https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/queries/Query0.java]]| |*Query 1: What are the bid values in Euro's? (Currency Conversion)*| | |*Query 2: Find bids with specific auction ids and show their bid price.*| | |*Query 3: Who is selling in particular US states?*| | |*Query 4: What is the average selling price for each auction category?*| | |*Query 5: Which auctions have seen the most bids in the last period?*| | | | | -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-9901) Beam python nexmark benchmark suite
[ https://issues.apache.org/jira/browse/BEAM-9901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yichi Zhang updated BEAM-9901: -- Component/s: (was: sdk-py-core) > Beam python nexmark benchmark suite > --- > > Key: BEAM-9901 > URL: https://issues.apache.org/jira/browse/BEAM-9901 > Project: Beam > Issue Type: Task > Components: benchmarking-py, testing-nexmark >Reporter: Yichi Zhang >Priority: Major > > Nexmark is a suite of queries (pipelines) used to measure performance and > non-regression in Beam. Currently it exists in java sdk: > [https://github.com/apache/beam/tree/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark] > In this project we would like to create the nexmark benchmark suite in python > sdk equivalent to what BEAM has for java. This allows us to determine > performance impact on pull requests for python pipelines. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-9902) python nexmark benchmark suite event generator
[ https://issues.apache.org/jira/browse/BEAM-9902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yichi Zhang updated BEAM-9902: -- Component/s: (was: sdk-py-core) > python nexmark benchmark suite event generator > -- > > Key: BEAM-9902 > URL: https://issues.apache.org/jira/browse/BEAM-9902 > Project: Beam > Issue Type: Sub-task > Components: benchmarking-py, testing-nexmark >Reporter: Yichi Zhang >Priority: Major > > Implement the event generator in python to create the source which the query > pipeline will read from. > For reference, the java source generators can be found in > [https://github.com/apache/beam/tree/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/sources] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (BEAM-9903) python nexmark benchmark suite launcher
Yichi Zhang created BEAM-9903: - Summary: python nexmark benchmark suite launcher Key: BEAM-9903 URL: https://issues.apache.org/jira/browse/BEAM-9903 Project: Beam Issue Type: Sub-task Components: benchmarking-py, testing-nexmark Reporter: Yichi Zhang Create the launcher as well as the configurations for benchmark entry program. For reference: the java launcher: [https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/NexmarkLauncher.java] and its configuration: [https://github.com/apache/beam/blob/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/NexmarkOptions.java] -- This message was sent by Atlassian Jira (v8.3.4#803005)
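The launcher described above needs an entry program that parses benchmark configuration, analogous to the Java NexmarkOptions. A hedged sketch of what such an entry point could look like in Python — the option names mirror the Java side but are assumptions, not the real API:

```python
# Hypothetical sketch of a Python nexmark launcher entry point.
# Option names are assumptions modeled on the Java NexmarkOptions.
import argparse

def parse_nexmark_args(argv):
    parser = argparse.ArgumentParser(description="Run a nexmark query pipeline.")
    parser.add_argument("--query", type=int, required=True,
                        help="Which nexmark query (0-5) to run.")
    parser.add_argument("--num_events", type=int, default=100000,
                        help="How many synthetic events to generate.")
    parser.add_argument("--streaming", action="store_true",
                        help="Run in streaming mode instead of batch.")
    return parser.parse_args(argv)

args = parse_nexmark_args(["--query", "3", "--num_events", "5000"])
print(args.query, args.num_events, args.streaming)  # 3 5000 False
```

The actual launcher would pass the remaining unparsed arguments through to the Beam pipeline options so runner selection works the same way as for any other pipeline.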
[jira] [Created] (BEAM-9902) python nexmark benchmark suite event generator
Yichi Zhang created BEAM-9902: - Summary: python nexmark benchmark suite event generator Key: BEAM-9902 URL: https://issues.apache.org/jira/browse/BEAM-9902 Project: Beam Issue Type: Sub-task Components: benchmarking-py, sdk-py-core, testing-nexmark Reporter: Yichi Zhang Implement the event generator in python to create the source which the query pipeline will read from. For reference, the java source generators can be found in [https://github.com/apache/beam/tree/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark/sources] -- This message was sent by Atlassian Jira (v8.3.4#803005)
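The Java source generators referenced above interleave person, auction, and bid events deterministically so that runs are reproducible. A stdlib-only sketch of that shape — the 1:3:46 interleave ratio and the event fields here are assumptions for illustration:

```python
# Hypothetical sketch of a nexmark-style deterministic event generator.
# The 1:3:46 person:auction:bid interleave ratio is an assumption for
# illustration, as are the event fields.
import random

EVENT_KINDS = ["person"] * 1 + ["auction"] * 3 + ["bid"] * 46

def generate_events(n, seed=0):
    """Yield n deterministic pseudo-events, usable as a bounded source."""
    rng = random.Random(seed)  # fixed seed keeps runs reproducible
    for event_id in range(n):
        kind = EVENT_KINDS[event_id % len(EVENT_KINDS)]
        yield {"id": event_id, "kind": kind, "value": rng.randrange(1000)}

events = list(generate_events(100))
print(sum(1 for e in events if e["kind"] == "bid"))  # 92
```

Because the generator is a pure function of `(n, seed)`, the same event stream can be regenerated on every benchmark run, which is what makes query results comparable across runs.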
[jira] [Created] (BEAM-9901) Beam python nexmark benchmark suite
Yichi Zhang created BEAM-9901: - Summary: Beam python nexmark benchmark suite Key: BEAM-9901 URL: https://issues.apache.org/jira/browse/BEAM-9901 Project: Beam Issue Type: Task Components: benchmarking-py, sdk-py-core, testing-nexmark Reporter: Yichi Zhang Nexmark is a suite of queries (pipelines) used to measure performance and non-regression in Beam. Currently it exists in java sdk: [https://github.com/apache/beam/tree/master/sdks/java/testing/nexmark/src/main/java/org/apache/beam/sdk/nexmark] In this project we would like to create the nexmark benchmark suite in python sdk equivalent to what BEAM has for java. This allows us to determine performance impact on pull requests for python pipelines. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (BEAM-9727) Auto populate required feature experiment flags for enable dataflow runner v2
Yichi Zhang created BEAM-9727: - Summary: Auto populate required feature experiment flags for enable dataflow runner v2 Key: BEAM-9727 URL: https://issues.apache.org/jira/browse/BEAM-9727 Project: Beam Issue Type: Task Components: runner-dataflow Reporter: Yichi Zhang Assignee: Yichi Zhang Fix For: 2.21.0 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-8944) Python SDK harness performance degradation with UnboundedThreadPoolExecutor
[ https://issues.apache.org/jira/browse/BEAM-8944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071998#comment-17071998 ] Yichi Zhang commented on BEAM-8944: --- [~mxm] does the lower checkpoint duration mean that FlinkRunner is faster without UnboundedThreadPoolExecutor? > Python SDK harness performance degradation with UnboundedThreadPoolExecutor > --- > > Key: BEAM-8944 > URL: https://issues.apache.org/jira/browse/BEAM-8944 > Project: Beam > Issue Type: Bug > Components: sdk-py-harness >Affects Versions: 2.18.0 >Reporter: Yichi Zhang >Priority: Critical > Fix For: 2.18.0 > > Attachments: profiling.png, profiling_one_thread.png, > profiling_twelve_threads.png > > Time Spent: 4h 20m > Remaining Estimate: 0h > > We are seeing a performance degradation for python streaming word count load > tests. > > After some investigation, it appears to be caused by swapping the original > ThreadPoolExecutor to UnboundedThreadPoolExecutor in the sdk worker. The suspicion is > that python performance is worse with more threads on cpu-bound tasks. > > A simple test comparing the performance of the thread pool executors: > > {code:python} > def test_performance(self): > def run_perf(executor): > total_number = 100 > q = queue.Queue() > def task(number): > hash(number) > q.put(number + 200) > return number > t = time.time() > count = 0 > for i in range(200): > q.put(i) > while count < total_number: > executor.submit(task, q.get(block=True)) > count += 1 > print('%s uses %s' % (executor, time.time() - t)) > with UnboundedThreadPoolExecutor() as executor: > run_perf(executor) > with futures.ThreadPoolExecutor(max_workers=1) as executor: > run_perf(executor) > with futures.ThreadPoolExecutor(max_workers=12) as executor: > run_perf(executor) > {code} > Results: > UnboundedThreadPoolExecutor (at 0x7fab400dbe50) uses 268.160675049 > ThreadPoolExecutor(max_workers=1) uses 79.904583931 > ThreadPoolExecutor(max_workers=12) uses 191.179054976 > Profiling: > UnboundedThreadPoolExecutor: > !profiling.png! 
> 1 Thread ThreadPoolExecutor: > !profiling_one_thread.png! > 12 Threads ThreadPoolExecutor: > !profiling_twelve_threads.png! -- This message was sent by Atlassian Jira (v8.3.4#803005)
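The comparison quoted above can be reproduced with a stdlib-only benchmark: the same CPU-bound task submitted to thread pools of different sizes. Absolute times vary by machine; the point the issue makes is that extra threads do not help CPU-bound Python code because of the GIL. The workload below is a simplified stand-in, not the original test:

```python
# A runnable, stdlib-only version of the thread pool comparison described
# above: a CPU-bound task (hash) submitted to pools of different sizes.
import time
from concurrent import futures

def run_perf(executor, total=20000):
    """Submit `total` CPU-bound tasks and return the elapsed wall time."""
    t = time.time()
    fs = [executor.submit(hash, i) for i in range(total)]
    futures.wait(fs)  # block until every task has finished
    return time.time() - t

with futures.ThreadPoolExecutor(max_workers=1) as ex:
    one = run_perf(ex)
with futures.ThreadPoolExecutor(max_workers=12) as ex:
    twelve = run_perf(ex)
print("1 worker: %.3fs, 12 workers: %.3fs" % (one, twelve))
```

With CPU-bound work the 12-worker pool typically does no better than the single worker, and often worse, because the GIL serializes execution while the extra threads add contention and context-switch overhead.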
[jira] [Commented] (BEAM-9649) beam_python_mongoio_load_test started failing due to mismatched results
[ https://issues.apache.org/jira/browse/BEAM-9649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071971#comment-17071971 ] Yichi Zhang commented on BEAM-9649: --- cc: [~chamikara] fyi. Not sure when this started, or whether it is a test setup issue or a bug; when the test was first set up it always passed. I'll try to take a look when I have free cycles. > beam_python_mongoio_load_test started failing due to mismatched results > --- > > Key: BEAM-9649 > URL: https://issues.apache.org/jira/browse/BEAM-9649 > Project: Beam > Issue Type: Task > Components: io-py-mongodb >Reporter: Yichi Zhang >Assignee: Yichi Zhang >Priority: Critical > Attachments: j5vwSDNmTBK.png, mHP2wb3rdTG.png > > > The load tests fail sometimes with a mismatched sum result for example > [https://builds.apache.org/job/beam_python_mongoio_load_test/438/console] > Seems sometimes the Read operation is not able to fetch all the data. > !j5vwSDNmTBK.png|width=1005,height=752! > !mHP2wb3rdTG.png|width=994,height=780! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-9649) beam_python_mongoio_load_test started failing due to mismatched results
[ https://issues.apache.org/jira/browse/BEAM-9649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yichi Zhang updated BEAM-9649: -- Priority: Critical (was: Minor) > beam_python_mongoio_load_test started failing due to mismatched results > --- > > Key: BEAM-9649 > URL: https://issues.apache.org/jira/browse/BEAM-9649 > Project: Beam > Issue Type: Task > Components: io-py-mongodb >Reporter: Yichi Zhang >Assignee: Yichi Zhang >Priority: Critical > Attachments: j5vwSDNmTBK.png, mHP2wb3rdTG.png > > > The load tests fail sometimes with a mismatched sum result for example > [https://builds.apache.org/job/beam_python_mongoio_load_test/438/console] > Seems sometimes the Read operation is not able to fetch all the data. > !j5vwSDNmTBK.png|width=1005,height=752! > !mHP2wb3rdTG.png|width=994,height=780! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-9649) beam_python_mongoio_load_test started failing due to mismatched results
[ https://issues.apache.org/jira/browse/BEAM-9649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yichi Zhang updated BEAM-9649: -- Attachment: mHP2wb3rdTG.png > beam_python_mongoio_load_test started failing due to mismatched results > --- > > Key: BEAM-9649 > URL: https://issues.apache.org/jira/browse/BEAM-9649 > Project: Beam > Issue Type: Task > Components: io-py-mongodb >Reporter: Yichi Zhang >Assignee: Yichi Zhang >Priority: Minor > Attachments: j5vwSDNmTBK.png, mHP2wb3rdTG.png > > > The load tests fail sometimes with a mismatched sum result for example > [https://builds.apache.org/job/beam_python_mongoio_load_test/438/console] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-9649) beam_python_mongoio_load_test started failing due to mismatched results
[ https://issues.apache.org/jira/browse/BEAM-9649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yichi Zhang updated BEAM-9649: -- Description: The load tests fail sometimes with a mismatched sum result for example [https://builds.apache.org/job/beam_python_mongoio_load_test/438/console] Seems sometimes the Read operation is not able to fetch all the data. !j5vwSDNmTBK.png|width=1005,height=752! !mHP2wb3rdTG.png|width=994,height=780! was: The load tests fail sometimes with a mismatched sum result for example [https://builds.apache.org/job/beam_python_mongoio_load_test/438/console] > beam_python_mongoio_load_test started failing due to mismatched results > --- > > Key: BEAM-9649 > URL: https://issues.apache.org/jira/browse/BEAM-9649 > Project: Beam > Issue Type: Task > Components: io-py-mongodb >Reporter: Yichi Zhang >Assignee: Yichi Zhang >Priority: Minor > Attachments: j5vwSDNmTBK.png, mHP2wb3rdTG.png > > > The load tests fail sometimes with a mismatched sum result for example > [https://builds.apache.org/job/beam_python_mongoio_load_test/438/console] > Seems sometimes the Read operation is not able to fetch all the data. > !j5vwSDNmTBK.png|width=1005,height=752! > !mHP2wb3rdTG.png|width=994,height=780! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-9649) beam_python_mongoio_load_test started failing due to mismatched results
[ https://issues.apache.org/jira/browse/BEAM-9649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yichi Zhang updated BEAM-9649: -- Attachment: j5vwSDNmTBK.png > beam_python_mongoio_load_test started failing due to mismatched results > --- > > Key: BEAM-9649 > URL: https://issues.apache.org/jira/browse/BEAM-9649 > Project: Beam > Issue Type: Task > Components: io-py-mongodb >Reporter: Yichi Zhang >Assignee: Yichi Zhang >Priority: Minor > Attachments: j5vwSDNmTBK.png, mHP2wb3rdTG.png > > > The load tests fail sometimes with a mismatched sum result for example > [https://builds.apache.org/job/beam_python_mongoio_load_test/438/console] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-9649) beam_python_mongoio_load_test started failing due to mismatched results
[ https://issues.apache.org/jira/browse/BEAM-9649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yichi Zhang updated BEAM-9649: -- Summary: beam_python_mongoio_load_test started failing due to mismatched results (was: beam_python_mongoio_load_test started failing due to one mismatched results) > beam_python_mongoio_load_test started failing due to mismatched results > --- > > Key: BEAM-9649 > URL: https://issues.apache.org/jira/browse/BEAM-9649 > Project: Beam > Issue Type: Task > Components: io-py-mongodb >Reporter: Yichi Zhang >Assignee: Yichi Zhang >Priority: Minor > > The load tests fail sometimes with a mismatched sum result for example > [https://builds.apache.org/job/beam_python_mongoio_load_test/438/console] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-9649) beam_python_mongoio_load_test started failing due to one mismatched results
[ https://issues.apache.org/jira/browse/BEAM-9649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yichi Zhang updated BEAM-9649: -- Description: The load tests fail sometimes with a mismatched sum result for example [https://builds.apache.org/job/beam_python_mongoio_load_test/438/console] was: The load tests fail sometimes with a mismatched result for example [https://builds.apache.org/job/beam_python_mongoio_load_test/438/console] in failed runs it always fails with missing elements [4950] and the extra unexpected random element > beam_python_mongoio_load_test started failing due to one mismatched results > --- > > Key: BEAM-9649 > URL: https://issues.apache.org/jira/browse/BEAM-9649 > Project: Beam > Issue Type: Task > Components: io-py-mongodb >Reporter: Yichi Zhang >Assignee: Yichi Zhang >Priority: Minor > > The load tests fail sometimes with a mismatched sum result for example > [https://builds.apache.org/job/beam_python_mongoio_load_test/438/console] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (BEAM-9649) beam_python_mongoio_load_test started failing due to one mismatched results
Yichi Zhang created BEAM-9649: - Summary: beam_python_mongoio_load_test started failing due to one mismatched results Key: BEAM-9649 URL: https://issues.apache.org/jira/browse/BEAM-9649 Project: Beam Issue Type: Task Components: io-py-mongodb Reporter: Yichi Zhang Assignee: Yichi Zhang The load tests sometimes fail with a mismatched result, for example [https://builds.apache.org/job/beam_python_mongoio_load_test/438/console]; in failed runs it always fails with missing elements [4950] and an extra unexpected random element. -- This message was sent by Atlassian Jira (v8.3.4#803005)
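A set diff makes this kind of mismatch (one missing element plus an unexpected extra one) easy to pinpoint; a minimal sketch of such a check, illustrative only and not the actual load-test code:

```python
def diff_results(expected, actual):
    """Report which elements are missing from and extra in `actual`."""
    expected_set, actual_set = set(expected), set(actual)
    return sorted(expected_set - actual_set), sorted(actual_set - expected_set)

# Simulate a read that drops one document and picks up a stray one.
expected = range(100)                                   # documents written
actual = [n for n in range(100) if n != 42] + [123456]  # simulated bad read
missing, extra = diff_results(expected, actual)
print(missing, extra)  # [42] [123456]
```

Reporting the diff rather than just the mismatched sum is what makes failures like "missing elements [4950]" actionable.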
[jira] [Created] (BEAM-9607) _SDFBoundedSourceWrapper should expose underlying source display_data
Yichi Zhang created BEAM-9607: - Summary: _SDFBoundedSourceWrapper should expose underlying source display_data Key: BEAM-9607 URL: https://issues.apache.org/jira/browse/BEAM-9607 Project: Beam Issue Type: Task Components: io-py-gcp, sdk-py-core Reporter: Yichi Zhang Assignee: Boyuan Zhang It seems that _SDFBoundedSourceWrapper hides the display data added to the underlying source. We should try to expose that data when it exists. -- This message was sent by Atlassian Jira (v8.3.4#803005)
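A minimal sketch of the proposed behavior, using stand-in classes rather than the actual Beam types: the wrapper forwards `display_data()` to the source it wraps instead of hiding it.

```python
class BoundedSource:
    """Stand-in for a source that carries display data."""
    def display_data(self):
        return {'source': type(self).__name__}

class FileSource(BoundedSource):
    def __init__(self, pattern):
        self.pattern = pattern
    def display_data(self):
        data = super().display_data()
        data['file_pattern'] = self.pattern
        return data

class SDFBoundedSourceWrapper:
    """Sketch of a wrapper that forwards the wrapped source's display data."""
    def __init__(self, source):
        self._source = source
    def display_data(self):
        # Delegate so the underlying source's entries stay visible in the UI.
        return self._source.display_data()

wrapper = SDFBoundedSourceWrapper(FileSource('gs://bucket/*.txt'))
print(wrapper.display_data())  # {'source': 'FileSource', 'file_pattern': 'gs://bucket/*.txt'}
```

In the real SDK the wrapper would merge the source's entries with its own rather than replace them, but the delegation is the essential part.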
[jira] [Closed] (BEAM-9336) beam_PostCommit_Py_ValCont tests timeout
[ https://issues.apache.org/jira/browse/BEAM-9336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yichi Zhang closed BEAM-9336. - Fix Version/s: Not applicable Resolution: Duplicate > beam_PostCommit_Py_ValCont tests timeout > - > > Key: BEAM-9336 > URL: https://issues.apache.org/jira/browse/BEAM-9336 > Project: Beam > Issue Type: Bug > Components: test-failures >Reporter: Yichi Zhang >Priority: Minor > Labels: currently-failing > Fix For: Not applicable > > Time Spent: 50m > Remaining Estimate: 0h > > > * [[https://builds.apache.org/job/beam_PostCommit_Py_ValCont/]] > Initial investigation: > The tests seem to fail due to the pytest global timeout. > > _After you've filled out the above details, please [assign the issue to an > individual|https://beam.apache.org/contribute/postcommits-guides/index.html#find_specialist]. > Assignee should [treat test failures as > high-priority|https://beam.apache.org/contribute/postcommits-policies/#assigned-failing-test], > helping to fix the issue or find a more appropriate owner. See [Apache Beam > Post-Commit > Policies|https://beam.apache.org/contribute/postcommits-policies]._ -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (BEAM-9336) beam_PostCommit_Py_ValCont tests timeout
Yichi Zhang created BEAM-9336: - Summary: beam_PostCommit_Py_ValCont tests timeout Key: BEAM-9336 URL: https://issues.apache.org/jira/browse/BEAM-9336 Project: Beam Issue Type: Bug Components: test-failures Reporter: Yichi Zhang _Use this form to file an issue for test failure:_ * [[https://builds.apache.org/job/beam_PostCommit_Py_ValCont/]] Initial investigation: (Add any investigation notes so far) _After you've filled out the above details, please [assign the issue to an individual|https://beam.apache.org/contribute/postcommits-guides/index.html#find_specialist]. Assignee should [treat test failures as high-priority|https://beam.apache.org/contribute/postcommits-policies/#assigned-failing-test], helping to fix the issue or find a more appropriate owner. See [Apache Beam Post-Commit Policies|https://beam.apache.org/contribute/postcommits-policies]._ -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-9336) beam_PostCommit_Py_ValCont tests timeout
[ https://issues.apache.org/jira/browse/BEAM-9336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yichi Zhang updated BEAM-9336: -- Description: * [[https://builds.apache.org/job/beam_PostCommit_Py_ValCont/]] Initial investigation: The tests seem to fail due to the pytest global timeout. _After you've filled out the above details, please [assign the issue to an individual|https://beam.apache.org/contribute/postcommits-guides/index.html#find_specialist]. Assignee should [treat test failures as high-priority|https://beam.apache.org/contribute/postcommits-policies/#assigned-failing-test], helping to fix the issue or find a more appropriate owner. See [Apache Beam Post-Commit Policies|https://beam.apache.org/contribute/postcommits-policies]._ was: _Use this form to file an issue for test failure:_ * [[https://builds.apache.org/job/beam_PostCommit_Py_ValCont/]] Initial investigation: (Add any investigation notes so far) _After you've filled out the above details, please [assign the issue to an individual|https://beam.apache.org/contribute/postcommits-guides/index.html#find_specialist]. Assignee should [treat test failures as high-priority|https://beam.apache.org/contribute/postcommits-policies/#assigned-failing-test], helping to fix the issue or find a more appropriate owner. See [Apache Beam Post-Commit Policies|https://beam.apache.org/contribute/postcommits-policies]._ > beam_PostCommit_Py_ValCont tests timeout > - > > Key: BEAM-9336 > URL: https://issues.apache.org/jira/browse/BEAM-9336 > Project: Beam > Issue Type: Bug > Components: test-failures >Reporter: Yichi Zhang >Priority: Minor > Labels: currently-failing > > > * [[https://builds.apache.org/job/beam_PostCommit_Py_ValCont/]] > Initial investigation: > The tests seem to fail due to the pytest global timeout. > > _After you've filled out the above details, please [assign the issue to an > individual|https://beam.apache.org/contribute/postcommits-guides/index.html#find_specialist]. 
> Assignee should [treat test failures as > high-priority|https://beam.apache.org/contribute/postcommits-policies/#assigned-failing-test], > helping to fix the issue or find a more appropriate owner. See [Apache Beam > Post-Commit > Policies|https://beam.apache.org/contribute/postcommits-policies]._ -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Closed] (BEAM-9334) beam_PreCommit_Java_Cron failed on org.apache.beam.runners.samza.runtime.SamzaTimerInternalsFactoryTest.testProcessingTimeTimers
[ https://issues.apache.org/jira/browse/BEAM-9334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yichi Zhang closed BEAM-9334. - Fix Version/s: Not applicable Resolution: Duplicate > beam_PreCommit_Java_Cron failed on > org.apache.beam.runners.samza.runtime.SamzaTimerInternalsFactoryTest.testProcessingTimeTimers > > > Key: BEAM-9334 > URL: https://issues.apache.org/jira/browse/BEAM-9334 > Project: Beam > Issue Type: Task > Components: runner-samza >Reporter: Yichi Zhang >Priority: Major > Fix For: Not applicable > > > h3. Error Message > org.apache.samza.SamzaException: Error opening RocksDB store beamStore at > location /tmp/store3 > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (BEAM-9334) beam_PreCommit_Java_Cron failed on org.apache.beam.runners.samza.runtime.SamzaTimerInternalsFactoryTest.testProcessingTimeTimers
Yichi Zhang created BEAM-9334: - Summary: beam_PreCommit_Java_Cron failed on org.apache.beam.runners.samza.runtime.SamzaTimerInternalsFactoryTest.testProcessingTimeTimers Key: BEAM-9334 URL: https://issues.apache.org/jira/browse/BEAM-9334 Project: Beam Issue Type: Task Components: runner-samza Reporter: Yichi Zhang h3. Error Message org.apache.samza.SamzaException: Error opening RocksDB store beamStore at location /tmp/store3 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (BEAM-9263) Bump python sdk fnapi version to enable status reporting
Yichi Zhang created BEAM-9263: - Summary: Bump python sdk fnapi version to enable status reporting Key: BEAM-9263 URL: https://issues.apache.org/jira/browse/BEAM-9263 Project: Beam Issue Type: Task Components: sdk-py-core Affects Versions: 2.20.0 Reporter: Yichi Zhang Assignee: Yichi Zhang Bump the python sdk fn api environment version to 8 to roll out the status feature for sdk harness status reporting. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (BEAM-8614) Expose SDK harness status to Runner through FnApi
[ https://issues.apache.org/jira/browse/BEAM-8614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yichi Zhang resolved BEAM-8614. --- Fix Version/s: 2.20.0 Resolution: Fixed > Expose SDK harness status to Runner through FnApi > - > > Key: BEAM-8614 > URL: https://issues.apache.org/jira/browse/BEAM-8614 > Project: Beam > Issue Type: New Feature > Components: sdk-java-harness, sdk-py-harness >Reporter: Yichi Zhang >Assignee: Yichi Zhang >Priority: Major > Fix For: 2.20.0 > > > Expose SDK harness debug information to the runner for better debuggability of SDK > harness running with beam fn api. > > doc: > [https://docs.google.com/document/d/1W77buQtdSEIPUKd9zemAM38fb-x3CvOoaTF4P2mSxmI/edit#heading=h.mersh3vo53ar] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (BEAM-9122) Add uses_keyed_state step property to python dataflow runner
[ https://issues.apache.org/jira/browse/BEAM-9122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yichi Zhang resolved BEAM-9122. --- Fix Version/s: 2.19.0 Resolution: Fixed > Add uses_keyed_state step property to python dataflow runner > > > Key: BEAM-9122 > URL: https://issues.apache.org/jira/browse/BEAM-9122 > Project: Beam > Issue Type: Improvement > Components: sdk-py-core >Reporter: Yichi Zhang >Assignee: Yichi Zhang >Priority: Major > Fix For: 2.19.0 > > Time Spent: 2h > Remaining Estimate: 0h > > Add additional step property to dataflow job property when a DoFn is stateful > in python sdk. So that the backend runner can recognize stateful steps. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (BEAM-8626) Implement status api handler in python sdk harness
[ https://issues.apache.org/jira/browse/BEAM-8626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yichi Zhang resolved BEAM-8626. --- Fix Version/s: 2.20.0 Resolution: Fixed > Implement status api handler in python sdk harness > -- > > Key: BEAM-8626 > URL: https://issues.apache.org/jira/browse/BEAM-8626 > Project: Beam > Issue Type: Sub-task > Components: sdk-py-harness >Reporter: Yichi Zhang >Assignee: Yichi Zhang >Priority: Major > Fix For: 2.20.0 > > Time Spent: 7.5h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (BEAM-8625) Implement servlet in Dataflow runner for sdk status query endpoint
[ https://issues.apache.org/jira/browse/BEAM-8625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yichi Zhang resolved BEAM-8625. --- Fix Version/s: 2.20.0 Resolution: Fixed > Implement servlet in Dataflow runner for sdk status query endpoint > -- > > Key: BEAM-8625 > URL: https://issues.apache.org/jira/browse/BEAM-8625 > Project: Beam > Issue Type: Sub-task > Components: runner-dataflow >Reporter: Yichi Zhang >Assignee: Yichi Zhang >Priority: Major > Fix For: 2.20.0 > > Time Spent: 2h 10m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (BEAM-9165) Clean up StatusServer from python sdk harness when status reporting over fnapi is enabled
Yichi Zhang created BEAM-9165: - Summary: Clean up StatusServer from python sdk harness when status reporting over fnapi is enabled Key: BEAM-9165 URL: https://issues.apache.org/jira/browse/BEAM-9165 Project: Beam Issue Type: Improvement Components: sdk-py-harness Reporter: Yichi Zhang When the SDK harness reports status over the fn api, the runner can expose each individual SDK harness's status, so we probably won't need to embed a separate http server in the python SDK harness anymore. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (BEAM-9122) Add uses_keyed_state step property to python dataflow runner
Yichi Zhang created BEAM-9122: - Summary: Add uses_keyed_state step property to python dataflow runner Key: BEAM-9122 URL: https://issues.apache.org/jira/browse/BEAM-9122 Project: Beam Issue Type: Improvement Components: sdk-py-core Reporter: Yichi Zhang Add an additional step property to the dataflow job properties when a DoFn is stateful in the python sdk, so that the backend runner can recognize stateful steps. -- This message was sent by Atlassian Jira (v8.3.4#803005)
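One way to decide whether a DoFn is stateful is to scan its class for declared state or timer specs. A hedged sketch of that idea, using stand-in classes rather than the actual `apache_beam` types:

```python
class StateSpec:
    """Stand-in for a user state declaration on a DoFn."""
    pass

class TimerSpec:
    """Stand-in for a timer declaration on a DoFn."""
    pass

def uses_keyed_state(dofn_class):
    """True if the DoFn class declares any state or timer spec attribute."""
    return any(isinstance(getattr(dofn_class, attr), (StateSpec, TimerSpec))
               for attr in dir(dofn_class))

class StatefulDoFn:
    COUNT = StateSpec()  # declared keyed state
    def process(self, element):
        yield element

class PlainDoFn:
    def process(self, element):
        yield element

print(uses_keyed_state(StatefulDoFn), uses_keyed_state(PlainDoFn))  # True False
```

A runner-side translator could set the `uses_keyed_state` step property whenever such a check returns True.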
[jira] [Assigned] (BEAM-9122) Add uses_keyed_state step property to python dataflow runner
[ https://issues.apache.org/jira/browse/BEAM-9122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yichi Zhang reassigned BEAM-9122: - Assignee: Yichi Zhang > Add uses_keyed_state step property to python dataflow runner > > > Key: BEAM-9122 > URL: https://issues.apache.org/jira/browse/BEAM-9122 > Project: Beam > Issue Type: Improvement > Components: sdk-py-core >Reporter: Yichi Zhang >Assignee: Yichi Zhang >Priority: Major > > Add additional step property to dataflow job property when a DoFn is stateful > in python sdk. So that the backend runner can recognize stateful steps. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (BEAM-8624) Implement FnService for status api in Dataflow runner
[ https://issues.apache.org/jira/browse/BEAM-8624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yichi Zhang resolved BEAM-8624. --- Fix Version/s: 2.19.0 Resolution: Fixed > Implement FnService for status api in Dataflow runner > - > > Key: BEAM-8624 > URL: https://issues.apache.org/jira/browse/BEAM-8624 > Project: Beam > Issue Type: Sub-task > Components: runner-dataflow >Reporter: Yichi Zhang >Assignee: Yichi Zhang >Priority: Major > Fix For: 2.19.0 > > Time Spent: 16h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (BEAM-8824) Add support for allowed lateness in python sdk
[ https://issues.apache.org/jira/browse/BEAM-8824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yichi Zhang resolved BEAM-8824. --- Fix Version/s: 2.18.0 Resolution: Fixed > Add support for allowed lateness in python sdk > -- > > Key: BEAM-8824 > URL: https://issues.apache.org/jira/browse/BEAM-8824 > Project: Beam > Issue Type: Improvement > Components: sdk-py-core >Reporter: Yichi Zhang >Assignee: Yichi Zhang >Priority: Major > Fix For: 2.18.0 > > Time Spent: 9h 40m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (BEAM-8736) Support window allowed lateness in python sdk
[ https://issues.apache.org/jira/browse/BEAM-8736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yichi Zhang resolved BEAM-8736. --- Fix Version/s: 2.18.0 Resolution: Fixed > Support window allowed lateness in python sdk > - > > Key: BEAM-8736 > URL: https://issues.apache.org/jira/browse/BEAM-8736 > Project: Beam > Issue Type: New Feature > Components: sdk-py-core >Reporter: Yichi Zhang >Assignee: Yichi Zhang >Priority: Major > Fix For: 2.18.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (BEAM-8944) Python SDK harness performance degradation with UnboundedThreadPoolExecutor
[ https://issues.apache.org/jira/browse/BEAM-8944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yichi Zhang resolved BEAM-8944. --- Resolution: Fixed Mitigation is merged; further investigation of how to improve performance will be followed up in BEAM-8998
> Python SDK harness performance degradation with UnboundedThreadPoolExecutor
> ---
>
> Key: BEAM-8944
> URL: https://issues.apache.org/jira/browse/BEAM-8944
> Project: Beam
> Issue Type: Bug
> Components: sdk-py-harness
> Affects Versions: 2.18.0
> Reporter: Yichi Zhang
> Assignee: Yichi Zhang
> Priority: Blocker
> Fix For: 2.18.0
>
> Attachments: profiling.png, profiling_one_thread.png, profiling_twelve_threads.png
>
> Time Spent: 4h 20m
> Remaining Estimate: 0h
>
> We are seeing a performance degradation for python streaming word count load tests.
>
> After some investigation, it appears to be caused by swapping the original ThreadPoolExecutor to UnboundedThreadPoolExecutor in the sdk worker. The suspicion is that python performance is worse with more threads on CPU-bound tasks.
>
> A simple test comparing the performance of the thread pool executors:
>
> {code:python}
> def test_performance(self):
>   def run_perf(executor):
>     total_number = 100
>     q = queue.Queue()
>     def task(number):
>       hash(number)
>       q.put(number + 200)
>       return number
>     t = time.time()
>     count = 0
>     for i in range(200):
>       q.put(i)
>     while count < total_number:
>       executor.submit(task, q.get(block=True))
>       count += 1
>     print('%s uses %s' % (executor, time.time() - t))
>   with UnboundedThreadPoolExecutor() as executor:
>     run_perf(executor)
>   with futures.ThreadPoolExecutor(max_workers=1) as executor:
>     run_perf(executor)
>   with futures.ThreadPoolExecutor(max_workers=12) as executor:
>     run_perf(executor)
> {code}
> Results (in run order):
> UnboundedThreadPoolExecutor (object at 0x7fab400dbe50) uses 268.160675049
> ThreadPoolExecutor with max_workers=1 uses 79.904583931
> ThreadPoolExecutor with max_workers=12 uses 191.179054976
> Profiling:
> UnboundedThreadPoolExecutor:
> !profiling.png!
> 1 Thread ThreadPoolExecutor:
> !profiling_one_thread.png!
> 12 Threads ThreadPoolExecutor:
> !profiling_twelve_threads.png!
-- This message was sent by Atlassian Jira (v8.3.4#803005)
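The degradation measured above is consistent with GIL contention on CPU-bound work: more threads mean more scheduling overhead, not more throughput. A self-contained sketch (illustrative task and pool sizes, not the harness code) that compares pool sizes the same way:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def cpu_task(n):
    # Pure-Python work holds the GIL, so extra threads add contention, not speed.
    return sum(hash(i) % 7 for i in range(n))

def run_perf(workers, tasks=64, n=10000):
    """Run `tasks` CPU-bound tasks on a pool of `workers` threads."""
    with ThreadPoolExecutor(max_workers=workers) as executor:
        start = time.time()
        results = list(executor.map(cpu_task, [n] * tasks))
    return time.time() - start, results

elapsed_1, r1 = run_perf(1)
elapsed_12, r12 = run_perf(12)
assert r1 == r12  # same answers; only the wall time differs
print('1 worker: %.3fs, 12 workers: %.3fs' % (elapsed_1, elapsed_12))
```

On CPython, the single-worker run typically matches or beats the 12-worker run for this kind of workload, which is the pattern the issue's own numbers show.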
[jira] [Work started] (BEAM-8944) Python SDK harness performance degradation with UnboundedThreadPoolExecutor
[ https://issues.apache.org/jira/browse/BEAM-8944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on BEAM-8944 started by Yichi Zhang. - > Python SDK harness performance degradation with UnboundedThreadPoolExecutor > --- > > Key: BEAM-8944 > URL: https://issues.apache.org/jira/browse/BEAM-8944 > Project: Beam > Issue Type: Bug > Components: sdk-py-harness >Affects Versions: 2.18.0 >Reporter: Yichi Zhang >Assignee: Yichi Zhang >Priority: Blocker > Fix For: 2.18.0 > > Attachments: profiling.png, profiling_one_thread.png, > profiling_twelve_threads.png > > Time Spent: 4h 20m > Remaining Estimate: 0h -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Work started] (BEAM-8623) Add additional message field to Provision API response for passing status endpoint
[ https://issues.apache.org/jira/browse/BEAM-8623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on BEAM-8623 started by Yichi Zhang. - > Add additional message field to Provision API response for passing status > endpoint > -- > > Key: BEAM-8623 > URL: https://issues.apache.org/jira/browse/BEAM-8623 > Project: Beam > Issue Type: Sub-task > Components: beam-model >Reporter: Yichi Zhang >Assignee: Yichi Zhang >Priority: Minor > Time Spent: 4h 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (BEAM-8623) Add additional message field to Provision API response for passing status endpoint
[ https://issues.apache.org/jira/browse/BEAM-8623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yichi Zhang resolved BEAM-8623. --- Fix Version/s: 2.19.0 Resolution: Fixed > Add additional message field to Provision API response for passing status > endpoint > -- > > Key: BEAM-8623 > URL: https://issues.apache.org/jira/browse/BEAM-8623 > Project: Beam > Issue Type: Sub-task > Components: beam-model >Reporter: Yichi Zhang >Assignee: Yichi Zhang >Priority: Minor > Fix For: 2.19.0 > > Time Spent: 4h 20m > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-8944) Python SDK harness performance degradation with UnboundedThreadPoolExecutor
[ https://issues.apache.org/jira/browse/BEAM-8944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17001802#comment-17001802 ] Yichi Zhang commented on BEAM-8944: --- I think so > Python SDK harness performance degradation with UnboundedThreadPoolExecutor > --- > > Key: BEAM-8944 > URL: https://issues.apache.org/jira/browse/BEAM-8944 > Project: Beam > Issue Type: Bug > Components: sdk-py-harness >Affects Versions: 2.18.0 >Reporter: Yichi Zhang >Assignee: Yichi Zhang >Priority: Blocker > Fix For: 2.18.0 > > Attachments: profiling.png, profiling_one_thread.png, > profiling_twelve_threads.png > > Time Spent: 3h > Remaining Estimate: 0h -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-8998) Avoid excessive bundle progress polling in Dataflow Runner
[ https://issues.apache.org/jira/browse/BEAM-8998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17001079#comment-17001079 ] Yichi Zhang commented on BEAM-8998: --- throttling is introduced to avoid expensive scheduling problem mentioned in BEAM-5791 > Avoid excessive bundle progress polling in Dataflow Runner > -- > > Key: BEAM-8998 > URL: https://issues.apache.org/jira/browse/BEAM-8998 > Project: Beam > Issue Type: Improvement > Components: runner-dataflow >Reporter: Yichi Zhang >Priority: Major > > Dataflow Java runner uses 0.1 secs interval for polling bundle progress from > SDK Harness, and use the result to decide whether data transfer should be > throttled. This can potentially overload SDK Harness. > We should try to come up with a way to avoid the throttling and lower the > bundle progress request frequency significantly. > > Code reference: > frequency setting: > [https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/worker/src/main/java/org/apache/beam/runners/dataflow/worker/fn/control/BeamFnMapTaskExecutor.java#L296] -- This message was sent by Atlassian Jira (v8.3.4#803005)
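One direction BEAM-8998 suggests is simply polling progress less often than the hard-coded 0.1 s. A minimal sketch of a progress-polling loop with a configurable interval; `get_progress` and `is_done` are illustrative callbacks, not Beam APIs:

```python
import time

def poll_progress(get_progress, is_done, interval_secs=5.0, sleep=time.sleep):
    """Poll bundle progress at a configurable interval until the bundle finishes.

    A longer interval than 0.1s reduces the request load on the SDK harness;
    `sleep` is injectable so the loop can be tested without waiting.
    """
    samples = []
    while not is_done():
        samples.append(get_progress())
        sleep(interval_secs)
    return samples

# Simulated bundle that finishes after three polls.
state = {'polls': 0}
def fake_progress():
    state['polls'] += 1
    return state['polls'] / 3.0

samples = poll_progress(fake_progress, lambda: state['polls'] >= 3,
                        interval_secs=5.0, sleep=lambda s: None)
print(samples)  # ends at 1.0 after three polls
```

The open question in the issue is whether throttling is needed at all; if it is, a longer interval like this is the smallest mitigation.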
[jira] [Commented] (BEAM-8944) Python SDK harness performance degradation with UnboundedThreadPoolExecutor
[ https://issues.apache.org/jira/browse/BEAM-8944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17000448#comment-17000448 ] Yichi Zhang commented on BEAM-8944: --- then yeah, it'll affect python streaming jobs (which is only on portable runner with fnapi). > Python SDK harness performance degradation with UnboundedThreadPoolExecutor > --- > > Key: BEAM-8944 > URL: https://issues.apache.org/jira/browse/BEAM-8944 > Project: Beam > Issue Type: Bug > Components: sdk-py-harness >Affects Versions: 2.18.0 >Reporter: Yichi Zhang >Assignee: Yichi Zhang >Priority: Blocker > Fix For: 2.18.0 > > Attachments: profiling.png, profiling_one_thread.png, > profiling_twelve_threads.png > > Time Spent: 2h 40m > Remaining Estimate: 0h -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (BEAM-8944) Python SDK harness performance degradation with UnboundedThreadPoolExecutor
[ https://issues.apache.org/jira/browse/BEAM-8944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17000341#comment-17000341 ] Yichi Zhang edited comment on BEAM-8944 at 12/19/19 8:14 PM: - This bug doesn't affect current production runners since the change of using more threads in SDK Harness doesn't exist in current released beam versions (the Dataflow runner issue mentioned in #10387 TODO affect current production runners but has limited impact with this fix, and will be investigated later). was (Author: yichi): This bug doesn't affect current production runners since the change of using more threads in SDK Harness doesn't exist in current released beam versions (the Dataflow runner issue mentioned in #10387 TODO affect current production runners but has limited impact, and will be investigated later). > Python SDK harness performance degradation with UnboundedThreadPoolExecutor > --- > > Key: BEAM-8944 > URL: https://issues.apache.org/jira/browse/BEAM-8944 > Project: Beam > Issue Type: Bug > Components: sdk-py-harness >Affects Versions: 2.18.0 >Reporter: Yichi Zhang >Assignee: Yichi Zhang >Priority: Blocker > Fix For: 2.18.0 > > Attachments: profiling.png, profiling_one_thread.png, > profiling_twelve_threads.png > > Time Spent: 2h 40m > Remaining Estimate: 0h -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-8998) Avoid excessive bundle progress polling in Dataflow Runner
[ https://issues.apache.org/jira/browse/BEAM-8998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yichi Zhang updated BEAM-8998: -- Description: Dataflow Java runner uses 0.1 secs interval for polling bundle progress from SDK Harness, and use the result to decide whether data transfer should be throttled. This can potentially overload SDK Harness. We should try to come up with a way to avoid the throttling and lower the bundle progress request frequency significantly. Code reference: frequency setting: [https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/worker/src/main/java/org/apache/beam/runners/dataflow/worker/fn/control/BeamFnMapTaskExecutor.java#L296] was: Dataflow Java runner uses 0.1 secs interval for polling bundle progress from SDK Harness, and use the result to decide whether data delivery should be throttled. This can potentially overload SDK Harness. We should try to come up with a way to avoid the throttling and lower the bundle progress request frequency significantly. Code reference: frequency setting: [https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/worker/src/main/java/org/apache/beam/runners/dataflow/worker/fn/control/BeamFnMapTaskExecutor.java#L296] > Avoid excessive bundle progress polling in Dataflow Runner > -- > > Key: BEAM-8998 > URL: https://issues.apache.org/jira/browse/BEAM-8998 > Project: Beam > Issue Type: Improvement > Components: runner-dataflow >Reporter: Yichi Zhang >Priority: Major > > Dataflow Java runner uses 0.1 secs interval for polling bundle progress from > SDK Harness, and use the result to decide whether data transfer should be > throttled. This can potentially overload SDK Harness. > We should try to come up with a way to avoid the throttling and lower the > bundle progress request frequency significantly. 
> > Code reference: > frequency setting: > [https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/worker/src/main/java/org/apache/beam/runners/dataflow/worker/fn/control/BeamFnMapTaskExecutor.java#L296] -- This message was sent by Atlassian Jira (v8.3.4#803005)
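One common way to lower a fixed-rate polling frequency, sketched here with exponential backoff. This is only an illustration of the idea, not the BeamFnMapTaskExecutor code; `get_progress` and `should_stop` are hypothetical callables standing in for the runner's progress request and bundle-completion check:

```python
import time

def poll_progress(get_progress, should_stop,
                  initial=0.1, ceiling=5.0, factor=2.0):
    """Poll with exponential backoff instead of a fixed 0.1 s interval.

    get_progress and should_stop are hypothetical callables: one issues a
    progress request to the SDK harness, the other reports bundle completion.
    """
    interval = initial
    while not should_stop():
        get_progress()                          # one progress request
        time.sleep(interval)                    # back off before the next one
        interval = min(interval * factor, ceiling)
```

Backoff keeps early responsiveness (the first polls still come quickly) while bounding the steady-state request rate for long-running bundles.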
[jira] [Updated] (BEAM-8998) Avoid excessive bundle progress polling in Dataflow Runner
[ https://issues.apache.org/jira/browse/BEAM-8998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yichi Zhang updated BEAM-8998: -- Description: The Dataflow Java runner uses a 0.1-second interval for polling bundle progress from the SDK Harness, and uses the result to decide whether data delivery should be throttled. This can potentially overload the SDK Harness. We should find a way to avoid the throttling and significantly lower the bundle progress request frequency. Code reference: frequency setting: [https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/worker/src/main/java/org/apache/beam/runners/dataflow/worker/fn/control/BeamFnMapTaskExecutor.java#L296] was: The Dataflow Java runner uses a 0.1-second interval for polling bundle progress from the SDK Harness, and uses the result to decide whether data delivery should be throttled. This can potentially overload the SDK Harness. We should find a way to avoid the throttling and significantly lower the bundle progress request frequency. > Avoid excessive bundle progress polling in Dataflow Runner > -- > > Key: BEAM-8998 > URL: https://issues.apache.org/jira/browse/BEAM-8998 > Project: Beam > Issue Type: Improvement > Components: runner-dataflow >Reporter: Yichi Zhang >Priority: Major > > The Dataflow Java runner uses a 0.1-second interval for polling bundle progress from > the SDK Harness, and uses the result to decide whether data delivery should be > throttled. This can potentially overload the SDK Harness. > We should find a way to avoid the throttling and significantly lower the > bundle progress request frequency. > > Code reference: > frequency setting: > [https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/worker/src/main/java/org/apache/beam/runners/dataflow/worker/fn/control/BeamFnMapTaskExecutor.java#L296] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-8998) Avoid excessive bundle progress polling in Dataflow Runner
[ https://issues.apache.org/jira/browse/BEAM-8998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yichi Zhang updated BEAM-8998: -- Summary: Avoid excessive bundle progress polling in Dataflow Runner (was: Avoid excessive bundle progress polling in JRH) > Avoid excessive bundle progress polling in Dataflow Runner > -- > > Key: BEAM-8998 > URL: https://issues.apache.org/jira/browse/BEAM-8998 > Project: Beam > Issue Type: Improvement > Components: runner-dataflow >Reporter: Yichi Zhang >Priority: Major > > The Dataflow Java runner uses a 0.1-second interval for polling bundle progress from > the SDK Harness, and uses the result to decide whether data delivery should be > throttled. This can potentially overload the SDK Harness. > We should find a way to avoid the throttling and significantly lower the > bundle progress request frequency. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (BEAM-8998) Avoid excessive bundle progress polling in JRH
Yichi Zhang created BEAM-8998: - Summary: Avoid excessive bundle progress polling in JRH Key: BEAM-8998 URL: https://issues.apache.org/jira/browse/BEAM-8998 Project: Beam Issue Type: Improvement Components: runner-dataflow Reporter: Yichi Zhang The Dataflow Java runner uses a 0.1-second interval for polling bundle progress from the SDK Harness, and uses the result to decide whether data delivery should be throttled. This can potentially overload the SDK Harness. We should find a way to avoid the throttling and significantly lower the bundle progress request frequency. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-8944) Python SDK harness performance degradation with UnboundedThreadPoolExecutor
[ https://issues.apache.org/jira/browse/BEAM-8944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yichi Zhang updated BEAM-8944: -- Priority: Blocker (was: Major) > Python SDK harness performance degradation with UnboundedThreadPoolExecutor > --- > > Key: BEAM-8944 > URL: https://issues.apache.org/jira/browse/BEAM-8944 > Project: Beam > Issue Type: Bug > Components: sdk-py-harness >Affects Versions: 2.18.0 >Reporter: Yichi Zhang >Priority: Blocker > Attachments: profiling.png, profiling_one_thread.png, > profiling_twelve_threads.png > > Time Spent: 50m > Remaining Estimate: 0h > > We are seeing a performance degradation for python streaming word count load > tests. > > After some investigation, it appears to be caused by swapping the original > ThreadPoolExecutor to UnboundedThreadPoolExecutor in sdk worker. Suspicion is > that python performance is worse with more threads on cpu-bounded tasks. > > A simple test for comparing the multiple thread pool executor performance: > > {code:python} > def test_performance(self): > def run_perf(executor): > total_number = 100 > q = queue.Queue() > def task(number): > hash(number) > q.put(number + 200) > return number > t = time.time() > count = 0 > for i in range(200): > q.put(i) > while count < total_number: > executor.submit(task, q.get(block=True)) > count += 1 > print('%s uses %s' % (executor, time.time() - t)) > with UnboundedThreadPoolExecutor() as executor: > run_perf(executor) > with futures.ThreadPoolExecutor(max_workers=1) as executor: > run_perf(executor) > with futures.ThreadPoolExecutor(max_workers=12) as executor: > run_perf(executor) > {code} > Results: > UnboundedThreadPoolExecutor (object at 0x7fab400dbe50) uses 268.160675049 > ThreadPoolExecutor(max_workers=1) uses 79.904583931 > ThreadPoolExecutor(max_workers=12) uses 191.179054976 > Profiling: > UnboundedThreadPoolExecutor: > !profiling.png! > 1 Thread ThreadPoolExecutor: > !profiling_one_thread.png! > 12 Threads ThreadPoolExecutor: > !profiling_twelve_threads.png! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-8944) Python SDK harness performance degradation with UnboundedThreadPoolExecutor
[ https://issues.apache.org/jira/browse/BEAM-8944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yichi Zhang updated BEAM-8944: -- Issue Type: Bug (was: Test) > Python SDK harness performance degradation with UnboundedThreadPoolExecutor > --- > > Key: BEAM-8944 > URL: https://issues.apache.org/jira/browse/BEAM-8944 > Project: Beam > Issue Type: Bug > Components: sdk-py-harness >Affects Versions: 2.18.0 >Reporter: Yichi Zhang >Priority: Major > Attachments: profiling.png, profiling_one_thread.png, > profiling_twelve_threads.png > > Time Spent: 50m > Remaining Estimate: 0h > > We are seeing a performance degradation for python streaming word count load > tests. > > After some investigation, it appears to be caused by swapping the original > ThreadPoolExecutor to UnboundedThreadPoolExecutor in sdk worker. Suspicion is > that python performance is worse with more threads on cpu-bounded tasks. > > A simple test for comparing the multiple thread pool executor performance: > > {code:python} > def test_performance(self): > def run_perf(executor): > total_number = 100 > q = queue.Queue() > def task(number): > hash(number) > q.put(number + 200) > return number > t = time.time() > count = 0 > for i in range(200): > q.put(i) > while count < total_number: > executor.submit(task, q.get(block=True)) > count += 1 > print('%s uses %s' % (executor, time.time() - t)) > with UnboundedThreadPoolExecutor() as executor: > run_perf(executor) > with futures.ThreadPoolExecutor(max_workers=1) as executor: > run_perf(executor) > with futures.ThreadPoolExecutor(max_workers=12) as executor: > run_perf(executor) > {code} > Results: > UnboundedThreadPoolExecutor (object at 0x7fab400dbe50) uses 268.160675049 > ThreadPoolExecutor(max_workers=1) uses 79.904583931 > ThreadPoolExecutor(max_workers=12) uses 191.179054976 > Profiling: > UnboundedThreadPoolExecutor: > !profiling.png! > 1 Thread ThreadPoolExecutor: > !profiling_one_thread.png! > 12 Threads ThreadPoolExecutor: > !profiling_twelve_threads.png! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (BEAM-8886) Add a python mongodbio integration test that triggers load split
[ https://issues.apache.org/jira/browse/BEAM-8886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yichi Zhang resolved BEAM-8886. --- Fix Version/s: 2.19.0 Resolution: Fixed > Add a python mongodbio integration test that triggers load split > > > Key: BEAM-8886 > URL: https://issues.apache.org/jira/browse/BEAM-8886 > Project: Beam > Issue Type: Improvement > Components: sdk-py-core >Reporter: Yichi Zhang >Assignee: Yichi Zhang >Priority: Minor > Fix For: 2.19.0 > > Time Spent: 7h 40m > Remaining Estimate: 0h > > Current integration test doesn't seem to trigger liquid sharding at all, we > should change integration test that has more load and potentially use the > mongodb k8s cluster. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (BEAM-8884) Python MongoDBIO TypeError when splitting
[ https://issues.apache.org/jira/browse/BEAM-8884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yichi Zhang resolved BEAM-8884. --- Resolution: Fixed > Python MongoDBIO TypeError when splitting > - > > Key: BEAM-8884 > URL: https://issues.apache.org/jira/browse/BEAM-8884 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Brian Hulette >Assignee: Yichi Zhang >Priority: Major > Fix For: 2.18.0 > > Time Spent: 3h 40m > Remaining Estimate: 0h > > From [slack|https://the-asf.slack.com/archives/CBDNLQZM1/p1575350991134000]: > I am trying to run a pipeline (defined with the Python SDK) on Dataflow that > uses beam.io.ReadFromMongoDB. When dealing with very small datasets (<10mb) > it runs fine, when trying to run it with slightly larger datasets (70mb), I > always get this error: > {code:} > TypeError: '<' not supported between instances of 'dict' and 'ObjectId' > {code} > Stack trace see below. Running it on a local machine works just fine. I would > highly appreciate any pointers what this could be. > I hope this is the right channel do address this. 
> {code:} > Traceback (most recent call last): > File > "/usr/local/lib/python3.7/site-packages/dataflow_worker/batchworker.py", line > 649, in do_work > work_executor.execute() > File "/usr/local/lib/python3.7/site-packages/dataflow_worker/executor.py", > line 218, in execute > self._split_task) > File "/usr/local/lib/python3.7/site-packages/dataflow_worker/executor.py", > line 226, in _perform_source_split_considering_api_limits > desired_bundle_size) > File "/usr/local/lib/python3.7/site-packages/dataflow_worker/executor.py", > line 263, in _perform_source_split > for split in source.split(desired_bundle_size): > File "/usr/local/lib/python3.7/site-packages/apache_beam/io/mongodbio.py", > line 174, in split > bundle_end = min(stop_position, split_key_id) > TypeError: '<' not supported between instances of 'dict' and 'ObjectId' > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
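The traceback above is the standard Python 3 behavior of refusing ordering comparisons between unrelated types, here a dict filter vs. a bson.ObjectId inside mongodbio's split(). A minimal illustration that does not require bson; `FakeObjectId` is a hypothetical stand-in class, not part of the bson package:

```python
# Reproduce the class of error from the traceback: min() applies '<' to its
# arguments, and Python 3 raises TypeError for unrelated, unorderable types.
class FakeObjectId:
    def __init__(self, value):
        self.value = value  # no __lt__ defined, so ordering comparisons fail

stop_position = {'$max': 'key'}   # a dict boundary, as in the failing split
split_key_id = FakeObjectId(42)

try:
    bundle_end = min(stop_position, split_key_id)  # triggers the '<' comparison
except TypeError as exc:
    print(exc)  # "'<' not supported between instances of ..."
```

The general remedy is to normalize both split boundaries to a single comparable type before calling min(), rather than letting raw dict filters flow into the comparison.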
[jira] [Updated] (BEAM-8944) Python SDK harness performance degradation with UnboundedThreadPoolExecutor
[ https://issues.apache.org/jira/browse/BEAM-8944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yichi Zhang updated BEAM-8944: -- Issue Type: Test (was: Bug) > Python SDK harness performance degradation with UnboundedThreadPoolExecutor > --- > > Key: BEAM-8944 > URL: https://issues.apache.org/jira/browse/BEAM-8944 > Project: Beam > Issue Type: Test > Components: sdk-py-harness >Affects Versions: 2.18.0 >Reporter: Yichi Zhang >Priority: Major > Attachments: profiling.png, profiling_one_thread.png, > profiling_twelve_threads.png > > > We are seeing a performance degradation for python streaming word count load > tests. > > After some investigation, it appears to be caused by swapping the original > ThreadPoolExecutor to UnboundedThreadPoolExecutor in sdk worker. Suspicion is > that python performance is worse with more threads on cpu-bounded tasks. > > A simple test for comparing the multiple thread pool executor performance: > > {code:python} > def test_performance(self): > def run_perf(executor): > total_number = 100 > q = queue.Queue() > def task(number): > hash(number) > q.put(number + 200) > return number > t = time.time() > count = 0 > for i in range(200): > q.put(i) > while count < total_number: > executor.submit(task, q.get(block=True)) > count += 1 > print('%s uses %s' % (executor, time.time() - t)) > with UnboundedThreadPoolExecutor() as executor: > run_perf(executor) > with futures.ThreadPoolExecutor(max_workers=1) as executor: > run_perf(executor) > with futures.ThreadPoolExecutor(max_workers=12) as executor: > run_perf(executor) > {code} > Results: > UnboundedThreadPoolExecutor (object at 0x7fab400dbe50) uses 268.160675049 > ThreadPoolExecutor(max_workers=1) uses 79.904583931 > ThreadPoolExecutor(max_workers=12) uses 191.179054976 > Profiling: > UnboundedThreadPoolExecutor: > !profiling.png! > 1 Thread ThreadPoolExecutor: > !profiling_one_thread.png! > 12 Threads ThreadPoolExecutor: > !profiling_twelve_threads.png! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (BEAM-8944) Python SDK harness performance degradation with UnboundedThreadPoolExecutor
[ https://issues.apache.org/jira/browse/BEAM-8944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16994104#comment-16994104 ] Yichi Zhang commented on BEAM-8944: --- It seems the original change was made to solve a deadlock/stuckness issue. While the use of UnboundedThreadPoolExecutor does noticeably impact pipelines that saturate CPU usage (~90%), it has less effect on underloaded pipelines. > Python SDK harness performance degradation with UnboundedThreadPoolExecutor > --- > > Key: BEAM-8944 > URL: https://issues.apache.org/jira/browse/BEAM-8944 > Project: Beam > Issue Type: Bug > Components: sdk-py-harness >Affects Versions: 2.18.0 >Reporter: Yichi Zhang >Priority: Major > Attachments: profiling.png, profiling_one_thread.png, > profiling_twelve_threads.png > > > We are seeing a performance degradation for python streaming word count load > tests. > > After some investigation, it appears to be caused by swapping the original > ThreadPoolExecutor to UnboundedThreadPoolExecutor in sdk worker. Suspicion is > that python performance is worse with more threads on cpu-bounded tasks. 
> > A simple test for comparing the multiple thread pool executor performance: > > {code:python} > def test_performance(self): > def run_perf(executor): > total_number = 100 > q = queue.Queue() > def task(number): > hash(number) > q.put(number + 200) > return number > t = time.time() > count = 0 > for i in range(200): > q.put(i) > while count < total_number: > executor.submit(task, q.get(block=True)) > count += 1 > print('%s uses %s' % (executor, time.time() - t)) > with UnboundedThreadPoolExecutor() as executor: > run_perf(executor) > with futures.ThreadPoolExecutor(max_workers=1) as executor: > run_perf(executor) > with futures.ThreadPoolExecutor(max_workers=12) as executor: > run_perf(executor) > {code} > Results: > UnboundedThreadPoolExecutor (object at 0x7fab400dbe50) uses 268.160675049 > ThreadPoolExecutor(max_workers=1) uses 79.904583931 > ThreadPoolExecutor(max_workers=12) uses 191.179054976 > Profiling: > UnboundedThreadPoolExecutor: > !profiling.png! > 1 Thread ThreadPoolExecutor: > !profiling_one_thread.png! > 12 Threads ThreadPoolExecutor: > !profiling_twelve_threads.png! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-8944) Python SDK harness performance degradation with UnboundedThreadPoolExecutor
[ https://issues.apache.org/jira/browse/BEAM-8944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yichi Zhang updated BEAM-8944: -- Description: We are seeing a performance degradation for python streaming word count load tests. After some investigation, it appears to be caused by swapping the original ThreadPoolExecutor to UnboundedThreadPoolExecutor in sdk worker. Suspicion is that python performance is worse with more threads on cpu-bounded tasks. A simple test for comparing the multiple thread pool executor performance: {code:python} def test_performance(self): def run_perf(executor): total_number = 100 q = queue.Queue() def task(number): hash(number) q.put(number + 200) return number t = time.time() count = 0 for i in range(200): q.put(i) while count < total_number: executor.submit(task, q.get(block=True)) count += 1 print('%s uses %s' % (executor, time.time() - t)) with UnboundedThreadPoolExecutor() as executor: run_perf(executor) with futures.ThreadPoolExecutor(max_workers=1) as executor: run_perf(executor) with futures.ThreadPoolExecutor(max_workers=12) as executor: run_perf(executor) {code} Results: UnboundedThreadPoolExecutor (object at 0x7fab400dbe50) uses 268.160675049 ThreadPoolExecutor(max_workers=1) uses 79.904583931 ThreadPoolExecutor(max_workers=12) uses 191.179054976 Profiling: UnboundedThreadPoolExecutor: !profiling.png! 1 Thread ThreadPoolExecutor: !profiling_one_thread.png! 12 Threads ThreadPoolExecutor: !profiling_twelve_threads.png! was: We are seeing a performance degradation for python streaming word count load tests. After some investigation, it appears to be caused by swapping the original ThreadPoolExecutor to UnboundedThreadPoolExecutor in sdk worker. Suspicion is that python performance is worse with more threads on cpu-bounded tasks. 
A simple test for comparing the multiple thread pool executor performance: def test_performance(self): def run_perf(executor): total_number = 100 q = queue.Queue() def task(number): hash(number) q.put(number + 200) return number t = time.time() count = 0 for i in range(200): q.put(i) while count < total_number: executor.submit(task, q.get(block=True)) count += 1 print('%s uses %s' % (executor, time.time() - t)) with UnboundedThreadPoolExecutor() as executor: run_perf(executor) with futures.ThreadPoolExecutor(max_workers=1) as executor: run_perf(executor) with futures.ThreadPoolExecutor(max_workers=12) as executor: run_perf(executor) Results: UnboundedThreadPoolExecutor (object at 0x7fab400dbe50) uses 268.160675049 ThreadPoolExecutor(max_workers=1) uses 79.904583931 ThreadPoolExecutor(max_workers=12) uses 191.179054976 Profiling: UnboundedThreadPoolExecutor: !profiling.png! 1 Thread ThreadPoolExecutor: !profiling_one_thread.png! 12 Threads ThreadPoolExecutor: !profiling_twelve_threads.png! > Python SDK harness performance degradation with UnboundedThreadPoolExecutor > --- > > Key: BEAM-8944 > URL: https://issues.apache.org/jira/browse/BEAM-8944 > Project: Beam > Issue Type: Bug > Components: sdk-py-harness >Affects Versions: 2.18.0 >Reporter: Yichi Zhang >Priority: Major > Attachments: profiling.png, profiling_one_thread.png, > profiling_twelve_threads.png > > > We are seeing a performance degradation for python streaming word count load > tests. > > After some investigation, it appears to be caused by swapping the original > ThreadPoolExecutor to UnboundedThreadPoolExecutor in sdk worker. Suspicion is > that python performance is worse with more threads on cpu-bounded tasks. 
> > A simple test for comparing the multiple thread pool executor performance: > > {code:python} > def test_performance(self): > def run_perf(executor): > total_number = 100 > q = queue.Queue() > def task(number): > hash(number) > q.put(number + 200) > return number > t = time.time() > count = 0 > for i in range(200): > q.put(i) > while count < total_number: > executor.submit(task, q.get(block=True)) > count += 1 > print('%s uses %s' % (executor, time.time() - t)) > with UnboundedThreadPoolExecutor() as executor: > run_perf(executor) > with futures.ThreadPoolExecutor(max_workers=1) as executor: > run_perf(executor) > with futures.ThreadPoolExecutor(max_workers=12) as executor: > run_perf(executor) > {code} > Results: > UnboundedThreadPoolExecutor (object at 0x7fab400dbe50) uses 268.160675049 > ThreadPoolExecutor(max_workers=1) uses 79.904583931 > ThreadPoolExecutor(max_workers=12) uses 191.179054976 > Profiling: > UnboundedThreadPoolExecutor: > !profiling.png! > 1 Thread ThreadPoolExecutor: > !profiling_one_thread.png! > 12 Threads ThreadPoolExecutor: > !profiling_twelve_threads.png! -- This message was sent by Atlassian Jira
[jira] [Updated] (BEAM-8944) Python SDK harness performance degradation with UnboundedThreadPoolExecutor
[ https://issues.apache.org/jira/browse/BEAM-8944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yichi Zhang updated BEAM-8944: -- Attachment: profiling_twelve_threads.png > Python SDK harness performance degradation with UnboundedThreadPoolExecutor > --- > > Key: BEAM-8944 > URL: https://issues.apache.org/jira/browse/BEAM-8944 > Project: Beam > Issue Type: Bug > Components: sdk-py-harness >Affects Versions: 2.18.0 >Reporter: Yichi Zhang >Priority: Major > Attachments: profiling.png, profiling_one_thread.png, > profiling_twelve_threads.png > > > We are seeing a performance degradation for python streaming word count load > tests. > > After some investigation, it appears to be caused by swapping the original > ThreadPoolExecutor to UnboundedThreadPoolExecutor in sdk worker. Suspicion is > that python performance is worse with more threads on cpu-bounded tasks. > > A simple test for comparing the multiple thread pool executor performance: > > def test_performance(self): > def run_perf(executor): > total_number = 100 > q = queue.Queue() > def task(number): > hash(number) > q.put(number + 200) > return number > t = time.time() > count = 0 > for i in range(200): > q.put(i) > while count < total_number: > executor.submit(task, q.get(block=True)) > count += 1 > print('%s uses %s' % (executor, time.time() - t)) > with UnboundedThreadPoolExecutor() as executor: > run_perf(executor) > with futures.ThreadPoolExecutor(max_workers=1) as executor: > run_perf(executor) > with futures.ThreadPoolExecutor(max_workers=12) as executor: > run_perf(executor) > Results: > UnboundedThreadPoolExecutor (object at 0x7fab400dbe50) uses 268.160675049 > ThreadPoolExecutor(max_workers=1) uses 79.904583931 > ThreadPoolExecutor(max_workers=12) uses 191.179054976 > Profiling: > UnboundedThreadPoolExecutor: > !profiling.png! > 1 Thread ThreadPoolExecutor: > !profiling_one_thread.png! > 12 Threads ThreadPoolExecutor: -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-8944) Python SDK harness performance degradation with UnboundedThreadPoolExecutor
[ https://issues.apache.org/jira/browse/BEAM-8944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yichi Zhang updated BEAM-8944: -- Description: We are seeing a performance degradation for python streaming word count load tests. After some investigation, it appears to be caused by swapping the original ThreadPoolExecutor to UnboundedThreadPoolExecutor in sdk worker. Suspicion is that python performance is worse with more threads on cpu-bounded tasks. A simple test for comparing the multiple thread pool executor performance: def test_performance(self): def run_perf(executor): total_number = 100 q = queue.Queue() def task(number): hash(number) q.put(number + 200) return number t = time.time() count = 0 for i in range(200): q.put(i) while count < total_number: executor.submit(task, q.get(block=True)) count += 1 print('%s uses %s' % (executor, time.time() - t)) with UnboundedThreadPoolExecutor() as executor: run_perf(executor) with futures.ThreadPoolExecutor(max_workers=1) as executor: run_perf(executor) with futures.ThreadPoolExecutor(max_workers=12) as executor: run_perf(executor) Results: UnboundedThreadPoolExecutor (object at 0x7fab400dbe50) uses 268.160675049 ThreadPoolExecutor(max_workers=1) uses 79.904583931 ThreadPoolExecutor(max_workers=12) uses 191.179054976 Profiling: UnboundedThreadPoolExecutor: !profiling.png! 1 Thread ThreadPoolExecutor: !profiling_one_thread.png! 12 Threads ThreadPoolExecutor: !profiling_twelve_threads.png! was: We are seeing a performance degradation for python streaming word count load tests. After some investigation, it appears to be caused by swapping the original ThreadPoolExecutor to UnboundedThreadPoolExecutor in sdk worker. Suspicion is that python performance is worse with more threads on cpu-bounded tasks. 
A simple test for comparing the multiple thread pool executor performance: def test_performance(self): def run_perf(executor): total_number = 100 q = queue.Queue() def task(number): hash(number) q.put(number + 200) return number t = time.time() count = 0 for i in range(200): q.put(i) while count < total_number: executor.submit(task, q.get(block=True)) count += 1 print('%s uses %s' % (executor, time.time() - t)) with UnboundedThreadPoolExecutor() as executor: run_perf(executor) with futures.ThreadPoolExecutor(max_workers=1) as executor: run_perf(executor) with futures.ThreadPoolExecutor(max_workers=12) as executor: run_perf(executor) Results: UnboundedThreadPoolExecutor (object at 0x7fab400dbe50) uses 268.160675049 ThreadPoolExecutor(max_workers=1) uses 79.904583931 ThreadPoolExecutor(max_workers=12) uses 191.179054976 Profiling: UnboundedThreadPoolExecutor: !profiling.png! 1 Thread ThreadPoolExecutor: !profiling_one_thread.png! 12 Threads ThreadPoolExecutor: > Python SDK harness performance degradation with UnboundedThreadPoolExecutor > --- > > Key: BEAM-8944 > URL: https://issues.apache.org/jira/browse/BEAM-8944 > Project: Beam > Issue Type: Bug > Components: sdk-py-harness >Affects Versions: 2.18.0 >Reporter: Yichi Zhang >Priority: Major > Attachments: profiling.png, profiling_one_thread.png, > profiling_twelve_threads.png > > > We are seeing a performance degradation for python streaming word count load > tests. > > After some investigation, it appears to be caused by swapping the original > ThreadPoolExecutor to UnboundedThreadPoolExecutor in sdk worker. Suspicion is > that python performance is worse with more threads on cpu-bounded tasks. 
> > A simple test for comparing the multiple thread pool executor performance: > > def test_performance(self): > def run_perf(executor): > total_number = 100 > q = queue.Queue() > def task(number): > hash(number) > q.put(number + 200) > return number > t = time.time() > count = 0 > for i in range(200): > q.put(i) > while count < total_number: > executor.submit(task, q.get(block=True)) > count += 1 > print('%s uses %s' % (executor, time.time() - t)) > with UnboundedThreadPoolExecutor() as executor: > run_perf(executor) > with futures.ThreadPoolExecutor(max_workers=1) as executor: > run_perf(executor) > with futures.ThreadPoolExecutor(max_workers=12) as executor: > run_perf(executor) > Results: > UnboundedThreadPoolExecutor (object at 0x7fab400dbe50) uses 268.160675049 > ThreadPoolExecutor(max_workers=1) uses 79.904583931 > ThreadPoolExecutor(max_workers=12) uses 191.179054976 > Profiling: > UnboundedThreadPoolExecutor: > !profiling.png! > 1 Thread ThreadPoolExecutor: > !profiling_one_thread.png! > 12 Threads ThreadPoolExecutor: > !profiling_twelve_threads.png! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-8944) Python SDK harness performance degradation with UnboundedThreadPoolExecutor
[ https://issues.apache.org/jira/browse/BEAM-8944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yichi Zhang updated BEAM-8944: -- Description: We are seeing a performance degradation for python streaming word count load tests. After some investigation, it appears to be caused by swapping the original ThreadPoolExecutor to UnboundedThreadPoolExecutor in sdk worker. Suspicion is that python performance is worse with more threads on cpu-bounded tasks. A simple test for comparing the multiple thread pool executor performance: def test_performance(self): def run_perf(executor): total_number = 100 q = queue.Queue() def task(number): hash(number) q.put(number + 200) return number t = time.time() count = 0 for i in range(200): q.put(i) while count < total_number: executor.submit(task, q.get(block=True)) count += 1 print('%s uses %s' % (executor, time.time() - t)) with UnboundedThreadPoolExecutor() as executor: run_perf(executor) with futures.ThreadPoolExecutor(max_workers=1) as executor: run_perf(executor) with futures.ThreadPoolExecutor(max_workers=12) as executor: run_perf(executor) Results: UnboundedThreadPoolExecutor (object at 0x7fab400dbe50) uses 268.160675049 ThreadPoolExecutor(max_workers=1) uses 79.904583931 ThreadPoolExecutor(max_workers=12) uses 191.179054976 Profiling: UnboundedThreadPoolExecutor: !profiling.png! 1 Thread ThreadPoolExecutor: !profiling_one_thread.png! 12 Threads ThreadPoolExecutor: was: We are seeing a performance degradation for python streaming word count load tests. After some investigation, it appears to be caused by swapping the original ThreadPoolExecutor to UnboundedThreadPoolExecutor in sdk worker. Suspicion is that python performance is worse with more threads on cpu-bounded tasks. 
A simple test for comparing the multiple thread pool executor performance: def test_performance(self): def run_perf(executor): total_number = 100 q = queue.Queue() def task(number): hash(number) q.put(number + 200) return number t = time.time() count = 0 for i in range(200): q.put(i) while count < total_number: executor.submit(task, q.get(block=True)) count += 1 print('%s uses %s' % (executor, time.time() - t)) with UnboundedThreadPoolExecutor() as executor: run_perf(executor) with futures.ThreadPoolExecutor(max_workers=1) as executor: run_perf(executor) with futures.ThreadPoolExecutor(max_workers=12) as executor: run_perf(executor) Results: UnboundedThreadPoolExecutor (object at 0x7fab400dbe50) uses 268.160675049 ThreadPoolExecutor(max_workers=1) uses 79.904583931 ThreadPoolExecutor(max_workers=12) uses 191.179054976 Profiling: UnboundedThreadPoolExecutor: !profiling.png! 1 Thread ThreadPoolExecutor: !profiling_one_thread.png! > Python SDK harness performance degradation with UnboundedThreadPoolExecutor > --- > > Key: BEAM-8944 > URL: https://issues.apache.org/jira/browse/BEAM-8944 > Project: Beam > Issue Type: Bug > Components: sdk-py-harness >Affects Versions: 2.18.0 >Reporter: Yichi Zhang >Priority: Major > Attachments: profiling.png, profiling_one_thread.png > > > We are seeing a performance degradation for python streaming word count load > tests. > > After some investigation, it appears to be caused by swapping the original > ThreadPoolExecutor to UnboundedThreadPoolExecutor in sdk worker. Suspicion is > that python performance is worse with more threads on cpu-bounded tasks. 
> > A simple test for comparing the multiple thread pool executor performance: > > def test_performance(self): > def run_perf(executor): > total_number = 100 > q = queue.Queue() > def task(number): > hash(number) > q.put(number + 200) > return number > t = time.time() > count = 0 > for i in range(200): > q.put(i) > while count < total_number: > executor.submit(task, q.get(block=True)) > count += 1 > print('%s uses %s' % (executor, time.time() - t)) > with UnboundedThreadPoolExecutor() as executor: > run_perf(executor) > with futures.ThreadPoolExecutor(max_workers=1) as executor: > run_perf(executor) > with futures.ThreadPoolExecutor(max_workers=12) as executor: > run_perf(executor) > Results: > UnboundedThreadPoolExecutor (object at 0x7fab400dbe50) uses 268.160675049 > ThreadPoolExecutor(max_workers=1) uses 79.904583931 > ThreadPoolExecutor(max_workers=12) uses 191.179054976 > Profiling: > UnboundedThreadPoolExecutor: > !profiling.png! > 1 Thread ThreadPoolExecutor: > !profiling_one_thread.png! > 12 Threads ThreadPoolExecutor: -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (BEAM-8944) Python SDK harness performance degradation with UnboundedThreadPoolExecutor
[ https://issues.apache.org/jira/browse/BEAM-8944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yichi Zhang updated BEAM-8944:
--
Description: We are seeing a performance degradation for python streaming word count load tests.

After some investigation, it appears to be caused by swapping the original ThreadPoolExecutor to UnboundedThreadPoolExecutor in the sdk worker. The suspicion is that python performance is worse with more threads on cpu-bound tasks.

A simple test for comparing the performance of the different thread pool executors:

```
def test_performance(self):
  def run_perf(executor):
    total_number = 100
    q = queue.Queue()

    def task(number):
      hash(number)
      q.put(number + 200)
      return number

    t = time.time()
    count = 0
    for i in range(200):
      q.put(i)
    while count < total_number:
      executor.submit(task, q.get(block=True))
      count += 1
    print('%s uses %s' % (executor, time.time() - t))

  with UnboundedThreadPoolExecutor() as executor:
    run_perf(executor)
  with futures.ThreadPoolExecutor(max_workers=1) as executor:
    run_perf(executor)
  with futures.ThreadPoolExecutor(max_workers=12) as executor:
    run_perf(executor)
```

Results, in the order the executors are run:

```
UnboundedThreadPoolExecutor (object at 0x7fab400dbe50) uses 268.160675049
ThreadPoolExecutor (max_workers=1) uses 79.904583931
ThreadPoolExecutor (max_workers=12) uses 191.179054976
```

Profiling:

UnboundedThreadPoolExecutor: !profiling.png!
1 Thread ThreadPoolExecutor: !profiling_one_thread.png!
12 Threads ThreadPoolExecutor:

> Python SDK harness performance degradation with UnboundedThreadPoolExecutor
> ---
>
> Key: BEAM-8944
> URL: https://issues.apache.org/jira/browse/BEAM-8944
> Project: Beam
> Issue Type: Bug
> Components: sdk-py-harness
> Affects Versions: 2.18.0
> Reporter: Yichi Zhang
> Priority: Major
> Attachments: profiling.png, profiling_one_thread.png
[jira] [Updated] (BEAM-8944) Python SDK harness performance degradation with UnboundedThreadPoolExecutor
[ https://issues.apache.org/jira/browse/BEAM-8944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yichi Zhang updated BEAM-8944:
--
Attachment: profiling_one_thread.png

> Python SDK harness performance degradation with UnboundedThreadPoolExecutor
> ---
>
> Key: BEAM-8944
> URL: https://issues.apache.org/jira/browse/BEAM-8944
> Project: Beam
> Issue Type: Bug
> Components: sdk-py-harness
> Affects Versions: 2.18.0
> Reporter: Yichi Zhang
> Priority: Major
> Attachments: profiling.png, profiling_one_thread.png
[jira] [Updated] (BEAM-8944) Python SDK harness performance degradation with UnboundedThreadPoolExecutor
[ https://issues.apache.org/jira/browse/BEAM-8944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yichi Zhang updated BEAM-8944:
--
Description: We are seeing a performance degradation for python streaming word count load tests. After some investigation, it appears to be caused by swapping the original ThreadPoolExecutor to UnboundedThreadPoolExecutor in sdk worker. Suspicion is that python performance is worse with more threads on cpu-bounded tasks.

> Python SDK harness performance degradation with UnboundedThreadPoolExecutor
> ---
>
> Key: BEAM-8944
> URL: https://issues.apache.org/jira/browse/BEAM-8944
> Project: Beam
> Issue Type: Bug
> Components: sdk-py-harness
> Affects Versions: 2.18.0
> Reporter: Yichi Zhang
> Priority: Major
> Attachments: profiling.png
[jira] [Updated] (BEAM-8944) Python SDK harness performance degradation with UnboundedThreadPoolExecutor
[ https://issues.apache.org/jira/browse/BEAM-8944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yichi Zhang updated BEAM-8944:
--
Description: We are seeing a performance degradation for python streaming word count load tests. After some investigation, it appears to be caused by swapping the original ThreadPoolExecutor to UnboundedThreadPoolExecutor in sdk worker. Suspicion is that python performance is worse with more threads on cpu-bounded tasks.

> Python SDK harness performance degradation with UnboundedThreadPoolExecutor
> ---
>
> Key: BEAM-8944
> URL: https://issues.apache.org/jira/browse/BEAM-8944
> Project: Beam
> Issue Type: Bug
> Components: sdk-py-harness
> Affects Versions: 2.18.0
> Reporter: Yichi Zhang
> Priority: Major
> Attachments: profiling.png
[jira] [Updated] (BEAM-8944) Python SDK harness performance degradation with UnboundedThreadPoolExecutor
[ https://issues.apache.org/jira/browse/BEAM-8944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yichi Zhang updated BEAM-8944:
--
Attachment: profiling.png

> Python SDK harness performance degradation with UnboundedThreadPoolExecutor
> ---
>
> Key: BEAM-8944
> URL: https://issues.apache.org/jira/browse/BEAM-8944
> Project: Beam
> Issue Type: Bug
> Components: sdk-py-harness
> Affects Versions: 2.18.0
> Reporter: Yichi Zhang
> Priority: Major
> Attachments: profiling.png
[jira] [Updated] (BEAM-8944) Python SDK harness performance degradation with UnboundedThreadPoolExecutor
[ https://issues.apache.org/jira/browse/BEAM-8944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yichi Zhang updated BEAM-8944:
--
Description: We are seeing a performance degradation for python streaming word count load tests. After some investigation, it appears to be caused by swapping the original ThreadPoolExecutor to UnboundedThreadPoolExecutor in sdk worker. Suspicion is that python performance is worse with more threads on cpu-bounded tasks.

> Python SDK harness performance degradation with UnboundedThreadPoolExecutor
> ---
>
> Key: BEAM-8944
> URL: https://issues.apache.org/jira/browse/BEAM-8944
> Project: Beam
> Issue Type: Bug
> Components: sdk-py-harness
> Affects Versions: 2.18.0
> Reporter: Yichi Zhang
> Priority: Major
> Attachments: profiling.png
[jira] [Updated] (BEAM-8944) Python SDK harness performance degradation with UnboundedThreadPoolExecutor
[ https://issues.apache.org/jira/browse/BEAM-8944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yichi Zhang updated BEAM-8944:
--
Affects Version/s: (was: 2.17.0)

> Python SDK harness performance degradation with UnboundedThreadPoolExecutor
> ---
>
> Key: BEAM-8944
> URL: https://issues.apache.org/jira/browse/BEAM-8944
> Project: Beam
> Issue Type: Bug
> Components: sdk-py-harness
> Affects Versions: 2.18.0
> Reporter: Yichi Zhang
> Priority: Major
[jira] [Comment Edited] (BEAM-8944) Python SDK harness performance degradation with UnboundedThreadPoolExecutor
[ https://issues.apache.org/jira/browse/BEAM-8944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16993141#comment-16993141 ] Yichi Zhang edited comment on BEAM-8944 at 12/11/19 2:28 AM:
-
CC: [~angoenka] [~lcwik]

was (Author: yichi): CC: [~angoenka]

> Python SDK harness performance degradation with UnboundedThreadPoolExecutor
> ---
>
> Key: BEAM-8944
> URL: https://issues.apache.org/jira/browse/BEAM-8944
> Project: Beam
> Issue Type: Bug
> Components: sdk-py-harness
> Affects Versions: 2.17.0, 2.18.0
> Reporter: Yichi Zhang
> Priority: Major
[jira] [Commented] (BEAM-8944) Python SDK harness performance degradation with UnboundedThreadPoolExecutor
[ https://issues.apache.org/jira/browse/BEAM-8944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16993141#comment-16993141 ] Yichi Zhang commented on BEAM-8944:
---
CC: [~angoenka]

> Python SDK harness performance degradation with UnboundedThreadPoolExecutor
> ---
>
> Key: BEAM-8944
> URL: https://issues.apache.org/jira/browse/BEAM-8944
> Project: Beam
> Issue Type: Bug
> Components: sdk-py-harness
> Affects Versions: 2.17.0, 2.18.0
> Reporter: Yichi Zhang
> Priority: Major
[jira] [Updated] (BEAM-8944) Python SDK harness performance degradation with UnboundedThreadPoolExecutor
[ https://issues.apache.org/jira/browse/BEAM-8944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yichi Zhang updated BEAM-8944:
--
Description: We are seeing a performance degradation for python streaming word count load tests. After some investigation, it appears to be caused by swapping the original ThreadPoolExecutor to UnboundedThreadPoolExecutor in sdk worker. Suspicion is that python performance is worse with more threads on cpu-bounded tasks.

> Python SDK harness performance degradation with UnboundedThreadPoolExecutor
> ---
>
> Key: BEAM-8944
> URL: https://issues.apache.org/jira/browse/BEAM-8944
> Project: Beam
> Issue Type: Bug
> Components: sdk-py-harness
> Affects Versions: 2.17.0, 2.18.0
> Reporter: Yichi Zhang
> Priority: Major
[jira] [Updated] (BEAM-8944) Python SDK harness performance degradation with UnboundedThreadPoolExecutor
[ https://issues.apache.org/jira/browse/BEAM-8944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yichi Zhang updated BEAM-8944:
--
Description: We are seeing a performance degradation for python streaming word count load tests. After some investigation, it appears to be caused by swapping the original ThreadPoolExecutor to UnboundedThreadPoolExecutor in sdk worker. Suspicion is that python performance is worse with more threads on cpu-bounded tasks.

> Python SDK harness performance degradation with UnboundedThreadPoolExecutor
> ---
>
> Key: BEAM-8944
> URL: https://issues.apache.org/jira/browse/BEAM-8944
> Project: Beam
> Issue Type: Bug
> Components: sdk-py-harness
> Affects Versions: 2.17.0, 2.18.0
> Reporter: Yichi Zhang
> Priority: Major
[jira] [Updated] (BEAM-8944) Python SDK harness performance degradation with UnboundedThreadPoolExecutor
[ https://issues.apache.org/jira/browse/BEAM-8944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yichi Zhang updated BEAM-8944:
------------------------------
Description:
We are seeing a performance degradation for python streaming word count load tests. After some investigation, it appears to be caused by swapping the original ThreadPoolExecutor to UnboundedThreadPoolExecutor in sdk worker. Suspicion is that python performance is worse with more threads on cpu-bounded tasks.

A simple test for comparing the multiple thread pool executor performance:
```
def test_performance(self):
  def run_perf(executor):
    total_number = 100
    q = queue.Queue()
    def task(number):
      hash(number)
      q.put(number + 200)
      return number
    t = time.time()
    count = 0
    for i in range(200):
      q.put(i)
    while count < total_number:
      executor.submit(task, q.get(block=True))
      count += 1
    print('%s uses %s' % (executor, time.time() - t))
  with UnboundedThreadPoolExecutor() as executor:
    run_perf(executor)
  with futures.ThreadPoolExecutor(max_workers=1) as executor:
    run_perf(executor)
  with futures.ThreadPoolExecutor(max_workers=12) as executor:
    run_perf(executor)
```
Results:
```
<UnboundedThreadPoolExecutor object at ...> uses 268.160675049
<ThreadPoolExecutor object at ...> uses 79.904583931
<ThreadPoolExecutor object at ...> uses 191.179054976
```

was:
We are seeing a performance degradation for python streaming word count load tests. After some investigation, it appears to be caused by swapping the original ThreadPoolExecutor to UnboundedThreadPoolExecutor in sdk worker. Suspicion is that python performance is worse with more threads on cpu-bounded tasks.

A simple test for comparing the multiple thread pool executor performance:
```
def test_performance(self):
  def run_perf(executor):
    total_number = 100
    q = queue.Queue()
    def task(number):
      hash(number)
      q.put(number + 200)
      return number
    t = time.time()
    count = 0
    for i in range(200):
      q.put(i)
    while count < total_number:
      executor.submit(task, q.get(block=True))
      count += 1
    print('%s uses %s' % (executor, time.time() - t))
  with UnboundedThreadPoolExecutor() as executor:
    run_perf(executor)
  with futures.ThreadPoolExecutor(max_workers=1) as executor:
    run_perf(executor)
  with futures.ThreadPoolExecutor(max_workers=12) as executor:
    run_perf(executor)
```
Results:
```
<UnboundedThreadPoolExecutor object at ...> uses 268.160675049
<ThreadPoolExecutor object at ...> uses 79.904583931
<ThreadPoolExecutor object at ...> uses 191.179054976
```

> Python SDK harness performance degradation with UnboundedThreadPoolExecutor
> ---------------------------------------------------------------------------
>
> Key: BEAM-8944
> URL: https://issues.apache.org/jira/browse/BEAM-8944
> Project: Beam
> Issue Type: Bug
> Components: sdk-py-harness
> Affects Versions: 2.17.0, 2.18.0
> Reporter: Yichi Zhang
> Priority: Major
>
> We are seeing a performance degradation for python streaming word count load tests.
>
> After some investigation, it appears to be caused by swapping the original ThreadPoolExecutor to UnboundedThreadPoolExecutor in sdk worker. Suspicion is that python performance is worse with more threads on cpu-bounded tasks.
>
> A simple test for comparing the multiple thread pool executor performance:
>
> ```
> def test_performance(self):
>   def run_perf(executor):
>     total_number = 100
>     q = queue.Queue()
>     def task(number):
>       hash(number)
>       q.put(number + 200)
>       return number
>     t = time.time()
>     count = 0
>     for i in range(200):
>       q.put(i)
>     while count < total_number:
>       executor.submit(task, q.get(block=True))
>       count += 1
>     print('%s uses %s' % (executor, time.time() - t))
>   with UnboundedThreadPoolExecutor() as executor:
>     run_perf(executor)
>   with futures.ThreadPoolExecutor(max_workers=1) as executor:
>     run_perf(executor)
>   with futures.ThreadPoolExecutor(max_workers=12) as executor:
>     run_perf(executor)
> ```
> Results:
> ```
> <UnboundedThreadPoolExecutor object at 0x7fab400dbe50> uses 268.160675049
> <ThreadPoolExecutor object at ...> uses 79.904583931
> <ThreadPoolExecutor object at ...> uses 191.179054976
> ```
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (BEAM-8944) Python SDK harness performance degradation with UnboundedThreadPoolExecutor
Yichi Zhang created BEAM-8944:
------------------------------
Summary: Python SDK harness performance degradation with UnboundedThreadPoolExecutor
Key: BEAM-8944
URL: https://issues.apache.org/jira/browse/BEAM-8944
Project: Beam
Issue Type: Bug
Components: sdk-py-harness
Affects Versions: 2.17.0, 2.18.0
Reporter: Yichi Zhang

We are seeing a performance degradation for python streaming word count load tests. After some investigation, it appears to be caused by swapping the original ThreadPoolExecutor to UnboundedThreadPoolExecutor in sdk worker. Suspicion is that python performance is worse with more threads on cpu-bounded tasks.

A simple test for comparing the multiple thread pool executor performance:
```
def test_performance(self):
  def run_perf(executor):
    total_number = 100
    q = queue.Queue()
    def task(number):
      hash(number)
      q.put(number + 200)
      return number
    t = time.time()
    count = 0
    for i in range(200):
      q.put(i)
    while count < total_number:
      executor.submit(task, q.get(block=True))
      count += 1
    print('%s uses %s' % (executor, time.time() - t))
  with UnboundedThreadPoolExecutor() as executor:
    run_perf(executor)
  with futures.ThreadPoolExecutor(max_workers=1) as executor:
    run_perf(executor)
  with futures.ThreadPoolExecutor(max_workers=12) as executor:
    run_perf(executor)
```
Results:
```
<UnboundedThreadPoolExecutor object at ...> uses 268.160675049
<ThreadPoolExecutor object at ...> uses 79.904583931
<ThreadPoolExecutor object at ...> uses 191.179054976
```
-- This message was sent by Atlassian Jira (v8.3.4#803005)
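The effect reported above (CPU-bound work slowing down as threads are added) can be reproduced with the standard library alone. A minimal sketch, with two assumptions: Beam's internal UnboundedThreadPoolExecutor is approximated here by a ThreadPoolExecutor with a large fixed worker count, and the CPU-bound task is a simple hashing loop rather than the real SDK harness workload:

```python
# Sketch of the thread-count comparison, using only concurrent.futures.
# Assumption: a ThreadPoolExecutor with many workers stands in for the
# Beam-internal UnboundedThreadPoolExecutor. Because of CPython's GIL,
# extra threads add contention but no parallelism on CPU-bound tasks,
# so the single-worker pool is typically the fastest of the three.
import time
from concurrent import futures


def cpu_task(n):
    # Purely CPU-bound work: repeated hashing, no I/O that would
    # release the GIL and let other threads make progress.
    for i in range(10000):
        n = hash((n, i))
    return n


def run_perf(executor, total=200):
    t = time.time()
    results = list(executor.map(cpu_task, range(total)))
    print('%s uses %.3fs' % (executor, time.time() - t))
    return results


if __name__ == '__main__':
    # 100 workers approximates an "unbounded" pool for this workload.
    for workers in (1, 12, 100):
        with futures.ThreadPoolExecutor(max_workers=workers) as executor:
            run_perf(executor)
```

Absolute timings vary by machine, but the ordering tends to match the results in the report: one worker beats twelve, and twelve beats the effectively unbounded pool.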
[jira] [Assigned] (BEAM-8580) Request Python API to support windows ClosingBehavior
[ https://issues.apache.org/jira/browse/BEAM-8580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yichi Zhang reassigned BEAM-8580: - Assignee: Yichi Zhang > Request Python API to support windows ClosingBehavior > - > > Key: BEAM-8580 > URL: https://issues.apache.org/jira/browse/BEAM-8580 > Project: Beam > Issue Type: New Feature > Components: sdk-py-core >Reporter: wendy liu >Assignee: Yichi Zhang >Priority: Major > > Beam Python should have an API to support windows ClosingBehavior. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (BEAM-8884) Python MongoDBIO TypeError when splitting
[ https://issues.apache.org/jira/browse/BEAM-8884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16987433#comment-16987433 ] Yichi Zhang edited comment on BEAM-8884 at 12/4/19 2:23 AM: Created another Jira https://issues.apache.org/jira/browse/BEAM-8886 for a follow up to cover the failure scenario in test. was (Author: yichi): Created another Jira https://issues.apache.org/jira/browse/BEAM-8886 for a follow up to cover the failure scenario. > Python MongoDBIO TypeError when splitting > - > > Key: BEAM-8884 > URL: https://issues.apache.org/jira/browse/BEAM-8884 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Brian Hulette >Assignee: Yichi Zhang >Priority: Major > Time Spent: 20m > Remaining Estimate: 0h > > From [slack|https://the-asf.slack.com/archives/CBDNLQZM1/p1575350991134000]: > I am trying to run a pipeline (defined with the Python SDK) on Dataflow that > uses beam.io.ReadFromMongoDB. When dealing with very small datasets (<10mb) > it runs fine, when trying to run it with slightly larger datasets (70mb), I > always get this error: > {code:} > TypeError: '<' not supported between instances of 'dict' and 'ObjectId' > {code} > Stack trace see below. Running it on a local machine works just fine. I would > highly appreciate any pointers what this could be. > I hope this is the right channel do address this. 
> {code}
> Traceback (most recent call last):
>   File "/usr/local/lib/python3.7/site-packages/dataflow_worker/batchworker.py", line 649, in do_work
>     work_executor.execute()
>   File "/usr/local/lib/python3.7/site-packages/dataflow_worker/executor.py", line 218, in execute
>     self._split_task)
>   File "/usr/local/lib/python3.7/site-packages/dataflow_worker/executor.py", line 226, in _perform_source_split_considering_api_limits
>     desired_bundle_size)
>   File "/usr/local/lib/python3.7/site-packages/dataflow_worker/executor.py", line 263, in _perform_source_split
>     for split in source.split(desired_bundle_size):
>   File "/usr/local/lib/python3.7/site-packages/apache_beam/io/mongodbio.py", line 174, in split
>     bundle_end = min(stop_position, split_key_id)
> TypeError: '<' not supported between instances of 'dict' and 'ObjectId'
> {code}
-- This message was sent by Atlassian Jira (v8.3.4#803005)
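The failing comparison at the bottom of that trace can be reproduced without a MongoDB dependency. A minimal sketch, where FakeObjectId is a hypothetical stand-in for bson.ObjectId (like ObjectId, it defines no ordering against dict, so Python 3 refuses the '<' that min() performs):

```python
# Sketch of the BEAM-8884 failure mode. FakeObjectId is a hypothetical
# stand-in for bson.ObjectId; the bug was that one operand of min() was a
# dict-style position while the other was an ObjectId, and Python 3 raises
# TypeError when ordering unrelated types.

class FakeObjectId:
    """Stand-in for bson.ObjectId; defines no ordering against dict."""
    def __init__(self, oid):
        self.oid = oid


stop_position = {'_id': FakeObjectId('ffffffffffffffffffffffff')}
split_key_id = FakeObjectId('5de000000000000000000000')

try:
    # Same shape as the failing line in mongodbio.split:
    bundle_end = min(stop_position, split_key_id)
except TypeError as e:
    print(e)  # TypeError, as in the Dataflow stack trace above
```

The fix direction implied by the trace is to normalize both bounds to a single comparable type before calling min().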