[
https://issues.apache.org/jira/browse/BEAM-10774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17548788#comment-17548788
]
Danny McCormick commented on BEAM-10774:
----------------------------------------
This issue has been migrated to https://github.com/apache/beam/issues/20403
> GBK Python streaming load tests are too slow
> --------------------------------------------
>
> Key: BEAM-10774
> URL: https://issues.apache.org/jira/browse/BEAM-10774
> Project: Beam
> Issue Type: Bug
> Components: runner-dataflow, sdk-py-core, testing
> Reporter: Kamil Wasilewski
> Priority: P3
>
> The following GBK streaming test cases take too long on Dataflow:
>
> 1) 2GB of 10B records
> 2) 2GB of 100B records
> 4) fanout 4 times with 2GB 10-byte records total
> 5) fanout 8 times with 2GB 10-byte records total
>
> Each of them takes at least an hour to execute, which is far too long for a
> single Jenkins job.
> Job's definition:
> [https://github.com/apache/beam/blob/master/.test-infra/jenkins/job_LoadTests_GBK_Python.groovy]
> Test pipeline:
> [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/testing/load_tests/group_by_key_test.py]
> These cases are probably too extreme. The first two involve grouping 20M
> unique keys, which is a stressful operation. One solution might be to
> redesign the cases so that they are less demanding.
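> To make the scale concrete, here is a minimal, Beam-free sketch of what the
> GBK cases exercise: grouping fixed-size synthetic records by key. The record
> and key counts below are illustrative stand-ins scaled down from the real
> cases (2GB of 10B records implies ~200M records over 20M unique keys), not
> the parameters used by the Jenkins jobs.
>
> ```python
> import random
> from collections import defaultdict
>
> # Illustrative, scaled-down sizes; the real 2GB/10B case implies
> # ~200M records (2GB / 10B) grouped under 20M unique keys.
> NUM_RECORDS = 200_000
> NUM_KEYS = 20_000
>
> def synthetic_records(num_records, num_keys, seed=42):
>     """Yield (key, payload) pairs mimicking fixed-size synthetic records."""
>     rng = random.Random(seed)
>     for _ in range(num_records):
>         yield rng.randrange(num_keys), b"x" * 10  # 10-byte payload
>
> def group_by_key(records):
>     """An in-memory stand-in for Beam's GroupByKey transform."""
>     groups = defaultdict(list)
>     for key, payload in records:
>         groups[key].append(payload)
>     return groups
>
> groups = group_by_key(synthetic_records(NUM_RECORDS, NUM_KEYS))
> print(len(groups), "keys,", sum(len(v) for v in groups.values()), "records")
> ```
>
> With far more records than keys, every key accumulates many values, so the
> grouping cost grows with both counts; shrinking either is the kind of
> overhaul proposed above.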
> Both the current production Dataflow runner and the new Dataflow Runner V2
> were tested.
--
This message was sent by Atlassian Jira
(v8.20.7#820007)