[ 
https://issues.apache.org/jira/browse/BEAM-10774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17548788#comment-17548788
 ] 

Danny McCormick commented on BEAM-10774:
----------------------------------------

This issue has been migrated to https://github.com/apache/beam/issues/20403

> GBK Python streaming load tests are too slow
> --------------------------------------------
>
>                 Key: BEAM-10774
>                 URL: https://issues.apache.org/jira/browse/BEAM-10774
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-dataflow, sdk-py-core, testing
>            Reporter: Kamil Wasilewski
>            Priority: P3
>
> The following GBK streaming test cases take too long on Dataflow:
>  
> 1) 2GB of 10-byte records
> 2) 2GB of 100-byte records
> 4) fanout 4 times with 2GB 10-byte records total
> 5) fanout 8 times with 2GB 10-byte records total
>  
> Each of them takes at least 1 hour to run, which is far too long for a single
> Jenkins job.
> Job definition:
> [https://github.com/apache/beam/blob/master/.test-infra/jenkins/job_LoadTests_GBK_Python.groovy]
> Test pipeline: 
> [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/testing/load_tests/group_by_key_test.py]
> These cases are probably too extreme. The first two involve grouping 20M
> unique keys, which is an expensive operation. One solution might be to rework
> the cases so that they are less demanding.
> Both the current production Dataflow runner and the new Dataflow Runner V2 
> were tested.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
