[ https://issues.apache.org/jira/browse/BEAM-4783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ahmet Altay updated BEAM-4783: ------------------------------ Affects Version/s: 2.8.0 Fix Version/s: 2.9.0 > Add bundleSize parameter to control splitting of Spark sources (useful for > Dynamic Allocation) > ---------------------------------------------------------------------------------------------- > > Key: BEAM-4783 > URL: https://issues.apache.org/jira/browse/BEAM-4783 > Project: Beam > Issue Type: Improvement > Components: runner-spark > Affects Versions: 2.8.0 > Reporter: Kyle Winkelman > Assignee: Kyle Winkelman > Priority: Major > Fix For: 2.8.0, 2.9.0 > > Time Spent: 5h > Remaining Estimate: 0h > > When the spark-runner is used along with the configuration > spark.dynamicAllocation.enabled=true the SourceRDD does not detect this. It > then falls back to the value calculated in this description: > // when running on YARN/SparkDeploy it's the result of max(totalCores, > 2). > // when running on Mesos it's 8. > // when running local it's the total number of cores (local = 1, > local[N] = N, > // local[*] = estimation of the machine's cores). > // ** the configuration "spark.default.parallelism" takes precedence > over all of the above ** > So in most cases this default is quite small. This is an issue when using a > very large input file as it will only get split in half. > I believe that when Dynamic Allocation is enable the SourceRDD should use the > DEFAULT_BUNDLE_SIZE and possibly expose a SparkPipelineOptions that allows > you to change this DEFAULT_BUNDLE_SIZE. -- This message was sent by Atlassian JIRA (v7.6.3#76005)