[ https://issues.apache.org/jira/browse/SPARK-40211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17584995#comment-17584995 ]
Apache Spark commented on SPARK-40211:
--------------------------------------

User 'liuzqt' has created a pull request for this issue:
https://github.com/apache/spark/pull/37661

> Allow executeTake() / collectLimit's number of starting partitions to be customized
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-40211
>                 URL: https://issues.apache.org/jira/browse/SPARK-40211
>             Project: Spark
>          Issue Type: Story
>          Components: Spark Core, SQL
>    Affects Versions: 3.4.0
>            Reporter: Ziqi Liu
>            Priority: Major
>
> Today, Spark's executeTake() code allows the limitScaleUpFactor to be customized but does not allow the initial number of partitions to be customized: it is currently hardcoded to {{1}}.
>
> We should add a configuration so that the initial partition count can be customized. By setting this new configuration to a high value, we could effectively mitigate the "run multiple jobs" overhead in {{take}} behavior. We could also set it to a higher-than-1-but-still-small value (say, {{10}}) as a middle-ground trade-off.
>
> Essentially, we need to make {{numPartsToTry = 1L}} ([code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L481]) customizable. We should do this via a new SQL conf, similar to the {{limitScaleUpFactor}} conf.
>
> Spark has several near-duplicate versions of this code ([see code search|https://github.com/apache/spark/search?q=numPartsToTry+%3D+1]) in:
> * SparkPlan
> * RDD
> * PySpark RDD
>
> Also, {{limitScaleUpFactor}} is not supported in PySpark either. So for now I will focus on the Scala side first, leaving the Python side untouched, and meanwhile sync with the PySpark members. Depending on the progress, we can either do them all in one PR, or make the Scala-side change first and leave the PySpark change as a follow-up.
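To illustrate the trade-off the issue describes, here is a simplified, self-contained sketch of the scale-up loop in executeTake-style code. It is not Spark's actual implementation (the real loop also estimates partitions from rows seen so far); the parameter names `initialNumPartitions` and `rowsPerPart` are illustrative assumptions, and the function just returns the partition count each successive job would scan.

```scala
// Simplified model of take()'s incremental job launching:
// start by scanning `initialNumPartitions` partitions, and if the limit is
// not yet satisfied, grow the next job by `limitScaleUpFactor`.
object TakeScaleUp {
  def jobSizes(totalParts: Int,
               rowsNeeded: Int,
               rowsPerPart: Int,
               initialNumPartitions: Int = 1,   // hardcoded to 1 in Spark today
               limitScaleUpFactor: Int = 4): Seq[Int] = {
    var scanned = 0                              // partitions scanned so far
    var collected = 0                            // rows collected so far
    var numPartsToTry = initialNumPartitions.toLong
    val jobs = scala.collection.mutable.ArrayBuffer.empty[Int]
    while (collected < rowsNeeded && scanned < totalParts) {
      val p = math.min(numPartsToTry, (totalParts - scanned).toLong).toInt
      jobs += p                                  // one Spark job over p partitions
      scanned += p
      collected += p * rowsPerPart
      numPartsToTry *= limitScaleUpFactor        // grow the next attempt
    }
    jobs.toSeq
  }
}
```

With the default of 1 initial partition, a take(10) over sparse partitions (1 row each) launches three jobs (1, then 4, then 16 partitions); starting from 10 partitions satisfies the same limit in a single job, which is the overhead reduction the proposed conf targets.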
--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org