[ https://issues.apache.org/jira/browse/SPARK-40211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17584995#comment-17584995 ]
Apache Spark commented on SPARK-40211:
--------------------------------------

User 'liuzqt' has created a pull request for this issue:
https://github.com/apache/spark/pull/37661

> Allow executeTake() / collectLimit's number of starting partitions to be customized
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-40211
>                 URL: https://issues.apache.org/jira/browse/SPARK-40211
>             Project: Spark
>          Issue Type: Story
>          Components: Spark Core, SQL
>    Affects Versions: 3.4.0
>            Reporter: Ziqi Liu
>            Priority: Major
>
> Today, Spark's executeTake() code allows the limitScaleUpFactor to be customized but does not allow the initial number of partitions to be customized: it is currently hardcoded to {{1}}.
>
> We should add a configuration so that the initial partition count can be customized. By setting this new configuration to a high value, we could effectively mitigate the "run multiple jobs" overhead in {{take}} behavior. We could also set it to a higher-than-1-but-still-small value (say, {{10}}) as a middle-ground trade-off.
>
> Essentially, we need to make {{numPartsToTry = 1L}} ([code|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlan.scala#L481]) customizable. We should do this via a new SQL conf, similar to the {{limitScaleUpFactor}} conf.
>
> Spark has several near-duplicate versions of this code ([see code search|https://github.com/apache/spark/search?q=numPartsToTry+%3D+1]) in:
> * SparkPlan
> * RDD
> * PySpark RDD
>
> Also, {{limitScaleUpFactor}} is not supported in PySpark either. So for now I will focus on the Scala side first, leaving the Python side untouched, and meanwhile sync with the PySpark members. Depending on the progress, we can either do them all in one PR, or make the Scala-side change first and leave the PySpark change as a follow-up.
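To illustrate the trade-off the issue describes, here is a simplified, self-contained sketch of the scale-up loop in executeTake-style code. It is not Spark's actual implementation (the real loop also estimates partitions from rows seen so far); the parameter names `initialNumPartitions` and `rowsPerPart` are illustrative assumptions, and the function just returns the partition count each successive job would scan.

```scala
// Simplified model of take()'s incremental job launching:
// start by scanning `initialNumPartitions` partitions, and if the limit is
// not yet satisfied, grow the next job by `limitScaleUpFactor`.
object TakeScaleUp {
  def jobSizes(totalParts: Int,
               rowsNeeded: Int,
               rowsPerPart: Int,
               initialNumPartitions: Int = 1,   // hardcoded to 1 in Spark today
               limitScaleUpFactor: Int = 4): Seq[Int] = {
    var scanned = 0                              // partitions scanned so far
    var collected = 0                            // rows collected so far
    var numPartsToTry = initialNumPartitions.toLong
    val jobs = scala.collection.mutable.ArrayBuffer.empty[Int]
    while (collected < rowsNeeded && scanned < totalParts) {
      val p = math.min(numPartsToTry, (totalParts - scanned).toLong).toInt
      jobs += p                                  // one Spark job over p partitions
      scanned += p
      collected += p * rowsPerPart
      numPartsToTry *= limitScaleUpFactor        // grow the next attempt
    }
    jobs.toSeq
  }
}
```

With the default of 1 initial partition, a take(10) over sparse partitions (1 row each) launches three jobs (1, then 4, then 16 partitions); starting from 10 partitions satisfies the same limit in a single job, which is the overhead reduction the proposed conf targets.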
--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org