[ 
https://issues.apache.org/jira/browse/SPARK-33531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mori[A]rty updated SPARK-33531:
-------------------------------
    Description: 
Currently, when invoking CollectLimitExec#executeToIterator, a single-partition 
ShuffledRowRDD containing all parent partitions is created, and Spark computes 
all of these partitions to get the result.
In most cases, however, computing only the first few partitions is enough to 
produce the result, which takes much less time.

When running a Spark Thrift Server with spark.sql.thriftServer.incrementalCollect 
enabled, the large number of shuffle tasks leads to a significant performance 
issue for SQL queries terminated with LIMIT.

A possible improvement may be as follows (a rough Scala sketch follows the list):
 # Create a ShuffledRowRDD containing only the first parent partition.
 # Collect the rows of this ShuffledRowRDD to the driver.
 # If the number of collected rows is less than the limit, create the next 
ShuffledRowRDD over several of the following parent partitions. The number of 
parent partitions is calculated the same way as in SparkPlan#executeTake.
 # Repeat steps 2-3 until the total number of collected rows reaches the limit 
or all parent partitions have been computed.
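
For illustration, here is a rough driver-side sketch of this strategy. It is written 
against plain RDDs rather than the ShuffledRowRDD/InternalRow machinery that 
CollectLimitExec actually uses; the helper name incrementalTake and the hard-coded 
scale-up factor of 4 are assumptions made for the sketch, not existing Spark APIs 
(SparkPlan#executeTake reads the factor from spark.sql.limit.scaleUpFactor).

{code:scala}
import scala.collection.mutable.ArrayBuffer
import scala.reflect.ClassTag

import org.apache.spark.rdd.RDD

// Illustrative helper only; the real change would live inside CollectLimitExec.
def incrementalTake[T: ClassTag](rdd: RDD[T], limit: Int): Array[T] = {
  val sc = rdd.sparkContext
  val totalParts = rdd.partitions.length
  // Assumed fixed growth factor for the sketch; executeTake uses
  // spark.sql.limit.scaleUpFactor (default 4).
  val scaleUpFactor = 4
  val buf = new ArrayBuffer[T]
  var partsScanned = 0
  var numPartsToTry = 1
  while (buf.length < limit && partsScanned < totalParts) {
    // Steps 1-2: compute only the next batch of parent partitions and
    // collect their rows to the driver.
    val partsToScan =
      partsScanned until math.min(partsScanned + numPartsToTry, totalParts)
    val left = limit - buf.length
    val results = sc.runJob(
      rdd,
      (it: Iterator[T]) => it.take(left).toArray,
      partsToScan)
    results.foreach(buf ++= _)
    partsScanned += partsToScan.size
    // Step 3: not enough rows yet, so scan a larger batch of following
    // partitions on the next pass, as SparkPlan#executeTake does.
    numPartsToTry *= scaleUpFactor
  }
  buf.take(limit).toArray
}
{code}

In spirit this mirrors RDD#take / SparkPlan#executeTake: jobs are run only over the 
partitions actually needed, and the batch size grows when the limit has not yet been 
reached.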

  was:
Use a new method SparkPlan#executeTakeToIterator to implement 
CollectLimitExec#executeToIterator, avoiding the shuffle caused by invoking the 
parent method SparkPlan#executeToIterator.

When running a Spark Thrift Server with spark.sql.thriftServer.incrementalCollect 
enabled, the extra shuffle leads to a significant performance issue for SQL 
queries terminated with LIMIT.


> [SQL] Reduce shuffle task number when calling 
> CollectLimitExec#executeToIterator
> --------------------------------------------------------------------------------
>
>                 Key: SPARK-33531
>                 URL: https://issues.apache.org/jira/browse/SPARK-33531
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.0, 3.0.1
>            Reporter: Mori[A]rty
>            Priority: Major
>
> Currently, when invoking CollectLimitExec#executeToIterator, a 
> single-partition ShuffledRowRDD containing all parent partitions is created, 
> and Spark computes all of these partitions to get the result.
> In most cases, however, computing only the first few partitions is enough to 
> produce the result, which takes much less time.
> When running a Spark Thrift Server with 
> spark.sql.thriftServer.incrementalCollect enabled, the large number of 
> shuffle tasks leads to a significant performance issue for SQL queries 
> terminated with LIMIT.
> A possible improvement may be as follows:
>  # Create a ShuffledRowRDD containing only the first parent partition.
>  # Collect the rows of this ShuffledRowRDD to the driver.
>  # If the number of collected rows is less than the limit, create the next 
> ShuffledRowRDD over several of the following parent partitions. The number 
> of parent partitions is calculated the same way as in SparkPlan#executeTake.
>  # Repeat steps 2-3 until the total number of collected rows reaches the 
> limit or all parent partitions have been computed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
