[GitHub] [spark] zhengruifeng commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
zhengruifeng commented on pull request #31468: URL: https://github.com/apache/spark/pull/31468#issuecomment-775778416

```
scala> spark.sql("CREATE TABLE t (key bigint, value string) USING parquet")
res0: org.apache.spark.sql.DataFrame = []

scala> spark.sql("SELECT COUNT(*) FROM t").head
res1: org.apache.spark.sql.Row = [0]

scala> val t = spark.table("t")
t: org.apache.spark.sql.DataFrame = [key: bigint, value: string]

scala> t.count
res2: Long = 0

scala> t.rdd.getNumPartitions
res3: Int = 0

scala> spark.sql(s"CACHE TABLE v1 AS SELECT * FROM t LIMIT 10")
java.util.NoSuchElementException: next on empty iterator
  at scala.collection.Iterator$$anon$2.next(Iterator.scala:41)
  at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
  at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63)
  at scala.collection.IterableLike.head(IterableLike.scala:109)
  at scala.collection.IterableLike.head$(IterableLike.scala:108)
  at scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$head(ArrayOps.scala:198)
  at scala.collection.IndexedSeqOptimized.head(IndexedSeqOptimized.scala:129)
  at scala.collection.IndexedSeqOptimized.head$(IndexedSeqOptimized.scala:129)
  at scala.collection.mutable.ArrayOps$ofRef.head(ArrayOps.scala:198)
  at org.apache.spark.sql.Dataset.$anonfun$count$1(Dataset.scala:3024)
  at org.apache.spark.sql.Dataset.$anonfun$count$1$adapted(Dataset.scala:3023)
  at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3705)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3703)
  at org.apache.spark.sql.Dataset.count(Dataset.scala:3023)
  at org.apache.spark.sql.execution.datasources.v2.BaseCacheTableExec.run(CacheTableExec.scala:65)
  at org.apache.spark.sql.execution.datasources.v2.BaseCacheTableExec.run$(CacheTableExec.scala:41)
  at org.apache.spark.sql.execution.datasources.v2.CacheTableAsSelectExec.run(CacheTableExec.scala:88)
  at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result$lzycompute(V2CommandExec.scala:42)
  at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result(V2CommandExec.scala:42)
  at org.apache.spark.sql.execution.datasources.v2.V2CommandExec.executeCollect(V2CommandExec.scala:48)
  at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:228)
  at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3705)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3703)
  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:228)
  at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:99)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:96)
  at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:615)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:772)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:610)
  ... 47 elided

scala> spark.sql("SELECT COUNT(*) FROM v1").head
java.util.NoSuchElementException: next on empty iterator
  at scala.collection.Iterator$$anon$2.next(Iterator.scala:41)
  at scala.collection.Iterator$$anon$2.next(Iterator.scala:39)
  at scala.collection.IndexedSeqLike$Elements.next(IndexedSeqLike.scala:63)
  at scala.collection.IterableLike.head(IterableLike.scala:109)
  at scala.collection.IterableLike.head$(IterableLike.scala:108)
  at scala.collection.mutable.ArrayOps$ofRef.scala$collection$IndexedSeqOptimized$$super$head(ArrayOps.scala:198)
  at scala.collection.IndexedSeqOptimized.head(IndexedSeqOptimized.scala:129)
  at scala.collection.IndexedSeqOptimized.head$(IndexedSeqOptimized.scala:129)
  at scala.collection.mutable.ArrayOps$ofRef.head(ArrayOps.scala:198)
  at org.apache.spark.sql.Dataset.head(Dataset.scala:2747)
```
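For readers skimming the archive, here is a hedged plain-RDD sketch of the idea in the PR title (not the actual `CollectLimitExec` patch): only shuffle down to a single partition when the input actually has more than one, keeping in mind that a 0-partition input, like the empty table above, simply yields no rows. It assumes a `spark-shell` session, so `sc` is the shell's SparkContext; `takeViaSinglePartition` is a hypothetical helper name used only for illustration.

```
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Hypothetical helper: take the first `limit` elements, shuffling to a single
// partition only when the input has more than one partition.
def takeViaSinglePartition[T: ClassTag](rdd: RDD[T], limit: Int): Array[T] = {
  val locallyLimited = rdd.mapPartitions(_.take(limit))        // limit within each partition first
  val single =
    if (locallyLimited.getNumPartitions <= 1) locallyLimited   // 0 or 1 partitions: no shuffle needed
    else locallyLimited.coalesce(1, shuffle = true)            // >1 partitions: gather into one partition
  single.mapPartitions(_.take(limit)).collect()                // final limit on the single partition
}

takeViaSinglePartition(sc.emptyRDD[Int], 10).length            // 0 -- the empty-table corner case above
takeViaSinglePartition(sc.parallelize(1 to 100, 4), 10).length // 10
```

Per the stack trace above, the exception comes from calling `.head` on an empty collected result, which is why the 0-partition case needs explicit care once the shuffle is skipped.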
[GitHub] [spark] zhengruifeng commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
zhengruifeng commented on pull request #31468: URL: https://github.com/apache/spark/pull/31468#issuecomment-775732036

retest this please
[GitHub] [spark] zhengruifeng commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
zhengruifeng commented on pull request #31468: URL: https://github.com/apache/spark/pull/31468#issuecomment-773112511

related to https://github.com/apache/spark/pull/31409
[GitHub] [spark] zhengruifeng commented on pull request #31468: [SPARK-34353][SQL] CollectLimitExec avoid shuffle if input rdd has single partition
zhengruifeng commented on pull request #31468: URL: https://github.com/apache/spark/pull/31468#issuecomment-773112265

> We may have more operators that adding shuffle in the doExecute method instead of the planner

@cloud-fan a shuffle is only added directly in the `doExecute` method of `TakeOrderedAndProjectExec`, `CollectLimitExec` and `ShuffleExchangeExec`
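To make that point concrete, a hedged `spark-shell` sketch (not from the PR) of how to see which shuffles are visible in the physical plan: exchanges inserted by the planner appear as `ShuffleExchangeExec` nodes, while a shuffle built inside an operator's `doExecute` never becomes a node in the plan tree. It reuses the table `t` created in the session above and assumes `spark` is the shell's session, on the Spark version discussed in this thread.

```
import org.apache.spark.sql.execution.exchange.ShuffleExchangeExec

// Physical plan of a bare LIMIT query over the table created earlier.
val plan = spark.sql("SELECT * FROM t LIMIT 10").queryExecution.executedPlan
println(plan)

// Planner-inserted exchanges show up as ShuffleExchangeExec nodes in the tree;
// the shuffle CollectLimitExec adds inside doExecute does not appear here.
val exchanges = plan.collect { case e: ShuffleExchangeExec => e }
println(s"ShuffleExchangeExec nodes in the plan: ${exchanges.size}")
```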