Re: [PR] feat: Support CollectLimit operator [arrow-datafusion-comet]

via GitHub Tue, 27 Feb 2024 17:28:15 -0800


advancedxy commented on code in PR #100:
URL: 
https://github.com/apache/arrow-datafusion-comet/pull/100#discussion_r1505218764



##########
spark/src/main/scala/org/apache/comet/CometSparkSessionExtensions.scala:
##########
@@ -451,6 +468,23 @@ class CometSparkSessionExtensions
       }
     }
   }
+
+  // CometExec already wraps a `ColumnarToRowExec` for row-based operators. 
Therefore,
+  // `ColumnarToRowExec` is redundant and can be eliminated.
+  //
+  // It was added during ApplyColumnarRulesAndInsertTransitions' 
insertTransitions phase when Spark
+  // requests row-based output such as `collect` call. It's correct to add a 
redundant
+  // `ColumnarToRowExec` for `CometExec`. However, for certain operators such 
as
+  // `CometCollectLimitExec` which overrides `executeCollect`, the redundant 
`ColumnarToRowExec`
+  // makes the override ineffective. The purpose of this rule is to eliminate 
the redundant
+  // `ColumnarToRowExec` for such operators.
+  case class EliminateRedundantColumnarToRow(session: SparkSession) extends 
Rule[SparkPlan] {

Review Comment:
   > I see, so if we have the extra ColumnarToRowExec, the code will go through 
its executeCollect instead which will call doExecuteÇolumnar, instead of 
calling the executeCollect in the CollectLimitExec itself.
   
   Yeah, exactly.
   
   > I think we can probably do the same for CometTakeOrderedAndProjectExec too 
- Spark has an executeCollect implementation for this too.
   
   I checked the implementation of `TakeOrderedAndProjectExec.executeCollect` 
when reviewing `CometTakeOrderedAndProjectExec`, it still shuffles all data 
into a single partition which is necessary to satisfy the ordering semantic. 
Hence it's not necessary to do the same for `CometTakeOrderedAndProjectExec`.
   
   > However, I don't know how useful it is since executeCollect is not often 
used
   
   It's used in API/df scenarios, it's quite often for data scientists to 
collect and explore the data via `collect` with limit set. For the pure SQL and 
ETL scenario, I believe it's rarely used. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] feat: Support CollectLimit operator [arrow-datafusion-comet]

Reply via email to