viirya commented on code in PR #100:
URL:
https://github.com/apache/arrow-datafusion-comet/pull/100#discussion_r1506313439
##########
spark/src/main/scala/org/apache/comet/CometSparkSessionExtensions.scala:
##########
@@ -451,6 +468,23 @@ class CometSparkSessionExtensions
}
}
}
+
+ // CometExec already wraps a `ColumnarToRowExec` for row-based operators.
Therefore,
+ // `ColumnarToRowExec` is redundant and can be eliminated.
+ //
+ // It was added during ApplyColumnarRulesAndInsertTransitions'
insertTransitions phase when Spark
+ // requests row-based output such as `collect` call. It's correct to add a
redundant
+ // `ColumnarToRowExec` for `CometExec`. However, for certain operators such
as
+ // `CometCollectLimitExec` which overrides `executeCollect`, the redundant
`ColumnarToRowExec`
+ // makes the override ineffective. The purpose of this rule is to eliminate
the redundant
+ // `ColumnarToRowExec` for such operators.
+ case class EliminateRedundantColumnarToRow(session: SparkSession) extends
Rule[SparkPlan] {
+ override def apply(plan: SparkPlan): SparkPlan = {
+ plan.transform { case ColumnarToRowExec(child: CometCollectLimitExec) =>
+ child
Review Comment:
> I'd like to point out that `ColumnarToRowExec + CometCollectLimitExec`
will always be the end of the query as `CollectLimitExec` is the end of query.
Yes, this is usually the case. But I remember that at some special cases,
users can produce a query tree that some others operators on top of
`CollectLimitExec`. I think this is why `CollectLimitExec` still implements
`doExecute` not just `executeCollect`.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]