advancedxy commented on code in PR #100:
URL:
https://github.com/apache/arrow-datafusion-comet/pull/100#discussion_r1505218764
##########
spark/src/main/scala/org/apache/comet/CometSparkSessionExtensions.scala:
##########
@@ -451,6 +468,23 @@ class CometSparkSessionExtensions
}
}
}
+
+ // CometExec already wraps a `ColumnarToRowExec` for row-based operators.
Therefore,
+ // `ColumnarToRowExec` is redundant and can be eliminated.
+ //
+ // It was added during ApplyColumnarRulesAndInsertTransitions'
insertTransitions phase when Spark
+ // requests row-based output such as `collect` call. It's correct to add a
redundant
+ // `ColumnarToRowExec` for `CometExec`. However, for certain operators such
as
+ // `CometCollectLimitExec` which overrides `executeCollect`, the redundant
`ColumnarToRowExec`
+ // makes the override ineffective. The purpose of this rule is to eliminate
the redundant
+ // `ColumnarToRowExec` for such operators.
+ case class EliminateRedundantColumnarToRow(session: SparkSession) extends
Rule[SparkPlan] {
Review Comment:
> I see, so if we have the extra ColumnarToRowExec, the code will go through
its executeCollect instead which will call doExecuteÇolumnar, instead of
calling the executeCollect in the CollectLimitExec itself.
Yeah, exactly.
> I think we can probably do the same for CometTakeOrderedAndProjectExec too
- Spark has an executeCollect implementation for this too.
I checked the implementation of `TakeOrderedAndProjectExec.executeCollect`
when reviewing `CometTakeOrderedAndProjectExec`, it still shuffles all data
into a single partition which is necessary to satisfy the ordering semantic.
Hence it's not necessary to do the same for `CometTakeOrderedAndProjectExec`.
> However, I don't know how useful it is since executeCollect is not often
used
It's used in API/df scenarios, it's quite often for data scientists to
collect and explore the data via `collect` with limit set. For the pure SQL and
ETL scenario, I believe it's rarely used.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]