summaryzb commented on PR #2405:
URL: https://github.com/apache/uniffle/pull/2405#issuecomment-2739436830
> Are you talking about changes to Spark?

My initial idea was also to add a new rule. For the map side, adding new rules may work. But on the reduce side, whether a new SortExec is inserted depends on whether the distribution and the partitioning match, which is hard to achieve by adding a new rule alone.

The following applies to both the v1 and v2 datasource APIs of Spark:
```scala
plan match {
  case PhysicalOperation(_, _, _: DataSourceV2ScanRelation) =>
    new DataSourceV2Strategy(sparkSession).apply(plan).headOption match {
      case Some(head) => tryOptimize(head) :: Nil
      case _ => Nil
    }
  case PhysicalOperation(_, _, LogicalRelation(_: HadoopFsRelation, _, _, _)) =>
    FileSourceStrategy(plan).headOption match {
      case Some(head) => tryOptimize(head) :: Nil
      case _ => Nil
    }
  case _ => Nil
}
```
For the v2 datasource, tryOptimize extracts the scan from one of the following 4 plan shapes:
1. [[ProjectExec]] -> [[FilterExec]] -> [[BatchScanExec]]
2. [[ProjectExec]] -> [[BatchScanExec]]
3. [[FilterExec]] -> [[BatchScanExec]]
4. [[BatchScanExec]]
and overrides `protected def partitions` of FileScan to implement the partitioning logic.
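As a rough sketch of the extraction step (not the actual PR code; `tryOptimize` and `replaceScan` are hypothetical names here, and `replaceScan` stands in for whatever swaps in a Scan with the custom `partitions` override), the four v2 shapes above could be matched like this:

```scala
// Sketch only: assumes Spark's physical plan case classes
// (ProjectExec, FilterExec, BatchScanExec) and a hypothetical
// replaceScan(...) that returns a BatchScanExec whose FileScan
// overrides `protected def partitions`.
private def tryOptimize(plan: SparkPlan): SparkPlan = plan match {
  case p @ ProjectExec(_, f @ FilterExec(_, scan: BatchScanExec)) =>
    p.copy(child = f.copy(child = replaceScan(scan)))   // shape 1
  case p @ ProjectExec(_, scan: BatchScanExec) =>
    p.copy(child = replaceScan(scan))                   // shape 2
  case f @ FilterExec(_, scan: BatchScanExec) =>
    f.copy(child = replaceScan(scan))                   // shape 3
  case scan: BatchScanExec =>
    replaceScan(scan)                                   // shape 4
  case other => other                                   // leave anything else untouched
}
```

Because ProjectExec and FilterExec are case classes, `copy(child = ...)` rebuilds the node around the replaced scan without touching project lists or filter conditions.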
For the v1 datasource, tryOptimize extracts the scan from one of the following 4 plan shapes:
1. [[ProjectExec]] -> [[FilterExec]] -> [[FileSourceScanExec]]
2. [[ProjectExec]] -> [[FileSourceScanExec]]
3. [[FilterExec]] -> [[FileSourceScanExec]]
4. [[FileSourceScanExec]]
and replaces `private def createReadRDD`, which builds the inputRDD of FileSourceScanExec, to implement the partitioning logic.
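A hedged sketch of the partition-grouping idea that a replaced `createReadRDD` (or the v2 `partitions` override) could apply; the `partitionIdOf` helper and the file-to-reduce-partition mapping are assumptions for illustration, not the PR's actual code:

```scala
// Sketch only: instead of Spark's default size-based file packing,
// group the selected files so each FilePartition holds exactly the
// files belonging to one reduce-side partition, preserving the
// partitioning established on the map side.
// partitionIdOf is a hypothetical helper that recovers which reduce
// partition a file was written for.
def groupByReducePartition(
    files: Seq[PartitionedFile],
    partitionIdOf: PartitionedFile => Int): Seq[FilePartition] = {
  files
    .groupBy(partitionIdOf)
    .toSeq
    .sortBy(_._1)       // stable ordering by reduce partition id
    .map { case (id, fs) => FilePartition(id, fs.toArray) }
}
```

With file partitions aligned to the original shuffle partitions this way, the reduce side can report a matching output partitioning and avoid the extra SortExec/exchange.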
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]