summaryzb commented on PR #2405:
URL: https://github.com/apache/uniffle/pull/2405#issuecomment-2739436830
> Are you talking about changes to Spark?

My initial idea was also to add a new rule. For the map side, adding new rules may work. But on the reduce side, whether a new SortExec is inserted depends on whether the distribution and the partitioning match, which is hard to achieve by adding a new rule alone.

The following applies to both the v1 and v2 datasource APIs of Spark:
```scala
plan match {
  case PhysicalOperation(_, _, _: DataSourceV2ScanRelation) =>
    new DataSourceV2Strategy(sparkSession).apply(plan).headOption match {
      case Some(head) => tryOptimize(head) :: Nil
      case _ => Nil
    }
  case PhysicalOperation(_, _, LogicalRelation(_: HadoopFsRelation, _, _, _)) =>
    FileSourceStrategy(plan).headOption match {
      case Some(head) => tryOptimize(head) :: Nil
      case _ => Nil
    }
  case _ => Nil
}
```
For the v2 datasource, tryOptimize extracts the scan from one of the following 4 plan shapes:
1. [[ProjectExec]] -> [[FilterExec]] -> [[BatchScanExec]]
2. [[ProjectExec]] -> [[BatchScanExec]]
3. [[FilterExec]] -> [[BatchScanExec]]
4. [[BatchScanExec]]
and overrides `protected def partitions` of FileScan to implement the partitioning logic.
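As a rough sketch of the extraction step (not the actual PR code; `tryOptimize` and `replaceScan` are hypothetical names here, and `replaceScan` stands in for whatever swaps in a Scan with the custom `partitions` override), the four v2 shapes above could be matched like this:

```scala
// Sketch only: assumes Spark's physical plan case classes
// (ProjectExec, FilterExec, BatchScanExec) and a hypothetical
// replaceScan(...) that returns a BatchScanExec whose FileScan
// overrides `protected def partitions`.
private def tryOptimize(plan: SparkPlan): SparkPlan = plan match {
  case p @ ProjectExec(_, f @ FilterExec(_, scan: BatchScanExec)) =>
    p.copy(child = f.copy(child = replaceScan(scan)))   // shape 1
  case p @ ProjectExec(_, scan: BatchScanExec) =>
    p.copy(child = replaceScan(scan))                   // shape 2
  case f @ FilterExec(_, scan: BatchScanExec) =>
    f.copy(child = replaceScan(scan))                   // shape 3
  case scan: BatchScanExec =>
    replaceScan(scan)                                   // shape 4
  case other => other                                   // leave anything else untouched
}
```

Because ProjectExec and FilterExec are case classes, `copy(child = ...)` rebuilds the node around the replaced scan without touching project lists or filter conditions.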
For the v1 datasource, tryOptimize extracts the scan from one of the following 4 plan shapes:
1. [[ProjectExec]] -> [[FilterExec]] -> [[FileSourceScanExec]]
2. [[ProjectExec]] -> [[FileSourceScanExec]]
3. [[FilterExec]] -> [[FileSourceScanExec]]
4. [[FileSourceScanExec]]
and replaces `private def createReadRDD`, which builds the inputRDD of FileSourceScanExec, to implement the partitioning logic.
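A hedged sketch of the partition-grouping idea that a replaced `createReadRDD` (or the v2 `partitions` override) could apply; the `partitionIdOf` helper and the file-to-reduce-partition mapping are assumptions for illustration, not the PR's actual code:

```scala
// Sketch only: instead of Spark's default size-based file packing,
// group the selected files so each FilePartition holds exactly the
// files belonging to one reduce-side partition, preserving the
// partitioning established on the map side.
// partitionIdOf is a hypothetical helper that recovers which reduce
// partition a file was written for.
def groupByReducePartition(
    files: Seq[PartitionedFile],
    partitionIdOf: PartitionedFile => Int): Seq[FilePartition] = {
  files
    .groupBy(partitionIdOf)
    .toSeq
    .sortBy(_._1)       // stable ordering by reduce partition id
    .map { case (id, fs) => FilePartition(id, fs.toArray) }
}
```

With file partitions aligned to the original shuffle partitions this way, the reduce side can report a matching output partitioning and avoid the extra SortExec/exchange.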
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]