szehon-ho opened a new pull request, #42306:
URL: https://github.com/apache/spark/pull/42306

   ### What changes were proposed in this pull request?
   - Add a new conf, spark.sql.sources.v2.bucketing.allowJoinKeysSubsetOfPartitionKeys.enabled (see the sketch after this list).
   - Change the key compatibility checks in EnsureRequirements: remove the check that all partition keys must appear in the join keys, so that isKeyCompatible can be true in this case.
   - Change BatchScanExec/DataSourceV2Relation to group splits by join keys when they differ from the partition keys (previously splits were grouped only by partition values).
   - Implement partiallyClustered skew handling:
     - Group only the replicate side (now by join key as well).
     - Add a final sort of the partitions by join key, since grouping the non-replicate side can leave the partitions out of order.
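
   A minimal sketch of how this could be exercised, assuming two DSv2 tables that are both partitioned by (id, dept) and report KeyGroupedPartitioning; the table names, column names, and the partiallyClusteredDistribution conf shown here are illustrative, not part of this PR:

   ```scala
   // Sketch only: `cat.db.orders` and `cat.db.customers` are hypothetical DSv2
   // tables, both assumed partitioned by (id, dept).
   spark.conf.set("spark.sql.sources.v2.bucketing.enabled", "true")
   spark.conf.set(
     "spark.sql.sources.v2.bucketing.allowJoinKeysSubsetOfPartitionKeys.enabled",
     "true")
   // Optionally combine with partially clustered distribution for skew handling.
   spark.conf.set(
     "spark.sql.sources.v2.bucketing.partiallyClusteredDistribution.enabled",
     "true")

   // The join condition uses only `id`, a subset of the partition keys (id, dept).
   // With the new conf, EnsureRequirements can still consider the two sides
   // key-compatible and BatchScanExec groups input splits by the join key,
   // so no shuffle is inserted on either side.
   val joined = spark.table("cat.db.orders")
     .join(spark.table("cat.db.customers"), Seq("id"))
   joined.explain()
   ```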
   
   ### Why are the changes needed?
   - Support storage-partitioned join (SPJ) in cases where the join condition does not contain all of the partition keys, but only some of them.
   
   ### Does this PR introduce _any_ user-facing change? 
   No
   
   ### How was this patch tested?
   - Added tests in KeyGroupedPartitioningSuite.
   - Found two problems, to be addressed in separate PRs:
     - https://github.com/apache/spark/pull/37886 made another change so that we have to select all join keys; otherwise the DSv2 scan does not report KeyGroupedPartitioning and SPJ is not triggered. Need to see how to relax this (see the sketch after this list).
     - https://issues.apache.org/jira/browse/SPARK-44641 was found while testing this change. This PR refactors some of that code to add group-by-join-key but does not change the underlying logic, so the issue still exists. Hopefully it will also get fixed another way.
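
   A hedged sketch of the first limitation above, using the same hypothetical tables partitioned by (id, dept); exactly which columns must remain in the scan output for KeyGroupedPartitioning to be reported is the behavior to be relaxed, so treat this as an assumption rather than a spec:

   ```scala
   // Keeping all of the key columns in the projection lets the scan keep
   // reporting KeyGroupedPartitioning, so SPJ can still be triggered
   // (table and column names are placeholders).
   spark.table("cat.db.orders").select("id", "dept", "amount")
     .join(spark.table("cat.db.customers").select("id", "dept", "name"), Seq("id"))
     .explain()

   // After https://github.com/apache/spark/pull/37886, dropping those key
   // columns from the projection means the scan no longer reports
   // KeyGroupedPartitioning, and the join falls back to a shuffle-based plan.
   ```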


