Hi,

Ryan and I drafted a design doc to support a new type of join, storage partitioned join, which covers bucket join support for DataSourceV2 but is more general. The goal is to let Spark leverage the distribution properties reported by data sources and eliminate shuffles whenever possible.

Design doc: https://docs.google.com/document/d/1foTkDSM91VxKgkEcBMsuAvEjNybjja-uHk-r3vtXWFE (includes a POC link at the end)

We'd like to start a discussion on the doc and any feedback is welcome!

Thanks,
Chao