Re: [DISCUSS] SPIP: Storage Partitioned Join for Data Source V2

DB Tsai Sat, 23 Oct 2021 23:29:14 -0700

+1 on this SPIP.

This is a more generalized version of bucketed tables and bucketed
joins, which can eliminate very expensive data shuffles during joins.
Many users in the Apache Spark community have wanted this feature for
a long time!


Thank you, Ryan and Chao, for working on this. I look forward to
seeing it as a new feature in Spark 3.3.

DB Tsai  |  https://www.dbtsai.com/  |  PGP 42E5B25A8F7A82C1

On Fri, Oct 22, 2021 at 12:18 PM Chao Sun <sunc...@apache.org> wrote:
>
> Hi,
>
> Ryan and I drafted a design doc to support a new type of join: storage 
> partitioned join which covers bucket join support for DataSourceV2 but is 
> more general. The goal is to let Spark leverage distribution properties 
> reported by data sources and eliminate shuffle whenever possible.
>
> Design doc: 
> https://docs.google.com/document/d/1foTkDSM91VxKgkEcBMsuAvEjNybjja-uHk-r3vtXWFE
>  (includes a POC link at the end)
>
> We'd like to start a discussion on the doc and any feedback is welcome!
>
> Thanks,
> Chao

