Re: [DISCUSS] SPIP: Storage Partitioned Join for Data Source V2

Ryan Blue Sun, 24 Oct 2021 09:33:54 -0700

+1 from me as well. Thanks Chao for doing so much to get it to this point!

On Sat, Oct 23, 2021 at 11:29 PM DB Tsai <dbt...@dbtsai.com> wrote:


> +1 on this SPIP.
>
> This is a more generalized version of bucketed tables and bucketed
> joins that can eliminate very expensive data shuffles during joins, and
> many users in the Apache Spark community have wanted this feature for
> a long time!
>
> Thank you, Ryan and Chao, for working on this, and I look forward to
> it as a new feature in Spark 3.3.
>
> DB Tsai  |  https://www.dbtsai.com/  |  PGP 42E5B25A8F7A82C1
>
> On Fri, Oct 22, 2021 at 12:18 PM Chao Sun <sunc...@apache.org> wrote:
> >
> > Hi,
> >
> > Ryan and I drafted a design doc to support a new type of join: storage
> > partitioned join which covers bucket join support for DataSourceV2 but is
> > more general. The goal is to let Spark leverage distribution properties
> > reported by data sources and eliminate shuffle whenever possible.
> >
> > Design doc:
> > https://docs.google.com/document/d/1foTkDSM91VxKgkEcBMsuAvEjNybjja-uHk-r3vtXWFE
> > (includes a POC link at the end)
> >
> > We'd like to start a discussion on the doc and any feedback is welcome!
> >
> > Thanks,
> > Chao
>


-- 
Ryan Blue
