Re: [DISCUSS] SPIP: Storage Partitioned Join for Data Source V2

huaxin gao Sun, 24 Oct 2021 09:59:38 -0700

+1. Thanks for lifting the current restrictions on bucket join and making
this more generalized.


On Sun, Oct 24, 2021 at 9:33 AM Ryan Blue <b...@apache.org> wrote:

> +1 from me as well. Thanks Chao for doing so much to get it to this point!
>
> On Sat, Oct 23, 2021 at 11:29 PM DB Tsai <dbt...@dbtsai.com> wrote:
>
>> +1 on this SPIP.
>>
>> This is a more generalized version of bucketed tables and bucketed
>> joins, which can eliminate very expensive data shuffles during joins, and
>> many users in the Apache Spark community have wanted this feature for
>> a long time!
>>
>> Thank you, Ryan and Chao, for working on this, and I look forward to
>> it as a new feature in Spark 3.3.
>>
>> DB Tsai  |  https://www.dbtsai.com/  |  PGP 42E5B25A8F7A82C1
>>
>> On Fri, Oct 22, 2021 at 12:18 PM Chao Sun <sunc...@apache.org> wrote:
>> >
>> > Hi,
>> >
>> > Ryan and I drafted a design doc to support a new type of join: storage
>> > partitioned join, which covers bucket join support for DataSourceV2 but is
>> > more general. The goal is to let Spark leverage distribution properties
>> > reported by data sources and eliminate shuffle whenever possible.
>> >
>> > Design doc:
>> > https://docs.google.com/document/d/1foTkDSM91VxKgkEcBMsuAvEjNybjja-uHk-r3vtXWFE
>> > (includes a POC link at the end)
>> >
>> > We'd like to start a discussion on the doc and any feedback is welcome!
>> >
>> > Thanks,
>> > Chao
>>
>
>
> --
> Ryan Blue
>
