Re: [DISCUSS] SPIP: Storage Partitioned Join for Data Source V2

Dongjoon Hyun Tue, 26 Oct 2021 01:25:16 -0700

+1 for this SPIP.

On Sun, Oct 24, 2021 at 9:59 AM huaxin gao <[email protected]> wrote:


> +1. Thanks for lifting the current restrictions on bucket join and making
> this more generalized.
>
> On Sun, Oct 24, 2021 at 9:33 AM Ryan Blue <[email protected]> wrote:
>
>> +1 from me as well. Thanks Chao for doing so much to get it to this point!
>>
>> On Sat, Oct 23, 2021 at 11:29 PM DB Tsai <[email protected]> wrote:
>>
>>> +1 on this SPIP.
>>>
>>> This is a more generalized version of bucketed tables and bucketed
>>> joins which can eliminate very expensive data shuffles during joins, and
>>> many users in the Apache Spark community have wanted this feature for
>>> a long time!
>>>
>>> Thank you, Ryan and Chao, for working on this, and I look forward to
>>> it as a new feature in Spark 3.3.
>>>
>>> DB Tsai  |  https://www.dbtsai.com/  |  PGP 42E5B25A8F7A82C1
>>>
>>> On Fri, Oct 22, 2021 at 12:18 PM Chao Sun <[email protected]> wrote:
>>> >
>>> > Hi,
>>> >
>>> > Ryan and I drafted a design doc to support a new type of join: storage
>>> > partitioned join which covers bucket join support for DataSourceV2 but is
>>> > more general. The goal is to let Spark leverage distribution properties
>>> > reported by data sources and eliminate shuffle whenever possible.
>>> >
>>> > Design doc:
>>> > https://docs.google.com/document/d/1foTkDSM91VxKgkEcBMsuAvEjNybjja-uHk-r3vtXWFE
>>> > (includes a POC link at the end)
>>> >
>>> > We'd like to start a discussion on the doc and any feedback is welcome!
>>> >
>>> > Thanks,
>>> > Chao
>>>
>>
>>
>> --
>> Ryan Blue
>>
>
