Re: [DISCUSS] SPIP: Storage Partitioned Join for Data Source V2

Ryan Blue Tue, 26 Oct 2021 09:39:03 -0700

Instead of commenting on the doc, could we keep discussion here on the dev
list please? That way more people can follow it and there is more room for
discussion. Comment threads have a very small area and easily become hard
to follow.


Ryan

On Tue, Oct 26, 2021 at 9:32 AM John Zhuge <jzh...@apache.org> wrote:

> +1  Nicely done!
>
> On Tue, Oct 26, 2021 at 8:08 AM Chao Sun <sunc...@apache.org> wrote:
>
>> Oops, sorry. I just fixed the permission setting.
>>
>> Thanks everyone for the positive support!
>>
>> On Tue, Oct 26, 2021 at 7:30 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>>
>>> +1 to this SPIP and nice writeup of the design doc!
>>>
>>> Can we open comment permission in the doc so that we can discuss details
>>> there?
>>>
>>> On Tue, Oct 26, 2021 at 8:29 PM Hyukjin Kwon <gurwls...@gmail.com>
>>> wrote:
>>>
>>>> Makes sense to me.
>>>>
>>>> Would be great to have some feedback from people such as @Wenchen Fan
>>>> <wenc...@databricks.com> @Cheng Su <chen...@fb.com> @angers zhu
>>>> <angers....@gmail.com>.
>>>>
>>>>
>>>> On Tue, 26 Oct 2021 at 17:25, Dongjoon Hyun <dongjoon.h...@gmail.com>
>>>> wrote:
>>>>
>>>>> +1 for this SPIP.
>>>>>
>>>>> On Sun, Oct 24, 2021 at 9:59 AM huaxin gao <huaxin.ga...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> +1. Thanks for lifting the current restrictions on bucket join and
>>>>>> making this more generalized.
>>>>>>
>>>>>> On Sun, Oct 24, 2021 at 9:33 AM Ryan Blue <b...@apache.org> wrote:
>>>>>>
>>>>>>> +1 from me as well. Thanks Chao for doing so much to get it to this
>>>>>>> point!
>>>>>>>
>>>>>>> On Sat, Oct 23, 2021 at 11:29 PM DB Tsai <dbt...@dbtsai.com> wrote:
>>>>>>>
>>>>>>>> +1 on this SPIP.
>>>>>>>>
>>>>>>>> This is a more generalized version of bucketed tables and bucketed
>>>>>>>> joins, which can eliminate very expensive data shuffles when joining.
>>>>>>>> Many users in the Apache Spark community have wanted this feature
>>>>>>>> for a long time!
>>>>>>>>
>>>>>>>> Thank you, Ryan and Chao, for working on this, and I look forward to
>>>>>>>> it as a new feature in Spark 3.3.
>>>>>>>>
>>>>>>>> DB Tsai  |  https://www.dbtsai.com/  |  PGP 42E5B25A8F7A82C1
>>>>>>>>
>>>>>>>> On Fri, Oct 22, 2021 at 12:18 PM Chao Sun <sunc...@apache.org>
>>>>>>>> wrote:
>>>>>>>> >
>>>>>>>> > Hi,
>>>>>>>> >
>>>>>>>> > Ryan and I drafted a design doc to support a new type of join:
>>>>>>>> > storage partitioned join which covers bucket join support for
>>>>>>>> > DataSourceV2 but is more general. The goal is to let Spark leverage
>>>>>>>> > distribution properties reported by data sources and eliminate
>>>>>>>> > shuffle whenever possible.
>>>>>>>> >
>>>>>>>> > Design doc:
>>>>>>>> > https://docs.google.com/document/d/1foTkDSM91VxKgkEcBMsuAvEjNybjja-uHk-r3vtXWFE
>>>>>>>> > (includes a POC link at the end)
>>>>>>>> >
>>>>>>>> > We'd like to start a discussion on the doc and any feedback is
>>>>>>>> > welcome!
>>>>>>>> >
>>>>>>>> > Thanks,
>>>>>>>> > Chao
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>>
>>>>>>
>
> --
> John Zhuge
>


-- 
Ryan Blue
