Re: [DISCUSS] SPIP: Storage Partitioned Join for Data Source V2

Erik Krogen Tue, 26 Oct 2021 12:01:39 -0700

It's great to see this SPIP going live. Once this is complete, it will
really help Spark play nicely with the broader data ecosystem (Hive,
Iceberg, Trino, etc.). It's also great to see that, besides bringing the
existing bucketed-join support to V2, we are broadening the types of
partitioning that can be accommodated and leaving open pathways
for future optimizations like partially clustered distributions.
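
As a rough illustration of why co-partitioned data lets a join skip the
shuffle, here is a minimal Python sketch. It is not taken from the design
doc or from Spark's code: `write_bucketed`, `storage_partitioned_join`, the
modulo bucketing, and the sample rows are all hypothetical, and a real
source would report its own partitioning to Spark.

```python
from collections import defaultdict


def write_bucketed(rows, num_buckets):
    """Simulate a source that stores (key, value) rows pre-partitioned by key.

    Assumption: integer keys and a plain modulo bucketing function.
    """
    buckets = [[] for _ in range(num_buckets)]
    for key, value in rows:
        buckets[key % num_buckets].append((key, value))
    return buckets


def storage_partitioned_join(left_buckets, right_buckets):
    """Inner-join bucket i of the left side only with bucket i of the right.

    Because both sides were bucketed the same way, matching keys can only
    live in the same bucket index, so no row has to move between buckets.
    """
    if len(left_buckets) != len(right_buckets):
        raise ValueError("sides are not co-partitioned; a shuffle would be needed")
    joined = []
    for left, right in zip(left_buckets, right_buckets):
        # Build a small hash index over the right-hand bucket.
        index = defaultdict(list)
        for key, value in right:
            index[key].append(value)
        # Probe it with the rows of the matching left-hand bucket.
        for key, value in left:
            for other in index.get(key, []):
                joined.append((key, value, other))
    return joined


orders = [(1, "order-a"), (2, "order-b"), (5, "order-c"), (6, "order-d")]
users = [(1, "alice"), (2, "bob"), (6, "carol"), (7, "dave")]

joined = storage_partitioned_join(
    write_bucketed(orders, 4), write_bucketed(users, 4)
)
print(sorted(joined))
```

Each bucket pair is joined independently, which is the property that would
let an engine run one task per partition without exchanging data first.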


Big thanks to Ryan and Chao!

On Tue, Oct 26, 2021 at 10:35 AM Cheng Su <[email protected]> wrote:

> +1 for this. This is an exciting move toward efficiently reading bucketed
> tables from other systems (Hive, Trino & Presto)!
>
>
>
> I am still looking at the details but have some early questions:
>
>
>
>    1. Is migrating the Hive table read path to data source v2 a
>    prerequisite of this SPIP?
>
>
>
> The Hive table read path is currently a mix of data source v1 (for the
> Parquet & ORC file formats only) and the legacy Hive code path
> (HiveTableScanExec). In the SPIP, I see we only make changes for data
> source v2, so I am wondering how this would work with the existing Hive
> table read path. In addition, just FYI, support for writing Hive bucketed
> tables was merged into master recently (
> SPARK-19256 <https://issues.apache.org/jira/browse/SPARK-19256> has
> details).
>
>
>
>    2. Would aggregate work automatically after the SPIP?
>
>
>
> Another major benefit of having bucketed tables is avoiding the shuffle
> before an aggregate. I just want to bring to our attention that it would be
> great to consider aggregate as well in this proposal.
>
>
>
>    3. Any major use cases in mind other than Hive bucketed tables?
>
>
>
> Just curious whether there are any other use cases we are targeting as part
> of the SPIP.
>
>
>
> Thanks,
>
> Cheng Su
>
>
>
>
>
>
>
> *From: *Ryan Blue <[email protected]>
> *Date: *Tuesday, October 26, 2021 at 9:39 AM
> *To: *John Zhuge <[email protected]>
> *Cc: *Chao Sun <[email protected]>, Wenchen Fan <[email protected]>,
> Cheng Su <[email protected]>, DB Tsai <[email protected]>, Dongjoon Hyun <
> [email protected]>, Hyukjin Kwon <[email protected]>, Wenchen Fan
> <[email protected]>, angers zhu <[email protected]>, dev <
> [email protected]>, huaxin gao <[email protected]>
> *Subject: *Re: [DISCUSS] SPIP: Storage Partitioned Join for Data Source V2
>
> Instead of commenting on the doc, could we keep discussion here on the dev
> list please? That way more people can follow it and there is more room for
> discussion. Comment threads have a very small area and easily become hard
> to follow.
>
>
>
> Ryan
>
>
>
> On Tue, Oct 26, 2021 at 9:32 AM John Zhuge <[email protected]> wrote:
>
> +1 Nicely done!
>
>
>
> On Tue, Oct 26, 2021 at 8:08 AM Chao Sun <[email protected]> wrote:
>
> Oops, sorry. I just fixed the permission setting.
>
>
>
> Thanks everyone for the positive support!
>
>
>
> On Tue, Oct 26, 2021 at 7:30 AM Wenchen Fan <[email protected]> wrote:
>
> +1 to this SPIP and nice writeup of the design doc!
>
>
>
> Can we open comment permission in the doc so that we can discuss details
> there?
>
>
>
> On Tue, Oct 26, 2021 at 8:29 PM Hyukjin Kwon <[email protected]> wrote:
>
> This makes sense to me.
>
> Would be great to have some feedback from people such as @Wenchen Fan
> <[email protected]> @Cheng Su <[email protected]> @angers zhu
> <[email protected]>.
>
>
>
>
>
> On Tue, 26 Oct 2021 at 17:25, Dongjoon Hyun <[email protected]>
> wrote:
>
> +1 for this SPIP.
>
>
>
> On Sun, Oct 24, 2021 at 9:59 AM huaxin gao <[email protected]> wrote:
>
> +1. Thanks for lifting the current restrictions on bucket join and making
> this more generalized.
>
>
>
> On Sun, Oct 24, 2021 at 9:33 AM Ryan Blue <[email protected]> wrote:
>
> +1 from me as well. Thanks Chao for doing so much to get it to this point!
>
>
>
> On Sat, Oct 23, 2021 at 11:29 PM DB Tsai <[email protected]> wrote:
>
> +1 on this SPIP.
>
> This is a more generalized version of bucketed tables and bucketed
> joins, which can eliminate very expensive data shuffles when joining, and
> many users in the Apache Spark community have wanted this feature for
> a long time!
>
> Thank you, Ryan and Chao, for working on this, and I look forward to
> it as a new feature in Spark 3.3.
>
> DB Tsai  |  https://www.dbtsai.com/  |  PGP 42E5B25A8F7A82C1
>
> On Fri, Oct 22, 2021 at 12:18 PM Chao Sun <[email protected]> wrote:
> >
> > Hi,
> >
> > Ryan and I drafted a design doc to support a new type of join: storage
> > partitioned join, which covers bucket join support for DataSourceV2 but is
> > more general. The goal is to let Spark leverage distribution properties
> > reported by data sources and eliminate shuffle whenever possible.
> >
> > Design doc:
> > https://docs.google.com/document/d/1foTkDSM91VxKgkEcBMsuAvEjNybjja-uHk-r3vtXWFE
> > (includes a POC link at the end)
> >
> > We'd like to start a discussion on the doc and any feedback is welcome!
> >
> > Thanks,
> > Chao
>
>
>
>
> --
>
> Ryan Blue
>
>
>
>
> --
>
> John Zhuge
>
>
>
>
> --
>
> Ryan Blue
>
