Re: [DISCUSS] SPIP: Storage Partitioned Join for Data Source V2

L. C. Hsieh Wed, 27 Oct 2021 09:42:47 -0700

+1 for the SPIP. This is a great improvement and optimization!

On 2021/10/26 19:01:03, Erik Krogen <xkro...@apache.org> wrote: 
> It's great to see this SPIP going live. Once this is complete, it will
> really help Spark to play nicely with a broader data ecosystem (Hive,
> Iceberg, Trino, etc.), and it's great to see that besides just bringing the
> existing bucketed-join support to V2, we are also making the types of
> partitioning that can be accommodated more broad and leaving open pathways
> for future optimizations like partially clustered distributions.
> 
> Big thanks to Ryan and Chao!
> 
> On Tue, Oct 26, 2021 at 10:35 AM Cheng Su <chen...@fb.com.invalid> wrote:
> 
> > +1 for this. This is an exciting move toward efficiently reading bucketed
> > tables from other systems (Hive, Trino & Presto)!
> >
> >
> >
> > Still looking at the details but having some early questions:
> >
> >
> >
> >    1. Is migrating the Hive table read path to data source v2 a
> >    prerequisite of this SPIP?
> >
> >
> >
> > The Hive table read path is currently a mix of data source v1 (for the
> > Parquet & ORC file formats only) and the legacy Hive code path
> > (HiveTableScanExec). In the SPIP, I see that we only make changes for data
> > source v2, so I am wondering how this would work with the existing Hive
> > table read path. In addition, just FYI, support for writing Hive bucketed
> > tables was recently merged into master (
> > SPARK-19256 <https://issues.apache.org/jira/browse/SPARK-19256> has
> > details).
> >
> >
> >
> >    2. Would aggregate work automatically after the SPIP?
> >
> >
> >
> > Another major benefit of bucketed tables is avoiding the shuffle before
> > an aggregate. I just want to bring to our attention that it would be
> > great to consider aggregates as well in this proposal.
> >
> >
> >
> >    3. Any major use cases in mind other than Hive bucketed tables?
> >
> >
> >
> > Just curious whether there are any other use cases we are targeting as
> > part of the SPIP.
> >
> >
> >
> > Thanks,
> >
> > Cheng Su
> >
> >
> >
> >
> >
> >
> >
> > *From: *Ryan Blue <b...@apache.org>
> > *Date: *Tuesday, October 26, 2021 at 9:39 AM
> > *To: *John Zhuge <jzh...@apache.org>
> > *Cc: *Chao Sun <sunc...@apache.org>, Wenchen Fan <cloud0...@gmail.com>,
> > Cheng Su <chen...@fb.com>, DB Tsai <dbt...@dbtsai.com>, Dongjoon Hyun <
> > dongjoon.h...@gmail.com>, Hyukjin Kwon <gurwls...@gmail.com>, Wenchen Fan
> > <wenc...@databricks.com>, angers zhu <angers....@gmail.com>, dev <
> > dev@spark.apache.org>, huaxin gao <huaxin.ga...@gmail.com>
> > *Subject: *Re: [DISCUSS] SPIP: Storage Partitioned Join for Data Source V2
> >
> > Instead of commenting on the doc, could we keep discussion here on the dev
> > list please? That way more people can follow it and there is more room for
> > discussion. Comment threads have a very small area and easily become hard
> > to follow.
> >
> >
> >
> > Ryan
> >
> >
> >
> > On Tue, Oct 26, 2021 at 9:32 AM John Zhuge <jzh...@apache.org> wrote:
> >
> > +1  Nicely done!
> >
> >
> >
> > On Tue, Oct 26, 2021 at 8:08 AM Chao Sun <sunc...@apache.org> wrote:
> >
> > Oops, sorry. I just fixed the permission setting.
> >
> >
> >
> > Thanks everyone for the positive support!
> >
> >
> >
> > On Tue, Oct 26, 2021 at 7:30 AM Wenchen Fan <cloud0...@gmail.com> wrote:
> >
> > +1 to this SPIP and nice writeup of the design doc!
> >
> >
> >
> > Can we open comment permission in the doc so that we can discuss details
> > there?
> >
> >
> >
> > On Tue, Oct 26, 2021 at 8:29 PM Hyukjin Kwon <gurwls...@gmail.com> wrote:
> >
> > This makes sense to me.
> >
> > Would be great to have some feedback from people such as @Wenchen Fan
> > <wenc...@databricks.com> @Cheng Su <chen...@fb.com> @angers zhu
> > <angers....@gmail.com>.
> >
> >
> >
> >
> >
> > On Tue, 26 Oct 2021 at 17:25, Dongjoon Hyun <dongjoon.h...@gmail.com>
> > wrote:
> >
> > +1 for this SPIP.
> >
> >
> >
> > On Sun, Oct 24, 2021 at 9:59 AM huaxin gao <huaxin.ga...@gmail.com> wrote:
> >
> > +1. Thanks for lifting the current restrictions on bucket join and making
> > this more generalized.
> >
> >
> >
> > On Sun, Oct 24, 2021 at 9:33 AM Ryan Blue <b...@apache.org> wrote:
> >
> > +1 from me as well. Thanks Chao for doing so much to get it to this point!
> >
> >
> >
> > On Sat, Oct 23, 2021 at 11:29 PM DB Tsai <dbt...@dbtsai.com> wrote:
> >
> > +1 on this SPIP.
> >
> > This is a more generalized version of bucketed tables and bucketed
> > joins which can eliminate very expensive data shuffles during joins, and
> > many users in the Apache Spark community have wanted this feature for
> > a long time!
> >
> > Thank you, Ryan and Chao, for working on this, and I look forward to
> > it as a new feature in Spark 3.3
> >
> > DB Tsai  |  https://www.dbtsai.com/  |  PGP 42E5B25A8F7A82C1
> >
> > On Fri, Oct 22, 2021 at 12:18 PM Chao Sun <sunc...@apache.org> wrote:
> > >
> > > Hi,
> > >
> > > Ryan and I drafted a design doc to support a new type of join: storage
> > > partitioned join, which covers bucket join support for DataSourceV2 but is
> > > more general. The goal is to let Spark leverage distribution properties
> > > reported by data sources and eliminate shuffle whenever possible.
> > >
> > > Design doc:
> > > https://docs.google.com/document/d/1foTkDSM91VxKgkEcBMsuAvEjNybjja-uHk-r3vtXWFE
> > > (includes a POC link at the end)
> > >
> > > We'd like to start a discussion on the doc and any feedback is welcome!
> > >
> > > Thanks,
> > > Chao
> >
> >
> >
> >
> > --
> >
> > Ryan Blue
> >
> >
> >
> >
> > --
> >
> > John Zhuge
> >
> >
> >
> >
> > --
> >
> > Ryan Blue
> >
>


