+1 for the SPIP. This is a great improvement and optimization!

On 2021/10/26 19:01:03, Erik Krogen <xkro...@apache.org> wrote:
> It's great to see this SPIP going live. Once this is complete, it will
> really help Spark to play nicely with a broader data ecosystem (Hive,
> Iceberg, Trino, etc.), and it's great to see that besides just bringing the
> existing bucketed-join support to V2, we are also broadening the types of
> partitioning that can be accommodated and leaving open pathways
> for future optimizations like partially clustered distributions.
>
> Big thanks to Ryan and Chao!
>
> On Tue, Oct 26, 2021 at 10:35 AM Cheng Su <chen...@fb.com.invalid> wrote:
>
> > +1 for this. This is an exciting move toward efficiently reading bucketed
> > tables from other systems (Hive, Trino & Presto)!
> >
> > Still looking at the details, but I have some early questions:
> >
> > 1. Is migrating the Hive table read path to data source v2 a
> > prerequisite of this SPIP?
> >
> > The Hive table read path is currently a mix of data source v1 (for the
> > Parquet & ORC file formats only) and the legacy Hive code path
> > (HiveTableScanExec). In the SPIP, I see that we only make changes for data
> > source v2, so I am wondering how this would work with the existing Hive
> > table read path. In addition, just FYI, support for writing Hive bucketed
> > tables was merged into master recently (SPARK-19256
> > <https://issues.apache.org/jira/browse/SPARK-19256> has details).
> >
> > 2. Would aggregate work automatically after the SPIP?
> >
> > Another major benefit of having bucketed tables is avoiding a shuffle
> > before aggregate. I just want to bring to our attention that it would be
> > great to consider aggregate as well in this proposal.
> >
> > 3. Any major use cases in mind other than Hive bucketed tables?
> >
> > Just curious whether there are any other use cases we are targeting as
> > part of this SPIP.
> >
> > Thanks,
> >
> > Cheng Su
> >
> > *From: *Ryan Blue <b...@apache.org>
> > *Date: *Tuesday, October 26, 2021 at 9:39 AM
> > *To: *John Zhuge <jzh...@apache.org>
> > *Cc: *Chao Sun <sunc...@apache.org>, Wenchen Fan <cloud0...@gmail.com>,
> > Cheng Su <chen...@fb.com>, DB Tsai <dbt...@dbtsai.com>, Dongjoon Hyun <
> > dongjoon.h...@gmail.com>, Hyukjin Kwon <gurwls...@gmail.com>, Wenchen Fan
> > <wenc...@databricks.com>, angers zhu <angers....@gmail.com>, dev <
> > dev@spark.apache.org>, huaxin gao <huaxin.ga...@gmail.com>
> > *Subject: *Re: [DISCUSS] SPIP: Storage Partitioned Join for Data Source V2
> >
> > Instead of commenting on the doc, could we keep discussion here on the dev
> > list please? That way more people can follow it and there is more room for
> > discussion. Comment threads have a very small area and easily become hard
> > to follow.
> >
> > Ryan
> >
> > On Tue, Oct 26, 2021 at 9:32 AM John Zhuge <jzh...@apache.org> wrote:
> >
> > +1 Nicely done!
> >
> > On Tue, Oct 26, 2021 at 8:08 AM Chao Sun <sunc...@apache.org> wrote:
> >
> > Oops, sorry. I just fixed the permission setting.
> >
> > Thanks everyone for the positive support!
> >
> > On Tue, Oct 26, 2021 at 7:30 AM Wenchen Fan <cloud0...@gmail.com> wrote:
> >
> > +1 to this SPIP and nice writeup of the design doc!
> >
> > Can we open comment permission in the doc so that we can discuss details
> > there?
> >
> > On Tue, Oct 26, 2021 at 8:29 PM Hyukjin Kwon <gurwls...@gmail.com> wrote:
> >
> > This makes sense to me.
> >
> > It would be great to have some feedback from people such as @Wenchen Fan
> > <wenc...@databricks.com> @Cheng Su <chen...@fb.com> @angers zhu
> > <angers....@gmail.com>.
> >
> > On Tue, 26 Oct 2021 at 17:25, Dongjoon Hyun <dongjoon.h...@gmail.com>
> > wrote:
> >
> > +1 for this SPIP.
> >
> > On Sun, Oct 24, 2021 at 9:59 AM huaxin gao <huaxin.ga...@gmail.com> wrote:
> >
> > +1. Thanks for lifting the current restrictions on bucket join and making
> > this more generalized.
> >
> > On Sun, Oct 24, 2021 at 9:33 AM Ryan Blue <b...@apache.org> wrote:
> >
> > +1 from me as well. Thanks Chao for doing so much to get it to this point!
> >
> > On Sat, Oct 23, 2021 at 11:29 PM DB Tsai <dbt...@dbtsai.com> wrote:
> >
> > +1 on this SPIP.
> >
> > This is a more generalized version of bucketed tables and bucketed
> > joins, which can eliminate very expensive data shuffles when joining, and
> > many users in the Apache Spark community have wanted this feature for
> > a long time!
> >
> > Thank you, Ryan and Chao, for working on this, and I look forward to
> > it as a new feature in Spark 3.3.
> >
> > DB Tsai | https://www.dbtsai.com/ | PGP 42E5B25A8F7A82C1
> >
> > On Fri, Oct 22, 2021 at 12:18 PM Chao Sun <sunc...@apache.org> wrote:
> > >
> > > Hi,
> > >
> > > Ryan and I drafted a design doc to support a new type of join: storage
> > > partitioned join, which covers bucket join support for DataSourceV2 but
> > > is more general. The goal is to let Spark leverage distribution
> > > properties reported by data sources and eliminate shuffle whenever
> > > possible.
> > >
> > > Design doc:
> > > https://docs.google.com/document/d/1foTkDSM91VxKgkEcBMsuAvEjNybjja-uHk-r3vtXWFE
> > > (includes a POC link at the end)
> > >
> > > We'd like to start a discussion on the doc and any feedback is welcome!
> > >
> > > Thanks,
> > > Chao
> >
> > --
> > Ryan Blue
> >
> > --
> > John Zhuge
> >
> > --
> > Ryan Blue