+1

Kent Yao
On 2024/04/30 09:07:21 Yuming Wang wrote:
> +1
>
> On Tue, Apr 30, 2024 at 3:31 PM Ye Xianjin <advance...@gmail.com> wrote:
>
>> +1
>>
>> Sent from my iPhone
>>
>>> On Apr 30, 2024, at 3:23 PM, DB Tsai <dbt...@dbtsai.com> wrote:
>>>
>>> +1
>>>
>>>> On Apr 29, 2024, at 8:01 PM, Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>
>>>> To add more color:
>>>>
>>>> Spark data source tables and Hive SerDe tables are both stored in the
>>>> Hive metastore and keep their data files in the table directory. The
>>>> only difference is that they have different "table providers", which
>>>> means Spark will use different readers/writers. Ideally the Spark
>>>> native data source reader/writer is faster than the Hive SerDe one.
>>>>
>>>> What's more, the default format for Hive SerDe tables is text. I don't
>>>> think people want to use text-format tables in production. Most people
>>>> will add `STORED AS parquet` or `USING parquet` explicitly. By setting
>>>> this config to false, we get a more reasonable default behavior:
>>>> creating Parquet tables (or whatever is specified by
>>>> `spark.sql.sources.default`).
>>>>
>>>> On Tue, Apr 30, 2024 at 10:45 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>
>>>>> @Mich Talebzadeh <mich.talebza...@gmail.com> there seems to be a
>>>>> misunderstanding here. The Spark native data source table is still
>>>>> stored in the Hive metastore; it's just that Spark will use a
>>>>> different (and faster) reader/writer for it. `hive-site.xml` should
>>>>> work as it does today.
>>>>>
>>>>> On Tue, Apr 30, 2024 at 5:23 AM Hyukjin Kwon <gurwls...@apache.org> wrote:
>>>>>
>>>>>> +1
>>>>>>
>>>>>> It's a legacy conf that we should eventually remove. Spark should
>>>>>> create Spark tables by default, not Hive tables.
>>>>>>
>>>>>> Mich, for your workload, you can simply switch that conf off if it
>>>>>> concerns you. We also enabled ANSI as well (which you agreed on).
>>>>>> It's a bit awkward to stop in the middle for this compatibility
>>>>>> reason while making Spark sound.
>>>>>> The compatibility has been tested in production for a long time, so
>>>>>> I don't see any particular issue with the compatibility case you
>>>>>> mentioned.
>>>>>>
>>>>>> On Mon, Apr 29, 2024 at 2:08 AM Mich Talebzadeh
>>>>>> <mich.talebza...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi @Wenchen Fan <cloud0...@gmail.com>,
>>>>>>>
>>>>>>> Thanks for your response. I believe we have not had enough time to
>>>>>>> "DISCUSS" this matter.
>>>>>>>
>>>>>>> Currently, in order to make Spark take advantage of Hive, I create
>>>>>>> a soft link in $SPARK_HOME/conf. FYI, my Spark version is 3.4.0 and
>>>>>>> Hive is 3.1.1:
>>>>>>>
>>>>>>> /opt/spark/conf/hive-site.xml ->
>>>>>>> /data6/hduser/hive-3.1.1/conf/hive-site.xml
>>>>>>>
>>>>>>> This works fine for me in my lab. So in the future, if we opt to
>>>>>>> set "spark.sql.legacy.createHiveTableByDefault" to false, there
>>>>>>> will no longer be a need for this logical link?
>>>>>>>
>>>>>>> On the face of it, this looks fine, but in real life it may require
>>>>>>> a number of changes to old scripts. Hence my concern.
>>>>>>>
>>>>>>> As a matter of interest, has anyone liaised with the Hive team to
>>>>>>> ensure they have introduced the additional changes you outlined?
>>>>>>>
>>>>>>> HTH
>>>>>>>
>>>>>>> Mich Talebzadeh,
>>>>>>> Technologist | Architect | Data Engineer | Generative AI | FinCrime
>>>>>>> London, United Kingdom
>>>>>>>
>>>>>>> View my LinkedIn profile
>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>>
>>>>>>> https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>>>
>>>>>>> *Disclaimer:* The information provided is correct to the best of my
>>>>>>> knowledge but of course cannot be guaranteed. It is essential to
>>>>>>> note that, as with any advice, "one test result is worth
>>>>>>> one-thousand expert opinions" (Wernher von Braun
>>>>>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>).
>>>>>>> On Sun, 28 Apr 2024 at 09:34, Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>>>>
>>>>>>>> @Mich Talebzadeh <mich.talebza...@gmail.com> thanks for sharing
>>>>>>>> your concern!
>>>>>>>>
>>>>>>>> Note: creating Spark native data source tables is usually Hive
>>>>>>>> compatible as well, unless we use features that Hive does not
>>>>>>>> support (TIMESTAMP NTZ, ANSI INTERVAL, etc.). I think it's a
>>>>>>>> better default to create a Spark native table in this case,
>>>>>>>> instead of creating a Hive table and failing.
>>>>>>>>
>>>>>>>> On Sat, Apr 27, 2024 at 12:46 PM Cheng Pan <pan3...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> +1 (non-binding)
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Cheng Pan
>>>>>>>>>
>>>>>>>>> On Sat, Apr 27, 2024 at 9:29 AM Holden Karau
>>>>>>>>> <holden.ka...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> +1
>>>>>>>>>>
>>>>>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>>>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>>>>>>>> https://amzn.to/2MaRAG9
>>>>>>>>>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>>>>>>>
>>>>>>>>>> On Fri, Apr 26, 2024 at 12:06 PM L. C. Hsieh <vii...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> +1
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Apr 26, 2024 at 10:01 AM Dongjoon Hyun
>>>>>>>>>>> <dongj...@apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I'll start with my +1.
>>>>>>>>>>>>
>>>>>>>>>>>> Dongjoon.
>>>>>>>>>>>>
>>>>>>>>>>>> On 2024/04/26 16:45:51 Dongjoon Hyun wrote:
>>>>>>>>>>>>> Please vote on SPARK-46122 to set
>>>>>>>>>>>>> spark.sql.legacy.createHiveTableByDefault to `false` by
>>>>>>>>>>>>> default. The technical scope is defined in the following PR.
>>>>>>>>>>>>>
>>>>>>>>>>>>> - DISCUSSION:
>>>>>>>>>>>>>   https://lists.apache.org/thread/ylk96fg4lvn6klxhj6t6yh42lyqb8wmd
>>>>>>>>>>>>> - JIRA: https://issues.apache.org/jira/browse/SPARK-46122
>>>>>>>>>>>>> - PR: https://github.com/apache/spark/pull/46207
>>>>>>>>>>>>>
>>>>>>>>>>>>> The vote is open until April 30th 1AM (PST) and passes
>>>>>>>>>>>>> if a majority of +1 PMC votes are cast, with a minimum of
>>>>>>>>>>>>> 3 +1 votes.
>>>>>>>>>>>>>
>>>>>>>>>>>>> [ ] +1 Set spark.sql.legacy.createHiveTableByDefault to
>>>>>>>>>>>>>     false by default
>>>>>>>>>>>>> [ ] -1 Do not change spark.sql.legacy.createHiveTableByDefault
>>>>>>>>>>>>>     because ...
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thank you in advance.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Dongjoon

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
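For readers following the thread, a minimal Spark SQL sketch of the behavior under discussion (table and column names here are illustrative, not from the thread). With `spark.sql.legacy.createHiveTableByDefault=false`, a `CREATE TABLE` without a `USING` or `STORED AS` clause creates a Spark native table in the format given by `spark.sql.sources.default` (Parquet by default), rather than a text-format Hive SerDe table:

```sql
-- Proposed default (spark.sql.legacy.createHiveTableByDefault=false):
-- a plain CREATE TABLE now produces a Spark native Parquet table,
-- per spark.sql.sources.default, instead of a text-format Hive SerDe table.
CREATE TABLE events (id BIGINT, name STRING);

-- These explicit forms behave the same under either setting:
CREATE TABLE events_native (id BIGINT, name STRING) USING parquet;     -- Spark native
CREATE TABLE events_serde  (id BIGINT, name STRING) STORED AS parquet; -- Hive SerDe

-- Existing workloads can restore the legacy behavior, as suggested above:
SET spark.sql.legacy.createHiveTableByDefault=true;
```

In both cases the table metadata lives in the Hive metastore; only the table provider, and hence the reader/writer Spark uses, differs.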