+1

On Tue, Apr 30, 2024 at 3:31 PM Ye Xianjin <advance...@gmail.com> wrote:

> +1
> Sent from my iPhone
>
> On Apr 30, 2024, at 3:23 PM, DB Tsai <dbt...@dbtsai.com> wrote:
>
> 
> +1
>
> On Apr 29, 2024, at 8:01 PM, Wenchen Fan <cloud0...@gmail.com> wrote:
>
> 
> To add more color:
>
> Spark data source tables and Hive SerDe tables are both stored in the Hive
> metastore and keep their data files in the table directory. The only
> difference is that they have different "table providers", which means Spark
> will use different readers/writers. The Spark native data source
> reader/writer is generally faster than the Hive SerDe ones.
>
> What's more, the default format of a Hive SerDe table is text. I don't think
> people want to use text-format tables in production. Most people will add
> `STORED AS parquet` or `USING parquet` explicitly. By setting this config
> to false, we get a more reasonable default behavior: creating Parquet
> tables (or whatever is specified by `spark.sql.sources.default`).
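> To make the difference concrete, here is a sketch (table names are made up)
> of how a plain CREATE TABLE behaves under each setting:
>
> ```sql
> -- With spark.sql.legacy.createHiveTableByDefault=true (the old behavior),
> -- a CREATE TABLE without USING/STORED AS creates a Hive SerDe table in
> -- the default text format:
> CREATE TABLE t_legacy (id INT, name STRING);
>
> -- Most people specify the format explicitly to avoid text:
> CREATE TABLE t_hive (id INT, name STRING) STORED AS parquet;
> CREATE TABLE t_native (id INT, name STRING) USING parquet;
>
> -- With the config set to false, the plain statement above instead
> -- creates a Spark native table using spark.sql.sources.default
> -- (parquet unless overridden).
> ```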
>
> On Tue, Apr 30, 2024 at 10:45 AM Wenchen Fan <cloud0...@gmail.com> wrote:
>
>> @Mich Talebzadeh <mich.talebza...@gmail.com> there seems to be a
>> misunderstanding here. The Spark native data source table is still stored
>> in the Hive metastore, it's just that Spark will use a different (and
>> faster) reader/writer for it. `hive-site.xml` should work as it is today.
>>
>> On Tue, Apr 30, 2024 at 5:23 AM Hyukjin Kwon <gurwls...@apache.org>
>> wrote:
>>
>>> +1
>>>
>>> It's a legacy conf that we should eventually remove. Spark
>>> should create Spark tables by default, not Hive tables.
>>>
>>> Mich, for your workload, you can simply switch that conf off if it
>>> concerns you. We also enabled ANSI mode (which you agreed on). It would be
>>> a bit awkward to stop halfway, for this compatibility reason, while making
>>> Spark's defaults sound. The new behavior has been tested in production for
>>> a long time, so I don't see any particular issue with the compatibility
>>> case you mentioned.
>>>
>>> On Mon, Apr 29, 2024 at 2:08 AM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>>
>>>> Hi @Wenchen Fan <cloud0...@gmail.com>
>>>>
>>>> Thanks for your response. I believe we have not had enough time to
>>>> "DISCUSS" this matter.
>>>>
>>>> Currently, in order to make Spark take advantage of Hive, I create a
>>>> soft link in $SPARK_HOME/conf. FYI, my Spark version is 3.4.0 and Hive is
>>>> 3.1.1:
>>>>
>>>>  /opt/spark/conf/hive-site.xml ->
>>>> /data6/hduser/hive-3.1.1/conf/hive-site.xml
>>>>
>>>> This works fine for me in my lab. So in the future, if we opt to set
>>>> "spark.sql.legacy.createHiveTableByDefault" to false, will there no
>>>> longer be a need for this symbolic link?
>>>> On the face of it this looks fine, but in real life it may require a
>>>> number of changes to old scripts; hence my concern.
>>>> As a matter of interest, has anyone liaised with the Hive team to ensure
>>>> they have introduced the additional changes you outlined?
>>>>
>>>> HTH
>>>>
>>>> Mich Talebzadeh,
>>>> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
>>>> London
>>>> United Kingdom
>>>>
>>>>
>>>>    view my Linkedin profile
>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>>
>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>
>>>>
>>>>
>>>> *Disclaimer:* The information provided is correct to the best of my
>>>> knowledge but of course cannot be guaranteed. It is essential to note
>>>> that, as with any advice, "one test result is worth one-thousand
>>>> expert opinions" (Wernher von Braun
>>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>).
>>>>
>>>>
>>>> On Sun, 28 Apr 2024 at 09:34, Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>
>>>>> @Mich Talebzadeh <mich.talebza...@gmail.com> thanks for sharing your
>>>>> concern!
>>>>>
>>>>> Note: creating Spark native data source tables is usually
>>>>> Hive-compatible as well, unless we use features that Hive does not
>>>>> support (TIMESTAMP NTZ, ANSI INTERVAL, etc.). I think it's a better
>>>>> default to create a Spark native table in this case, instead of creating
>>>>> a Hive table and failing.
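>>>>> As a sketch of the exception case above (table and column names are
>>>>> hypothetical), a type like TIMESTAMP_NTZ forces a Spark native table,
>>>>> since Hive has no equivalent type:
>>>>>
>>>>> ```sql
>>>>> -- A Spark native table; a Hive SerDe table could not represent
>>>>> -- the TIMESTAMP_NTZ column, so creating it as a Hive table would fail:
>>>>> CREATE TABLE events (id INT, ts TIMESTAMP_NTZ) USING parquet;
>>>>> ```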
>>>>>
>>>>> On Sat, Apr 27, 2024 at 12:46 PM Cheng Pan <pan3...@gmail.com> wrote:
>>>>>
>>>>>> +1 (non-binding)
>>>>>>
>>>>>> Thanks,
>>>>>> Cheng Pan
>>>>>>
>>>>>> On Sat, Apr 27, 2024 at 9:29 AM Holden Karau <holden.ka...@gmail.com>
>>>>>> wrote:
>>>>>> >
>>>>>> > +1
>>>>>> >
>>>>>> > Twitter: https://twitter.com/holdenkarau
>>>>>> > Books (Learning Spark, High Performance Spark, etc.):
>>>>>> https://amzn.to/2MaRAG9
>>>>>> > YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>>>> >
>>>>>> >
>>>>>> > On Fri, Apr 26, 2024 at 12:06 PM L. C. Hsieh <vii...@gmail.com>
>>>>>> wrote:
>>>>>> >>
>>>>>> >> +1
>>>>>> >>
>>>>>> >> On Fri, Apr 26, 2024 at 10:01 AM Dongjoon Hyun <
>>>>>> dongj...@apache.org> wrote:
>>>>>> >> >
>>>>>> >> > I'll start with my +1.
>>>>>> >> >
>>>>>> >> > Dongjoon.
>>>>>> >> >
>>>>>> >> > On 2024/04/26 16:45:51 Dongjoon Hyun wrote:
>>>>>> >> > > Please vote on SPARK-46122 to set
>>>>>> spark.sql.legacy.createHiveTableByDefault
>>>>>> >> > > to `false` by default. The technical scope is defined in the
>>>>>> following PR.
>>>>>> >> > >
>>>>>> >> > > - DISCUSSION:
>>>>>> >> > >
>>>>>> https://lists.apache.org/thread/ylk96fg4lvn6klxhj6t6yh42lyqb8wmd
>>>>>> >> > > - JIRA: https://issues.apache.org/jira/browse/SPARK-46122
>>>>>> >> > > - PR: https://github.com/apache/spark/pull/46207
>>>>>> >> > >
>>>>>> >> > > The vote is open until April 30th 1AM (PST) and passes
>>>>>> >> > > if a majority of +1 PMC votes are cast, with a minimum of 3 +1
>>>>>> votes.
>>>>>> >> > >
>>>>>> >> > > [ ] +1 Set spark.sql.legacy.createHiveTableByDefault to false
>>>>>> by default
>>>>>> >> > > [ ] -1 Do not change spark.sql.legacy.createHiveTableByDefault
>>>>>> because ...
>>>>>> >> > >
>>>>>> >> > > Thank you in advance.
>>>>>> >> > >
>>>>>> >> > > Dongjoon
>>>>>> >> > >
>>>>>> >> >
>>>>>> >> >
>>>>>> ---------------------------------------------------------------------
>>>>>> >> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>> >> >
>>>>>> >>
>>>>>>
