Re: [VOTE] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

Ye Xianjin Tue, 30 Apr 2024 00:30:44 -0700

Sent from my iPhone

On Apr 30, 2024, at 3:23 PM, DB Tsai <[email protected]> wrote:

+1

On Apr 29, 2024, at 8:01 PM, Wenchen Fan <[email protected]> wrote:

To add more color:

Spark data source tables and Hive SerDe tables are both stored in the Hive metastore and keep their data files in the table directory. The only difference is the "table provider", which determines which reader/writer Spark uses. Ideally the Spark native data source reader/writer is faster than the Hive SerDe one.

What's more, the default format of Hive Serde is text. I don't think people want to use text format tables in production. Most people will add `STORED AS parquet` or `USING parquet` explicitly. By setting this config to false, we have a more reasonable default behavior: creating Parquet tables (or whatever is specified by `spark.sql.sources.default`).
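
For illustration only, the choice described above can be modeled as a small Python function. This is a toy sketch, not Spark's implementation: the function name, the parameter names and the "hive-serde"/"spark-datasource" labels are made up for the example, and special cases such as `USING hive` are not modeled.

```python
from typing import Optional, Tuple


def resolve_table_provider(
    using: Optional[str] = None,
    stored_as: Optional[str] = None,
    legacy_create_hive_table_by_default: bool = True,
    sources_default: str = "parquet",
) -> Tuple[str, str]:
    """Return (table kind, file format) for a CREATE TABLE statement.

    Hypothetical helper: it mirrors the behavior described in this thread
    and does not call into Spark.
    """
    if using is not None:
        # CREATE TABLE ... USING <fmt>: Spark native data source table.
        return ("spark-datasource", using.lower())
    if stored_as is not None:
        # CREATE TABLE ... STORED AS <fmt>: Hive SerDe table in that format.
        return ("hive-serde", stored_as.lower())
    if legacy_create_hive_table_by_default:
        # Legacy default: a Hive SerDe table, whose default format is text.
        return ("hive-serde", "textfile")
    # Proposed default: whatever spark.sql.sources.default names.
    return ("spark-datasource", sources_default.lower())
```

With the legacy flag left at true, a bare `CREATE TABLE` maps to a Hive text table in this model; with the flag set to false it maps to a Spark table in the `spark.sql.sources.default` format. Statements that already say `USING parquet` or `STORED AS parquet` come out the same either way.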

On Tue, Apr 30, 2024 at 10:45 AM Wenchen Fan <[email protected]> wrote:
@Mich Talebzadeh there seems to be a misunderstanding here. The Spark native data source table is still stored in the Hive metastore; Spark just uses a different (and faster) reader/writer for it. `hive-site.xml` should work as it does today.

On Tue, Apr 30, 2024 at 5:23 AM Hyukjin Kwon <[email protected]> wrote:
+1

It's a legacy conf that we should eventually remove. Spark should create Spark tables by default, not Hive tables.

Mich, for your workload, you can simply switch that conf back to its old value if it concerns you. We also enabled ANSI (which you agreed on). It's a bit awkward to stop in the middle of making Spark sound for this compatibility reason. The compatibility has been tested in production for a long time, so I don't see any particular issue with the compatibility case you mentioned.

On Mon, Apr 29, 2024 at 2:08 AM Mich Talebzadeh <[email protected]> wrote:

Hi @Wenchen Fan

Thanks for your response. I believe we have not had enough time to "DISCUSS" this matter.

Currently, in order to make Spark take advantage of Hive, I create a soft link in $SPARK_HOME/conf. FYI, my Spark version is 3.4.0 and Hive is 3.1.1.

/opt/spark/conf/hive-site.xml -> /data6/hduser/hive-3.1.1/conf/hive-site.xml

This works fine for me in my lab. So in the future, if we set "spark.sql.legacy.createHiveTableByDefault" to false, will there no longer be a need for this soft link?
On the face of it, this looks fine, but in real life it may require a number of changes to old scripts. Hence my concern.
As a matter of interest, has anyone liaised with the Hive team to ensure they have introduced the additional changes you outlined?

HTH

Mich Talebzadeh,
Technologist | Architect | Data Engineer | Generative AI | FinCrime
London
United Kingdom

https://en.everybodywiki.com/Mich_Talebzadeh

Disclaimer: The information provided is correct to the best of my knowledge but of course cannot be guaranteed. It is essential to note that, as with any advice, "one test result is worth one-thousand expert opinions" (Wernher von Braun).

On Sun, 28 Apr 2024 at 09:34, Wenchen Fan <[email protected]> wrote:
@Mich Talebzadeh thanks for sharing your concern!

Note: creating Spark native data source tables is usually Hive compatible as well, unless we use features that Hive does not support (TIMESTAMP NTZ, ANSI INTERVAL, etc.). I think it's a better default to create a Spark native table in this case, instead of creating a Hive table and failing.

On Sat, Apr 27, 2024 at 12:46 PM Cheng Pan <[email protected]> wrote:
+1 (non-binding)

Thanks,
Cheng Pan

On Sat, Apr 27, 2024 at 9:29 AM Holden Karau <[email protected]> wrote:
>
> +1
>
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
>
> On Fri, Apr 26, 2024 at 12:06 PM L. C. Hsieh <[email protected]> wrote:
>>
>> +1
>>
>> On Fri, Apr 26, 2024 at 10:01 AM Dongjoon Hyun <[email protected]> wrote:
>> >
>> > I'll start with my +1.
>> >
>> > Dongjoon.
>> >
>> > On 2024/04/26 16:45:51 Dongjoon Hyun wrote:
>> > > Please vote on SPARK-46122 to set spark.sql.legacy.createHiveTableByDefault
>> > > to `false` by default. The technical scope is defined in the following PR.
>> > >
>> > > - DISCUSSION:
>> > > https://lists.apache.org/thread/ylk96fg4lvn6klxhj6t6yh42lyqb8wmd
>> > > - JIRA: https://issues.apache.org/jira/browse/SPARK-46122
>> > > - PR: https://github.com/apache/spark/pull/46207
>> > >
>> > > The vote is open until April 30th 1AM (PST) and passes
>> > > if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes.
>> > >
>> > > [ ] +1 Set spark.sql.legacy.createHiveTableByDefault to false by default
>> > > [ ] -1 Do not change spark.sql.legacy.createHiveTableByDefault because ...
>> > >
>> > > Thank you in advance.
>> > >
>> > > Dongjoon
>> > >
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe e-mail: [email protected]
>> >
>>
