Thank you, Kent, Wenchen, Mich, Nimrod, Yuming, and Liang-Chi. I'll start a vote.
To Mich, for your question: Apache Spark has a long history of converting Hive-provider tables into Spark's datasource tables in order to handle them better, in a Spark-native way.

> Can you please elaborate on the above specifically with regard to the phrase
> ".. because we support better."

Here is the subset of configurations you can take a look at:

- spark.sql.hive.convertMetastoreParquet (`true` since Spark 1.3.0)
- spark.sql.hive.convertMetastoreOrc (`true` since Spark 2.4.0)
- spark.sql.hive.convertInsertingPartitionedTable (`true` since Spark 3.0.0)
- spark.sql.hive.convertMetastoreInsertDir (`true` since Spark 3.3.0)
- spark.sql.hive.convertInsertingUnpartitionedTable (`true` since Spark 4.0.0)
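For illustration, here is a minimal spark-shell sketch (assuming a session with Hive support enabled; the table name is hypothetical) of what the first of these configurations controls:

  // Create a Hive-provider Parquet table, then compare the physical plans
  // with the metastore conversion switched on and off.
  spark.sql("CREATE TABLE demo_hive_parquet (id INT) STORED AS PARQUET")

  spark.conf.set("spark.sql.hive.convertMetastoreParquet", "true")
  spark.sql("SELECT * FROM demo_hive_parquet").explain()
  // expected: Spark's native datasource reader, e.g. a "FileScan parquet" node

  spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")
  spark.sql("SELECT * FROM demo_hive_parquet").explain()
  // expected: the Hive SerDe read path, e.g. a "Scan hive" node

The other configurations extend the same conversion idea to ORC and to the various INSERT paths.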
Dongjoon.

On Fri, Apr 26, 2024 at 12:24 AM L. C. Hsieh <vii...@gmail.com> wrote:

> +1
>
> On Thu, Apr 25, 2024 at 8:16 PM Yuming Wang <yumw...@apache.org> wrote:
>
>> +1
>>
>> On Fri, Apr 26, 2024 at 8:25 AM Nimrod Ofek <ofek.nim...@gmail.com> wrote:
>>
>>> Of course. I can't think of a scenario with thousands of tables on a single in-memory Spark cluster with an in-memory catalog.
>>> Thanks for the help!
>>>
>>> On Thu, Apr 25, 2024 at 23:56, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>
>>>> Agreed. In scenarios where most of the interactions with the catalog are related to query planning, saving, and metadata management, the choice of catalog implementation may have less impact on query runtime performance. This is because the time spent on metadata operations is generally minimal compared to the time spent on actual data fetching, processing, and computation.
>>>>
>>>> However, we should also consider scalability and reliability concerns, especially as the size and complexity of the data and the query workload grow. While an in-memory catalog may offer excellent performance for smaller workloads, it will face limitations in handling larger-scale deployments with thousands of tables, partitions, and users. Additionally, durability and persistence are crucial considerations, particularly in production environments where data integrity and availability matter. In-memory catalog implementations may lack durability, meaning that metadata changes could be lost in the event of a system failure or restart. Therefore, while in-memory catalog implementations can provide speed and efficiency for certain use cases, we ought to consider the requirements for scalability, reliability, and data durability when choosing a catalog solution for production deployments. In many cases, a combination of in-memory and disk-based catalog solutions may offer the best balance of performance and resilience for demanding large-scale workloads.
>>>>
>>>> HTH
>>>>
>>>> Mich Talebzadeh,
>>>> Technologist | Architect | Data Engineer | Generative AI | FinCrime
>>>> London, United Kingdom
>>>>
>>>> View my LinkedIn profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>> https://en.everybodywiki.com/Mich_Talebzadeh
>>>>
>>>> *Disclaimer:* The information provided is correct to the best of my knowledge but of course cannot be guaranteed. It is essential to note that, as with any advice, "one test result is worth one-thousand expert opinions" (Wernher von Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>).
>>>>
>>>> On Thu, 25 Apr 2024 at 16:32, Nimrod Ofek <ofek.nim...@gmail.com> wrote:
>>>>
>>>>> Of course, but it is in memory and not persisted, which is much faster. And, as I said, I believe most of the interaction with it happens during planning and save rather than during the actual query run, and those parts are short and minimal compared to data fetching and manipulation, so I don't believe it will have a big impact on query runtime...
>>>>>
>>>>> On Thu, Apr 25, 2024 at 17:52, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>
>>>>>> Well, I would be surprised, because the Derby database is single-threaded and won't be of much use here.
>>>>>>
>>>>>> Most Hive metastores in the commercial world utilise PostgreSQL or Oracle as the backing store; these are battle-proven, replicated, and backed up.
>>>>>>
>>>>>> Mich Talebzadeh,
>>>>>> Technologist | Architect | Data Engineer | Generative AI | FinCrime
>>>>>> London, United Kingdom
>>>>>>
>>>>>> On Thu, 25 Apr 2024 at 15:39, Nimrod Ofek <ofek.nim...@gmail.com> wrote:
>>>>>>
>>>>>>> Yes, an in-memory Hive catalog backed by a local Derby DB.
>>>>>>> And again, I presume that most metadata-related work happens during planning and not during the actual run, so I don't see why it should strongly affect query performance.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> On Thu, Apr 25, 2024 at 17:29, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>>
>>>>>>>> With regard to your point below:
>>>>>>>>
>>>>>>>> "The thing I'm missing is this: let's say that the output format I choose is Delta Lake or Iceberg or whatever format that uses Parquet. Where does the catalog implementation (which holds metadata afaik, the same metadata that Iceberg and Delta Lake save for their tables about their columns) come into play, and why should it affect performance?"
>>>>>>>>
>>>>>>>> The catalog implementation comes into play regardless of the output format chosen (Delta Lake, Iceberg, Parquet, etc.) because it is responsible for managing metadata about the datasets, tables, schemas, and other objects stored in those formats. Even though Delta Lake and Iceberg have their own internal metadata management mechanisms, they still rely on the catalog to provide a unified interface for accessing and manipulating metadata across different storage formats.
>>>>>>>>
>>>>>>>> "Another thing is that if I understand correctly, and I might be totally wrong here, the internal Spark catalog is a local installation of Hive metastore anyway, so I'm not sure what the catalog has to do with anything."
>>>>>>>>
>>>>>>>> I don't understand this. Do you mean a Derby database?
>>>>>>>>
>>>>>>>> HTH
>>>>>>>>
>>>>>>>> Mich Talebzadeh,
>>>>>>>> Technologist | Architect | Data Engineer | Generative AI | FinCrime
>>>>>>>> London, United Kingdom
>>>>>>>>
>>>>>>>> On Thu, 25 Apr 2024 at 14:38, Nimrod Ofek <ofek.nim...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Thanks for the detailed answer.
>>>>>>>>> The thing I'm missing is this: let's say that the output format I choose is Delta Lake or Iceberg or whatever format that uses Parquet. Where does the catalog implementation (which holds metadata afaik, the same metadata that Iceberg and Delta Lake save for their tables about their columns) come into play, and why should it affect performance?
>>>>>>>>> Another thing is that if I understand correctly, and I might be totally wrong here, the internal Spark catalog is a local installation of Hive metastore anyway, so I'm not sure what the catalog has to do with anything.
>>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>>
>>>>>>>>> On Thu, Apr 25, 2024 at 16:14, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> My take regarding your question is that your mileage varies, so to speak.
>>>>>>>>>>
>>>>>>>>>> 1) Hive provides a more mature and widely adopted catalog solution that integrates well with other components in the Hadoop ecosystem, such as HDFS, HBase, and YARN. If you are Hadoop-centric (say, on-premise), using Hive may offer better compatibility and interoperability.
>>>>>>>>>> 2) Hive provides a SQL-like interface that is familiar to users who are accustomed to traditional RDBMSs. If your use case involves complex SQL queries or existing SQL-based workflows, using Hive may be advantageous.
>>>>>>>>>> 3) If you are looking for performance, Spark's native catalog tends to offer better performance for certain workloads, particularly those that involve iterative processing or complex data transformations (my understanding). Spark's in-memory processing capabilities and optimizations make it well-suited for interactive analytics and machine learning tasks (my favourite).
>>>>>>>>>> 4) Integration with Spark workflows: if you primarily use Spark for data processing and analytics, using Spark's native catalog may simplify workflow management and reduce overhead. Spark's tight integration with its catalog allows for seamless interaction with Spark applications and libraries (see the sketch below).
>>>>>>>>>> 5) There seems to be some similarity between the Spark catalog and the Databricks Unity Catalog, so that may favour the choice.
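>>>>>>>>>>
>>>>>>>>>> As a small illustration of point 4, the session-level Catalog API can be used directly (a minimal sketch; the database and table names are hypothetical):
>>>>>>>>>>
>>>>>>>>>>   // Inspect metadata through Spark's built-in Catalog API.
>>>>>>>>>>   spark.catalog.listDatabases().show()
>>>>>>>>>>   spark.catalog.listTables("default").show()
>>>>>>>>>>   spark.catalog.listColumns("default", "demo_hive_parquet").show()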
>>>>>>>>>> HTH
>>>>>>>>>>
>>>>>>>>>> Mich Talebzadeh,
>>>>>>>>>> Technologist | Architect | Data Engineer | Generative AI | FinCrime
>>>>>>>>>> London, United Kingdom
>>>>>>>>>>
>>>>>>>>>> On Thu, 25 Apr 2024 at 12:30, Nimrod Ofek <ofek.nim...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> I would also appreciate some material that describes the differences between Spark native tables and Hive tables and why each should be used...
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Nimrod
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Apr 25, 2024 at 14:27, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I see a statement made as below, and I quote:
>>>>>>>>>>>>
>>>>>>>>>>>> "The proposal of SPARK-46122 is to switch the default value of this configuration from `true` to `false` to use Spark native tables because we support better."
>>>>>>>>>>>>
>>>>>>>>>>>> Can you please elaborate on the above specifically with regard to the phrase ".. because we support better."
>>>>>>>>>>>>
>>>>>>>>>>>> Are you referring to the performance of the Spark catalog (I believe it is internal) or to integration with Spark?
>>>>>>>>>>>>
>>>>>>>>>>>> HTH
>>>>>>>>>>>>
>>>>>>>>>>>> Mich Talebzadeh,
>>>>>>>>>>>> Technologist | Architect | Data Engineer | Generative AI | FinCrime
>>>>>>>>>>>> London, United Kingdom
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, 25 Apr 2024 at 11:17, Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> +1
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Apr 25, 2024 at 2:46 PM Kent Yao <y...@apache.org> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> +1
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Nit: the umbrella ticket is SPARK-44111, not SPARK-44444.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Kent Yao
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Apr 25, 2024 at 14:39, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > Hi, All.
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > It's great to see community activities to polish 4.0.0 more and more. Thank you all.
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > I'd like to bring SPARK-46122 (another SQL topic) to you from the subtasks of SPARK-44444 (Prepare Apache Spark 4.0.0).
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > - https://issues.apache.org/jira/browse/SPARK-46122
>>>>>>>>>>>>>> >   Set `spark.sql.legacy.createHiveTableByDefault` to `false` by default
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > This legacy configuration is about the `CREATE TABLE` SQL syntax without `USING` and `STORED AS`, which is currently mapped to a `Hive` table. The proposal of SPARK-46122 is to switch the default value of this configuration from `true` to `false` to use Spark native tables because we support better.
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > In other words, Spark will use the value of `spark.sql.sources.default` as the table provider instead of `Hive`, like the other Spark APIs. Of course, users can get all the legacy behavior back by setting it to `true`.
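>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > For example, a minimal sketch of the behavior change (illustrative table names; this assumes the configuration can be toggled at the session level):
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> >   spark.conf.set("spark.sql.legacy.createHiveTableByDefault", "true")
>>>>>>>>>>>>>> >   spark.sql("CREATE TABLE t1 (id INT)")  // today's default: a Hive-provider table
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> >   spark.conf.set("spark.sql.legacy.createHiveTableByDefault", "false")
>>>>>>>>>>>>>> >   spark.sql("CREATE TABLE t2 (id INT)")  // provider follows spark.sql.sources.default
>>>>>>>>>>>>>> >   spark.sql("DESCRIBE TABLE EXTENDED t2").show(100, false)  // Provider: parquet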
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > Historically, this behavior change was merged once already during the Apache Spark 3.0.0 preparation via SPARK-30098, but it was reverted during the 3.0.0 RC period.
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > 2019-12-06: SPARK-30098 Use default datasource as provider for CREATE TABLE
>>>>>>>>>>>>>> > 2020-05-16: SPARK-31707 Revert SPARK-30098 Use default datasource as provider for CREATE TABLE command
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > At Apache Spark 3.1.0, we had another discussion about this and defined it as a legacy behavior behind this configuration, via the reused ID SPARK-30098.
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > 2020-12-01: https://lists.apache.org/thread/8c8k1jk61pzlcosz3mxo4rkj5l23r204
>>>>>>>>>>>>>> > 2020-12-03: SPARK-30098 Add a configuration to use default datasource as provider for CREATE TABLE command
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > Last year, we received two additional requests to switch this, because Apache Spark 4.0.0 is a good time to make a decision for the future direction.
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > 2023-02-27: SPARK-42603 as an independent idea.
>>>>>>>>>>>>>> > 2023-11-27: SPARK-46122 as a part of the Apache Spark 4.0.0 idea.
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > WDYT? The technical scope is defined in the following PR, which is one line of main code, one line of migration guide, and a few lines of test code.
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > - https://github.com/apache/spark/pull/46207
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > Dongjoon.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org