+1

On Thu, Apr 25, 2024 at 8:16 PM Yuming Wang <yumw...@apache.org> wrote:
> +1
>
> On Fri, Apr 26, 2024 at 8:25 AM Nimrod Ofek <ofek.nim...@gmail.com> wrote:
>
>> Of course, I can't think of a scenario with thousands of tables on a
>> single in-memory Spark cluster with an in-memory catalog.
>> Thanks for the help!
>>
>> On Thu, Apr 25, 2024 at 23:56, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>>> Agreed. In scenarios where most of the interactions with the catalog are
>>> related to query planning, saving, and metadata management, the choice of
>>> catalog implementation may have less impact on query runtime performance,
>>> because the time spent on metadata operations is generally minimal
>>> compared to the time spent on actual data fetching, processing, and
>>> computation.
>>>
>>> However, scalability and reliability become real concerns as the size
>>> and complexity of the data and query workload grow. While an in-memory
>>> catalog may offer excellent performance for smaller workloads, it will
>>> face limitations in handling larger-scale deployments with thousands of
>>> tables, partitions, and users. Durability and persistence are also
>>> crucial considerations, particularly in production environments where
>>> data integrity and availability matter. In-memory catalog
>>> implementations may lack durability, meaning that metadata changes could
>>> be lost in the event of a system failure or restart. Therefore, while
>>> in-memory catalog implementations can provide speed and efficiency for
>>> certain use cases, we ought to consider the requirements for
>>> scalability, reliability, and data durability when choosing a catalog
>>> solution for production deployments. In many cases, a combination of
>>> in-memory and disk-based catalog solutions may offer the best balance of
>>> performance and resilience for demanding large-scale workloads.
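[Editor's note on the durability point above: the usual way to make the catalog persistent is to back the Hive metastore with an external RDBMS. A minimal hive-site.xml sketch, assuming a PostgreSQL instance at a placeholder host `metastore-db`; the property names are standard Hive metastore settings, while the URL and credentials are illustrative.]

```xml
<configuration>
  <!-- JDBC connection for a persistent, PostgreSQL-backed metastore -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:postgresql://metastore-db:5432/hive_metastore</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.postgresql.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>changeme</value>
  </property>
</configuration>
```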
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Technologist | Architect | Data Engineer | Generative AI | FinCrime
>>> London
>>> United Kingdom
>>>
>>> view my LinkedIn profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>> https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>> *Disclaimer:* The information provided is correct to the best of my
>>> knowledge but of course cannot be guaranteed. It is essential to note
>>> that, as with any advice, "one test result is worth one-thousand expert
>>> opinions" (Wernher von Braun
>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>).
>>>
>>> On Thu, 25 Apr 2024 at 16:32, Nimrod Ofek <ofek.nim...@gmail.com> wrote:
>>>
>>>> Of course, but it's in memory and not persisted, which is much faster.
>>>> And as I said, I believe that most of the interaction with it happens
>>>> during planning and save rather than during the actual query run, and
>>>> those operations are short and minimal compared to data fetching and
>>>> manipulation, so I don't believe it will have a big impact on query
>>>> run...
>>>>
>>>> On Thu, Apr 25, 2024 at 17:52, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>
>>>>> Well, I would be surprised, because the Derby database is
>>>>> single-threaded and won't be of much use here.
>>>>>
>>>>> Most Hive metastores in the commercial world use PostgreSQL or Oracle,
>>>>> which are battle-proven, replicated, and backed up.
>>>>>
>>>>> On Thu, 25 Apr 2024 at 15:39, Nimrod Ofek <ofek.nim...@gmail.com> wrote:
>>>>>
>>>>>> Yes, an in-memory Hive catalog backed by a local Derby DB.
>>>>>> And again, I presume that most metadata-related work happens during
>>>>>> planning and not the actual run, so I don't see why it should
>>>>>> strongly affect query performance.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> On Thu, Apr 25, 2024 at 17:29, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>
>>>>>>> With regard to your point below:
>>>>>>>
>>>>>>> "The thing I'm missing is this: let's say that the output format I
>>>>>>> choose is delta lake or iceberg or whatever format that uses
>>>>>>> parquet. Where does the catalog implementation (which holds metadata
>>>>>>> afaik, same metadata that iceberg and delta lake save for their
>>>>>>> tables about their columns) comes into play and why should it affect
>>>>>>> performance?"
>>>>>>>
>>>>>>> The catalog implementation comes into play regardless of the output
>>>>>>> format chosen (Delta Lake, Iceberg, Parquet, etc.) because it is
>>>>>>> responsible for managing metadata about the datasets, tables,
>>>>>>> schemas, and other objects stored in the aforementioned formats.
>>>>>>> Even though Delta Lake and Iceberg have their own internal metadata
>>>>>>> management mechanisms, they still rely on the catalog to provide a
>>>>>>> unified interface for accessing and manipulating metadata across
>>>>>>> different storage formats.
>>>>>>>
>>>>>>> "Another thing is that if I understand correctly, and I might be
>>>>>>> totally wrong here, the internal spark catalog is a local
>>>>>>> installation of hive metastore anyway, so I'm not sure what the
>>>>>>> catalog has to do with anything"
>>>>>>>
>>>>>>> I don't understand this. Do you mean a Derby database?
>>>>>>>
>>>>>>> HTH
>>>>>>>
>>>>>>> On Thu, 25 Apr 2024 at 14:38, Nimrod Ofek <ofek.nim...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Thanks for the detailed answer.
>>>>>>>> The thing I'm missing is this: let's say that the output format I
>>>>>>>> choose is delta lake or iceberg or whatever format that uses
>>>>>>>> parquet. Where does the catalog implementation (which holds
>>>>>>>> metadata afaik, same metadata that iceberg and delta lake save for
>>>>>>>> their tables about their columns) come into play, and why should it
>>>>>>>> affect performance?
>>>>>>>> Another thing is that, if I understand correctly, and I might be
>>>>>>>> totally wrong here, the internal Spark catalog is a local
>>>>>>>> installation of the Hive metastore anyway, so I'm not sure what the
>>>>>>>> catalog has to do with anything.
>>>>>>>>
>>>>>>>> Thanks!
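[Editor's note for the exchange above: which catalog implementation a session uses is controlled by Spark's `spark.sql.catalogImplementation` static setting (`in-memory` or `hive`). A sketch of the two launch modes, assuming a local Spark installation; the warehouse path is illustrative.]

```
# Embedded in-memory catalog: nothing persisted across restarts
spark-shell --conf spark.sql.catalogImplementation=in-memory

# Hive catalog: with no hive-site.xml on the classpath, this falls back
# to a local Derby-backed metastore created in the working directory
spark-shell --conf spark.sql.catalogImplementation=hive \
  --conf spark.sql.warehouse.dir=/tmp/spark-warehouse
```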
>>>>>>>>
>>>>>>>> On Thu, Apr 25, 2024 at 16:14, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> My take regarding your question is that your mileage varies, so to
>>>>>>>>> speak.
>>>>>>>>>
>>>>>>>>> 1) Hive provides a more mature and widely adopted catalog solution
>>>>>>>>> that integrates well with other components in the Hadoop
>>>>>>>>> ecosystem, such as HDFS, HBase, and YARN. If you are Hadoop-centric
>>>>>>>>> (say on-premise), using Hive may offer better compatibility and
>>>>>>>>> interoperability.
>>>>>>>>> 2) Hive provides a SQL-like interface that is familiar to users
>>>>>>>>> who are accustomed to traditional RDBMSs. If your use case
>>>>>>>>> involves complex SQL queries or existing SQL-based workflows,
>>>>>>>>> using Hive may be advantageous.
>>>>>>>>> 3) If you are looking for performance, Spark's native catalog
>>>>>>>>> tends to offer better performance for certain workloads,
>>>>>>>>> particularly those that involve iterative processing or complex
>>>>>>>>> data transformations (my understanding). Spark's in-memory
>>>>>>>>> processing capabilities and optimizations make it well suited for
>>>>>>>>> interactive analytics and machine learning tasks (my favourite).
>>>>>>>>> 4) Integration with Spark workflows: if you primarily use Spark
>>>>>>>>> for data processing and analytics, using Spark's native catalog
>>>>>>>>> may simplify workflow management and reduce overhead. Spark's
>>>>>>>>> tight integration with its catalog allows for seamless interaction
>>>>>>>>> with Spark applications and libraries.
>>>>>>>>> 5) There seems to be some similarity between the Spark catalog and
>>>>>>>>> the Databricks Unity Catalog, so that may favour the choice.
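[Editor's note, to make the table-flavour distinction in the list above concrete: the provider can always be chosen explicitly in Spark SQL, independent of any default. A sketch with illustrative table names.]

```sql
-- Spark native (data source) table: provider named with USING
CREATE TABLE sales_native (id INT, amount DOUBLE) USING parquet;

-- Hive table: the Hive SerDe path, format named with STORED AS
CREATE TABLE sales_hive (id INT, amount DOUBLE) STORED AS PARQUET;
```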
>>>>>>>>>
>>>>>>>>> HTH
>>>>>>>>>
>>>>>>>>> On Thu, 25 Apr 2024 at 12:30, Nimrod Ofek <ofek.nim...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> I would also appreciate some material that describes the
>>>>>>>>>> differences between Spark native tables and Hive tables and why
>>>>>>>>>> each should be used...
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Nimrod
>>>>>>>>>>
>>>>>>>>>> On Thu, Apr 25, 2024 at 14:27, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> I see a statement made as below, and I quote:
>>>>>>>>>>>
>>>>>>>>>>> "The proposal of SPARK-46122 is to switch the default value of
>>>>>>>>>>> this configuration from `true` to `false` to use Spark native
>>>>>>>>>>> tables because we support better."
>>>>>>>>>>>
>>>>>>>>>>> Can you please elaborate on the above, specifically with regard
>>>>>>>>>>> to the phrase "... because we support better"?
>>>>>>>>>>>
>>>>>>>>>>> Are you referring to the performance of the Spark catalog (I
>>>>>>>>>>> believe it is internal) or integration with Spark?
>>>>>>>>>>>
>>>>>>>>>>> HTH
>>>>>>>>>>>
>>>>>>>>>>> On Thu, 25 Apr 2024 at 11:17, Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> +1
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Apr 25, 2024 at 2:46 PM Kent Yao <y...@apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> +1
>>>>>>>>>>>>>
>>>>>>>>>>>>> Nit: the umbrella ticket is SPARK-44111, not SPARK-44444.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Kent Yao
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Apr 25, 2024 at 14:39, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > Hi, All.
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > It's great to see community activities to polish 4.0.0 more
>>>>>>>>>>>>> > and more. Thank you all.
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > I'd like to bring SPARK-46122 (another SQL topic) to you
>>>>>>>>>>>>> > from the subtasks of SPARK-44444 (Prepare Apache Spark
>>>>>>>>>>>>> > 4.0.0):
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > - https://issues.apache.org/jira/browse/SPARK-46122
>>>>>>>>>>>>> >   Set `spark.sql.legacy.createHiveTableByDefault` to
>>>>>>>>>>>>> >   `false` by default
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > This legacy configuration is about the `CREATE TABLE` SQL
>>>>>>>>>>>>> > syntax without `USING` and `STORED AS`, which is currently
>>>>>>>>>>>>> > mapped to a `Hive` table. The proposal of SPARK-46122 is to
>>>>>>>>>>>>> > switch the default value of this configuration from `true`
>>>>>>>>>>>>> > to `false` to use Spark native tables because we support
>>>>>>>>>>>>> > better.
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > In other words, Spark will use the value of
>>>>>>>>>>>>> > `spark.sql.sources.default` as the table provider instead
>>>>>>>>>>>>> > of `Hive`, like the other Spark APIs. Of course, users can
>>>>>>>>>>>>> > get all the legacy behavior back by setting it to `true`.
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > Historically, this behavior change was merged once during
>>>>>>>>>>>>> > Apache Spark 3.0.0 preparation via SPARK-30098, but was
>>>>>>>>>>>>> > reverted during the 3.0.0 RC period.
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > 2019-12-06: SPARK-30098 Use default datasource as provider
>>>>>>>>>>>>> > for CREATE TABLE
>>>>>>>>>>>>> > 2020-05-16: SPARK-31707 Revert SPARK-30098 Use default
>>>>>>>>>>>>> > datasource as provider for CREATE TABLE command
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > At Apache Spark 3.1.0, we had another discussion about this
>>>>>>>>>>>>> > and defined it as one of the legacy behaviors via this
>>>>>>>>>>>>> > configuration, under the reused ID SPARK-30098.
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > 2020-12-01: https://lists.apache.org/thread/8c8k1jk61pzlcosz3mxo4rkj5l23r204
>>>>>>>>>>>>> > 2020-12-03: SPARK-30098 Add a configuration to use default
>>>>>>>>>>>>> > datasource as provider for CREATE TABLE command
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > Last year, we received two additional requests to switch
>>>>>>>>>>>>> > this, because Apache Spark 4.0.0 is a good time to make a
>>>>>>>>>>>>> > decision for the future direction.
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > 2023-02-27: SPARK-42603 as an independent idea.
>>>>>>>>>>>>> > 2023-11-27: SPARK-46122 as a part of the Apache Spark 4.0.0
>>>>>>>>>>>>> > idea.
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > WDYT? The technical scope is defined in the following PR,
>>>>>>>>>>>>> > which is one line of main code, one line of migration
>>>>>>>>>>>>> > guide, and a few lines of test code.
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > - https://github.com/apache/spark/pull/46207
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > Dongjoon.
>>>>>>>>>>>>>
>>>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
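[Editor's note, sketching the `CREATE TABLE` behaviour change described in the proposal above; the table name is illustrative, and the native-table provider comes from `spark.sql.sources.default`, which is `parquet` out of the box.]

```sql
-- No USING and no STORED AS clause:
CREATE TABLE t (id INT);
--   spark.sql.legacy.createHiveTableByDefault=true  -> Hive table (old default)
--   spark.sql.legacy.createHiveTableByDefault=false -> Spark native table,
--     with the provider taken from spark.sql.sources.default

-- Users who want the legacy behaviour can set it back explicitly:
SET spark.sql.legacy.createHiveTableByDefault=true;
```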