Well, I will be surprised, because the Derby database is single-threaded and won't be of much use here. Most Hive metastores in the commercial world use PostgreSQL or Oracle as the metastore backend, because those are battle-proven, replicated and backed up.
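For illustration, a minimal sketch of pointing a Hive-enabled Spark session at an external PostgreSQL metastore instead of embedded Derby (the host, database name and credentials are hypothetical, and the PostgreSQL JDBC driver is assumed to be on the classpath):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("external-metastore-demo")
      .enableHiveSupport()
      // Point the Hive metastore client at PostgreSQL instead of embedded Derby.
      .config("spark.hadoop.javax.jdo.option.ConnectionURL",
        "jdbc:postgresql://db-host:5432/metastore")  // hypothetical host/database
      .config("spark.hadoop.javax.jdo.option.ConnectionDriverName", "org.postgresql.Driver")
      .config("spark.hadoop.javax.jdo.option.ConnectionUserName", "hive")    // hypothetical
      .config("spark.hadoop.javax.jdo.option.ConnectionPassword", "secret")  // hypothetical
      .getOrCreate()

The same javax.jdo.option.* settings can equally live in hive-site.xml on the classpath; the spark.hadoop.* prefix simply forwards them into the Hadoop/Hive configuration.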
Mich Talebzadeh,
Technologist | Architect | Data Engineer | Generative AI | FinCrime
London
United Kingdom

view my LinkedIn profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

https://en.everybodywiki.com/Mich_Talebzadeh

*Disclaimer:* The information provided is correct to the best of my knowledge but of course cannot be guaranteed. It is essential to note that, as with any advice, quote "one test result is worth one-thousand expert opinions" (Wernher von Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>).

On Thu, 25 Apr 2024 at 15:39, Nimrod Ofek <ofek.nim...@gmail.com> wrote:

> Yes, an in-memory Hive catalog backed by a local Derby DB.
> And again, I presume that most metadata-related parts happen during planning and not during the actual run, so I don't see why it should strongly affect query performance.
>
> Thanks,
>
> On Thu, 25 Apr 2024 at 17:29, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
>> With regard to your point below:
>>
>> "The thing I'm missing is this: let's say that the output format I choose is Delta Lake or Iceberg or whatever format that uses Parquet. Where does the catalog implementation (which holds metadata afaik, the same metadata that Iceberg and Delta Lake save for their tables about their columns) come into play, and why should it affect performance?"
>>
>> The catalog implementation comes into play regardless of the output format chosen (Delta Lake, Iceberg, Parquet, etc.) because it is responsible for managing metadata about the datasets, tables, schemas and other objects stored in the aforementioned formats. Even though Delta Lake and Iceberg have their own internal metadata management mechanisms, they still rely on the catalog to provide a unified interface for accessing and manipulating metadata across different storage formats.
>>
>> "Another thing is that if I understand correctly, and I might be totally wrong here, the internal Spark catalog is a local installation of Hive metastore anyway, so I'm not sure what the catalog has to do with anything."
>>
>> I don't understand this. Do you mean a Derby database?
>>
>> HTH
>>
>> Mich
>>
>> On Thu, 25 Apr 2024 at 14:38, Nimrod Ofek <ofek.nim...@gmail.com> wrote:
>>
>>> Thanks for the detailed answer.
>>> The thing I'm missing is this: let's say that the output format I choose is Delta Lake or Iceberg or whatever format that uses Parquet. Where does the catalog implementation (which holds metadata afaik, the same metadata that Iceberg and Delta Lake save for their tables about their columns) come into play, and why should it affect performance?
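To make the catalog's role in this exchange concrete, a minimal sketch of registering a table-format catalog through Spark's catalog plugin mechanism (assuming the Apache Iceberg Spark runtime jar is on the classpath; the catalog name `demo`, namespace `db`, table `t` and warehouse path are hypothetical):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("catalog-plugin-demo")
      // Register an Iceberg catalog named "demo"; Spark consults it during
      // analysis/planning to resolve table identifiers and load schemas.
      .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
      .config("spark.sql.catalog.demo.type", "hadoop")                // file-system-based catalog
      .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg-wh")  // hypothetical path
      .getOrCreate()

    // The identifier demo.db.t is resolved through the "demo" catalog at plan
    // time; the data files underneath are still Parquet.
    spark.sql("CREATE TABLE demo.db.t (id BIGINT) USING iceberg")

This is consistent with the point above: the catalog lookup happens once per table during planning, not per row during execution.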
>>> Another thing is that, if I understand correctly, and I might be totally wrong here, the internal Spark catalog is a local installation of Hive metastore anyway, so I'm not sure what the catalog has to do with anything.
>>>
>>> Thanks!
>>>
>>> On Thu, 25 Apr 2024 at 16:14, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>
>>>> My take regarding your question is that your mileage varies, so to speak.
>>>>
>>>> 1) Hive provides a more mature and widely adopted catalog solution that integrates well with other components in the Hadoop ecosystem, such as HDFS, HBase and YARN. If you are Hadoop-centric (say, on-premise), using Hive may offer better compatibility and interoperability.
>>>> 2) Hive provides a SQL-like interface that is familiar to users who are accustomed to traditional RDBMSs. If your use case involves complex SQL queries or existing SQL-based workflows, using Hive may be advantageous.
>>>> 3) If you are looking for performance, Spark's native catalog tends to offer better performance for certain workloads, particularly those that involve iterative processing or complex data transformations (my understanding). Spark's in-memory processing capabilities and optimizations make it well suited for interactive analytics and machine learning tasks (my favourite).
>>>> 4) Integration with Spark workflows: if you primarily use Spark for data processing and analytics, using Spark's native catalog may simplify workflow management and reduce overhead. Spark's tight integration with its catalog allows for seamless interaction with Spark applications and libraries.
>>>> 5) There seems to be some similarity between the Spark catalog and the Databricks Unity Catalog, so that may favour the choice.
>>>>
>>>> HTH
>>>>
>>>> Mich
>>>>
>>>> On Thu, 25 Apr 2024 at 12:30, Nimrod Ofek <ofek.nim...@gmail.com> wrote:
>>>>
>>>>> I would also appreciate some material that describes the differences between Spark native tables and Hive tables, and why each should be used...
>>>>>
>>>>> Thanks,
>>>>> Nimrod
>>>>>
>>>>> On Thu, 25 Apr 2024 at 14:27, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>
>>>>>> I see a statement made as below, and I quote:
>>>>>>
>>>>>> "The proposal of SPARK-46122 is to switch the default value of this configuration from `true` to `false` to use Spark native tables because we support better."
>>>>>>
>>>>>> Can you please elaborate on the above, specifically with regard to the phrase ".. because we support better"?
>>>>>>
>>>>>> Are you referring to the performance of the Spark catalog (I believe it is internal) or to its integration with Spark?
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>> Mich
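As a concrete way to see the distinction between Spark native tables and Hive tables discussed here, one can compare the Provider field reported by DESCRIBE EXTENDED (a sketch, assuming a Hive-enabled spark-shell; the table names are hypothetical):

    // A Spark native (data source) table vs. a Hive SerDe table:
    spark.sql("CREATE TABLE t_native (id INT) USING parquet")
    spark.sql("CREATE TABLE t_hive (id INT) STORED AS PARQUET")

    // The Provider field should read "parquet" for the native table and "hive"
    // for the Hive table (exact output can vary by Spark version).
    spark.sql("DESCRIBE EXTENDED t_native").where("col_name = 'Provider'").show()
    spark.sql("DESCRIBE EXTENDED t_hive").where("col_name = 'Provider'").show()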
>>>>>> On Thu, 25 Apr 2024 at 11:17, Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>>>
>>>>>>> +1
>>>>>>>
>>>>>>> On Thu, Apr 25, 2024 at 2:46 PM Kent Yao <y...@apache.org> wrote:
>>>>>>>
>>>>>>>> +1
>>>>>>>>
>>>>>>>> Nit: the umbrella ticket is SPARK-44111, not SPARK-44444.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Kent Yao
>>>>>>>>
>>>>>>>> On Thu, 25 Apr 2024 at 14:39, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>>>>>> >
>>>>>>>> > Hi, All.
>>>>>>>> >
>>>>>>>> > It's great to see community activities to polish 4.0.0 more and more. Thank you all.
>>>>>>>> >
>>>>>>>> > I'd like to bring SPARK-46122 (another SQL topic) to you from the subtasks of SPARK-44444 (Prepare Apache Spark 4.0.0):
>>>>>>>> >
>>>>>>>> > - https://issues.apache.org/jira/browse/SPARK-46122
>>>>>>>> >   Set `spark.sql.legacy.createHiveTableByDefault` to `false` by default
>>>>>>>> >
>>>>>>>> > This legacy configuration is about the `CREATE TABLE` SQL syntax without `USING` and `STORED AS`, which is currently mapped to a `Hive` table. The proposal of SPARK-46122 is to switch the default value of this configuration from `true` to `false` to use Spark native tables because we support better.
>>>>>>>> >
>>>>>>>> > In other words, Spark will use the value of `spark.sql.sources.default` as the table provider instead of `Hive`, like the other Spark APIs. Of course, users can get all the legacy behavior back by setting it to `true`.
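For illustration, a minimal sketch of this behavior difference (assuming a Hive-enabled spark-shell and the default `spark.sql.sources.default` of `parquet`; the table names are hypothetical):

    // Legacy default: a bare CREATE TABLE (no USING / STORED AS) makes a Hive table.
    spark.sql("SET spark.sql.legacy.createHiveTableByDefault=true")
    spark.sql("CREATE TABLE t_legacy (id INT)")   // Hive text-format table

    // Proposed 4.0.0 default: the provider comes from spark.sql.sources.default.
    spark.sql("SET spark.sql.legacy.createHiveTableByDefault=false")
    spark.sql("CREATE TABLE t_default (id INT)")  // Parquet data source table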
>>>>>>>> > Historically, this behavior change was merged once during Apache Spark 3.0.0 preparation via SPARK-30098, but was reverted during the 3.0.0 RC period.
>>>>>>>> >
>>>>>>>> > 2019-12-06: SPARK-30098 Use default datasource as provider for CREATE TABLE
>>>>>>>> > 2020-05-16: SPARK-31707 Revert SPARK-30098 Use default datasource as provider for CREATE TABLE command
>>>>>>>> >
>>>>>>>> > At Apache Spark 3.1.0, we had another discussion about this and defined it as a legacy behavior behind this configuration, reusing the ID SPARK-30098.
>>>>>>>> >
>>>>>>>> > 2020-12-01: https://lists.apache.org/thread/8c8k1jk61pzlcosz3mxo4rkj5l23r204
>>>>>>>> > 2020-12-03: SPARK-30098 Add a configuration to use default datasource as provider for CREATE TABLE command
>>>>>>>> >
>>>>>>>> > Last year, we received two additional requests to switch this, because Apache Spark 4.0.0 is a good time to make a decision on the future direction.
>>>>>>>> >
>>>>>>>> > 2023-02-27: SPARK-42603 as an independent idea.
>>>>>>>> > 2023-11-27: SPARK-46122 as a part of the Apache Spark 4.0.0 idea.
>>>>>>>> >
>>>>>>>> > WDYT? The technical scope is defined in the following PR, which is one line of main code, one line of migration guide, and a few lines of test code.
>>>>>>>> >
>>>>>>>> > - https://github.com/apache/spark/pull/46207
>>>>>>>> >
>>>>>>>> > Dongjoon.