Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

Nimrod Ofek Thu, 25 Apr 2024 08:32:46 -0700

Of course, but it's in memory and not persisted, which is much faster. And
as I said, I believe most of the interaction with it happens during
planning and save, not during the actual query run. Those operations are
short and minimal compared to data fetching and manipulation, so I don't
believe it will have a big impact on the query run...


On Thu, 25 Apr 2024 at 17:52, Mich Talebzadeh <[email protected]> wrote:

> Well, I would be surprised, because the Derby database is single-threaded and
> won't be of much use here.
>
> Most Hive metastores in the commercial world use Postgres or Oracle as the
> backing database; these are battle-proven, replicated and backed up.
>
> Mich Talebzadeh,
> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
> London
> United Kingdom
>
>
>    view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed. It is essential to note
> that, as with any advice, "one test result is worth one thousand
> expert opinions" (Wernher von Braun
> <https://en.wikipedia.org/wiki/Wernher_von_Braun>).
>
>
> On Thu, 25 Apr 2024 at 15:39, Nimrod Ofek <[email protected]> wrote:
>
>> Yes, an in-memory Hive catalog backed by a local Derby DB.
>> And again, I presume that most metadata-related work happens during planning
>> and not during the actual run, so I don't see why it should strongly affect
>> query performance.
>>
>> Thanks,
>>
>>
>> On Thu, 25 Apr 2024 at 17:29, Mich Talebzadeh <[email protected]> wrote:
>>
>>> With regard to your point below
>>>
>>> "The thing I'm missing is this: let's say that the output format I
>>> choose is delta lake or iceberg or whatever format that uses parquet. Where
>>> does the catalog implementation (which holds metadata afaik, same metadata
>>> that iceberg and delta lake save for their tables about their columns)
>>> comes into play and why should it affect performance? "
>>>
>>> The catalog implementation comes into play regardless of the output
>>> format chosen (Delta Lake, Iceberg, Parquet, etc.) because it is
>>> responsible for managing metadata about the datasets, tables, schemas, and
>>> other objects stored in the aforementioned formats. Even though Delta Lake
>>> and Iceberg have their own internal metadata management mechanisms, they
>>> still rely on the catalog to provide a unified interface for accessing and
>>> manipulating metadata across different storage formats.
>>>
>>> "Another thing is that if I understand correctly, and I might be totally
>>> wrong here, the internal spark catalog is a local installation of hive
>>> metastore anyway, so I'm not sure what the catalog has to do with anything"
>>>
>>> I don't understand this. Do you mean a Derby database?
>>>
>>> HTH
>>>
>>>
>>> On Thu, 25 Apr 2024 at 14:38, Nimrod Ofek <[email protected]> wrote:
>>>
>>>> Thanks for the detailed answer.
>>>> The thing I'm missing is this: let's say that the output format I
>>>> choose is delta lake or iceberg or whatever format that uses parquet. Where
>>>> does the catalog implementation (which holds metadata afaik, same metadata
>>>> that iceberg and delta lake save for their tables about their columns)
>>>> comes into play and why should it affect performance?
>>>> Another thing is that if I understand correctly, and I might be totally
>>>> wrong here, the internal spark catalog is a local installation of hive
>>>> metastore anyway, so I'm not sure what the catalog has to do with anything.
>>>>
>>>> Thanks!
>>>>
>>>>
>>>> On Thu, 25 Apr 2024 at 16:14, Mich Talebzadeh <[email protected]> wrote:
>>>>
>>>>> My take on your question is that your mileage may vary, so to
>>>>> speak.
>>>>>
>>>>> 1) Hive provides a more mature and widely adopted catalog solution
>>>>> that integrates well with other components in the Hadoop ecosystem, such
>>>>> as HDFS, HBase, and YARN. If you are Hadoop-centric (say, on-premises),
>>>>> using Hive may offer better compatibility and interoperability.
>>>>> 2) Hive provides a SQL-like interface that is familiar to users who
>>>>> are accustomed to traditional RDBMSs. If your use case involves complex SQL
>>>>> queries or existing SQL-based workflows, using Hive may be advantageous.
>>>>> 3) If you are looking for performance, Spark's native catalog tends to
>>>>> offer better performance for certain workloads, particularly those that
>>>>> involve iterative processing or complex data transformations (my
>>>>> understanding). Spark's in-memory processing capabilities and
>>>>> optimizations make it well suited to interactive analytics and machine
>>>>> learning tasks (my favourite).
>>>>> 4) Integration with Spark workflows: if you primarily use Spark for
>>>>> data processing and analytics, using Spark's native catalog may simplify
>>>>> workflow management and reduce overhead. Spark's tight integration with
>>>>> its catalog allows for seamless interaction with Spark applications and
>>>>> libraries.
>>>>> 5) There seems to be some similarity between the Spark catalog and
>>>>> Databricks Unity Catalog, so that may favour the choice.
>>>>>
>>>>> HTH
>>>>>
>>>>>
>>>>> On Thu, 25 Apr 2024 at 12:30, Nimrod Ofek <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> I will also appreciate some material that describes the differences
>>>>>> between Spark native tables vs hive tables and why each should be used...
>>>>>>
>>>>>> Thanks
>>>>>> Nimrod
>>>>>>
>>>>>> On Thu, 25 Apr 2024 at 14:27, Mich Talebzadeh <[email protected]> wrote:
>>>>>>
>>>>>>> I see the statement below, and I quote:
>>>>>>>
>>>>>>> "The proposal of SPARK-46122 is to switch the default value of this
>>>>>>> configuration from `true` to `false` to use Spark native tables
>>>>>>> because we support better."
>>>>>>>
>>>>>>> Can you please elaborate on the above, specifically with regard to
>>>>>>> the phrase "... because we support better"?
>>>>>>>
>>>>>>> Are you referring to the performance of the Spark catalog (I believe it
>>>>>>> is internal) or to its integration with Spark?
>>>>>>>
>>>>>>> HTH
>>>>>>>
>>>>>>>
>>>>>>> On Thu, 25 Apr 2024 at 11:17, Wenchen Fan <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> +1
>>>>>>>>
>>>>>>>> On Thu, Apr 25, 2024 at 2:46 PM Kent Yao <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> +1
>>>>>>>>>
>>>>>>>>> Nit: the umbrella ticket is SPARK-44111, not SPARK-44444.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Kent Yao
>>>>>>>>>
>>>>>>>>> On Thu, 25 Apr 2024 at 14:39, Dongjoon Hyun <[email protected]> wrote:
>>>>>>>>> >
>>>>>>>>> > Hi, All.
>>>>>>>>> >
>>>>>>>>> > It's great to see community activities to polish 4.0.0 more and more.
>>>>>>>>> > Thank you all.
>>>>>>>>> >
>>>>>>>>> > I'd like to bring SPARK-46122 (another SQL topic) to you from the subtasks
>>>>>>>>> > of SPARK-44444 (Prepare Apache Spark 4.0.0),
>>>>>>>>> >
>>>>>>>>> > - https://issues.apache.org/jira/browse/SPARK-46122
>>>>>>>>> >    Set `spark.sql.legacy.createHiveTableByDefault` to `false` by default
>>>>>>>>> >
>>>>>>>>> > This legacy configuration is about `CREATE TABLE` SQL syntax without
>>>>>>>>> > `USING` and `STORED AS`, which is currently mapped to `Hive` table.
>>>>>>>>> > The proposal of SPARK-46122 is to switch the default value of this
>>>>>>>>> > configuration from `true` to `false` to use Spark native tables because
>>>>>>>>> > we support better.
>>>>>>>>> >
>>>>>>>>> > In other words, Spark will use the value of `spark.sql.sources.default`
>>>>>>>>> > as the table provider instead of `Hive` like the other Spark APIs. Of course,
>>>>>>>>> > the users can get all the legacy behavior by setting back to `true`.
>>>>>>>>> >
>>>>>>>>> > Historically, this behavior change was merged once at Apache Spark 3.0.0
>>>>>>>>> > preparation via SPARK-30098 already, but reverted during the 3.0.0 RC period.
>>>>>>>>> >
>>>>>>>>> > 2019-12-06: SPARK-30098 Use default datasource as provider for CREATE TABLE
>>>>>>>>> > 2020-05-16: SPARK-31707 Revert SPARK-30098 Use default datasource as
>>>>>>>>> >             provider for CREATE TABLE command
>>>>>>>>> >
>>>>>>>>> > At Apache Spark 3.1.0, we had another discussion about this and defined it
>>>>>>>>> > as one of legacy behavior via this configuration via reused ID, SPARK-30098.
>>>>>>>>> >
>>>>>>>>> > 2020-12-01: https://lists.apache.org/thread/8c8k1jk61pzlcosz3mxo4rkj5l23r204
>>>>>>>>> > 2020-12-03: SPARK-30098 Add a configuration to use default datasource as
>>>>>>>>> >             provider for CREATE TABLE command
>>>>>>>>> >
>>>>>>>>> > Last year, we received two additional requests to switch this because
>>>>>>>>> > Apache Spark 4.0.0 is a good time to make a decision for the future direction.
>>>>>>>>> >
>>>>>>>>> > 2023-02-27: SPARK-42603 as an independent idea.
>>>>>>>>> > 2023-11-27: SPARK-46122 as a part of Apache Spark 4.0.0 idea
>>>>>>>>> >
>>>>>>>>> >
>>>>>>>>> > WDYT? The technical scope is defined in the following PR which is one line of main
>>>>>>>>> > code, one line of migration guide, and a few lines of test code.
>>>>>>>>> >
>>>>>>>>> > - https://github.com/apache/spark/pull/46207
>>>>>>>>> >
>>>>>>>>> > Dongjoon.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
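
For readers following the thread, the rule described in the original proposal
(which provider a `CREATE TABLE` statement ends up with) can be sketched in a
few lines of plain Python. This is only a toy model of the behavior as stated
in the proposal, not Spark's implementation: the function name
`resolve_table_provider` and its parameters are hypothetical, and it assumes
`spark.sql.sources.default` is left at `parquet`.

```python
from typing import Optional


def resolve_table_provider(
    using: Optional[str] = None,
    stored_as: Optional[str] = None,
    legacy_create_hive_table_by_default: bool = True,
    sources_default: str = "parquet",
) -> str:
    """Toy model (hypothetical helper, not a Spark API) of the provider
    chosen for a CREATE TABLE statement, as described in the proposal."""
    if using is not None:
        # An explicit USING clause names the provider directly.
        return using.lower()
    if stored_as is not None:
        # STORED AS is Hive syntax, so the result is a Hive table.
        return "hive"
    # Neither clause is present: the legacy flag
    # (spark.sql.legacy.createHiveTableByDefault) decides.
    if legacy_create_hive_table_by_default:
        return "hive"
    # Otherwise fall back to the value of spark.sql.sources.default.
    return sources_default.lower()


# Flag left at true: a plain CREATE TABLE maps to a Hive table.
print(resolve_table_provider(legacy_create_hive_table_by_default=True))
# Flag set to false (the proposal): the default data source is used instead.
print(resolve_table_provider(legacy_create_hive_table_by_default=False))
```

Statements that already spell out `USING` or `STORED AS` are unaffected in this
model; only the bare `CREATE TABLE` form changes with the flag.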
