With regard to your point below:

"The thing I'm missing is this: let's say that the output format I choose
is delta lake or iceberg or whatever format that uses parquet. Where does
the catalog implementation (which holds metadata afaik, same metadata that
iceberg and delta lake save for their tables about their columns) comes
into play and why should it affect performance? "

The catalog implementation comes into play regardless of the output format
chosen (Delta Lake, Iceberg, plain Parquet, etc.) because it is responsible
for managing metadata about the datasets, tables, schemas, and other
objects stored in those formats. Even though Delta Lake and Iceberg have
their own internal metadata management mechanisms, they still rely on the
catalog to provide a unified interface for accessing and manipulating
metadata across different storage formats.
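
To make this concrete, here is a minimal sketch in Scala (assuming Spark
3.x with the Iceberg Spark runtime jar on the classpath; the catalog name
`demo` and the warehouse path are purely illustrative):

    import org.apache.spark.sql.SparkSession

    // The table format (Iceberg here) and the catalog are configured
    // independently. Every table identifier in a query is resolved through
    // the catalog first, before any Parquet data file or Iceberg metadata
    // file is touched.
    val spark = SparkSession.builder()
      .appName("catalog-demo")
      .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
      .config("spark.sql.catalog.demo.type", "hadoop")
      .config("spark.sql.catalog.demo.warehouse", "/tmp/demo-warehouse")
      .getOrCreate()

    // The data lands as Parquet files plus Iceberg metadata, but resolving
    // `demo.db.events` (does it exist, what is its schema, which snapshot
    // is current) goes through the catalog on every query.
    spark.sql("CREATE TABLE demo.db.events (id BIGINT, ts TIMESTAMP) USING iceberg")
    spark.sql("INSERT INTO demo.db.events VALUES (1, current_timestamp())")
    spark.sql("SELECT * FROM demo.db.events").show()

This is also where performance enters the picture: query planning begins
with catalog lookups (table resolution, schema and, for Hive-style tables,
partition listings), so a slow or remote metastore shows up directly as
planning latency, whereas the format's own metadata (the Delta log,
Iceberg manifests) is only consulted afterwards during scan planning.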

"Another thing is that if I understand correctly, and I might be totally
wrong here, the internal spark catalog is a local installation of hive
metastore anyway, so I'm not sure what the catalog has to do with anything"

I don't quite follow this. Do you mean the embedded Derby database?
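
For context, and as a minimal sketch (assuming a Spark build with Hive
support on the classpath and no hive-site.xml pointing at a remote
metastore): with enableHiveSupport(), Spark starts an embedded Hive
metastore backed by a local Derby database, which is why a metastore_db/
directory and a derby.log file appear in the working directory. Without
Hive support, Spark falls back to its purely in-memory catalog.

    import org.apache.spark.sql.SparkSession

    // With enableHiveSupport() and no remote metastore configured, the
    // embedded Hive metastore keeps its data in a local Derby database
    // under ./metastore_db. Without it, the session would use Spark's
    // in-memory catalog instead.
    val spark = SparkSession.builder()
      .appName("which-catalog")
      .enableHiveSupport()
      .getOrCreate()

    // Prints "hive" or "in-memory"
    println(spark.conf.get("spark.sql.catalogImplementation"))

That embedded Derby-backed metastore is fine for a single user on a
laptop, but it is not designed for concurrent or production use, which is
one reason the choice of catalog matters.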

HTH

Mich Talebzadeh,
Technologist | Architect | Data Engineer  | Generative AI | FinCrime
London
United Kingdom


   View my LinkedIn profile:
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

 https://en.everybodywiki.com/Mich_Talebzadeh


*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed. It is essential to note
that, as with any advice, "one test result is worth one-thousand expert
opinions" (Wernher von Braun
<https://en.wikipedia.org/wiki/Wernher_von_Braun>).


On Thu, 25 Apr 2024 at 14:38, Nimrod Ofek <ofek.nim...@gmail.com> wrote:

> Thanks for the detailed answer.
> The thing I'm missing is this: let's say that the output format I choose
> is Delta Lake or Iceberg or whatever format that uses Parquet. Where does
> the catalog implementation (which, AFAIK, holds the same metadata that
> Iceberg and Delta Lake save for their tables about their columns) come
> into play, and why should it affect performance?
> Another thing is that if I understand correctly, and I might be totally
> wrong here, the internal Spark catalog is a local installation of the Hive
> metastore anyway, so I'm not sure what the catalog has to do with anything.
>
> Thanks!
>
>
> On Thu, 25 Apr 2024 at 16:14, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> My take regarding your question is that your mileage varies, so to speak.
>>
>> 1) Hive provides a more mature and widely adopted catalog solution that
>> integrates well with other components in the Hadoop ecosystem, such as
>> HDFS, HBase, and YARN. If you are Hadoop-centric (say, on-premise), using
>> Hive may offer better compatibility and interoperability.
>> 2) Hive provides a SQL-like interface that is familiar to users who are
>> accustomed to traditional RDBMSs. If your use case involves complex SQL
>> queries or existing SQL-based workflows, using Hive may be advantageous.
>> 3) If you are looking for performance, Spark's native catalog tends to
>> offer better performance for certain workloads, particularly those that
>> involve iterative processing or complex data transformations (my
>> understanding). Spark's in-memory processing capabilities and
>> optimizations make it well suited for interactive analytics and machine
>> learning tasks (my favourite).
>> 4) Integration with Spark workflows: if you primarily use Spark for data
>> processing and analytics, using Spark's native catalog may simplify
>> workflow management and reduce overhead. Spark's tight integration with
>> its catalog allows for seamless interaction with Spark applications and
>> libraries.
>> 5) There seems to be some similarity between the Spark catalog and
>> Databricks Unity Catalog, so that may favour the choice.
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
>> London
>> United Kingdom
>>
>>
>>    View my LinkedIn profile:
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>> *Disclaimer:* The information provided is correct to the best of my
>> knowledge but of course cannot be guaranteed. It is essential to note
>> that, as with any advice, "one test result is worth one-thousand expert
>> opinions" (Wernher von Braun
>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>).
>>
>>
>> On Thu, 25 Apr 2024 at 12:30, Nimrod Ofek <ofek.nim...@gmail.com> wrote:
>>
>>> I will also appreciate some material that describes the differences
>>> between Spark native tables and Hive tables, and why each should be used...
>>>
>>> Thanks
>>> Nimrod
>>>
>>> On Thu, 25 Apr 2024 at 14:27, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> I see a statement made as below, and I quote:
>>>>
>>>> "The proposal of SPARK-46122 is to switch the default value of this
>>>> configuration from `true` to `false` to use Spark native tables because
>>>> we support better."
>>>>
>>>> Can you please elaborate on the above, specifically with regard to the
>>>> phrase ".. because we support better"?
>>>>
>>>> Are you referring to the performance of the Spark catalog (I believe it
>>>> is internal) or to its integration with Spark?
>>>>
>>>> HTH
>>>>
>>>> Mich Talebzadeh,
>>>> Technologist | Architect | Data Engineer  | Generative AI | FinCrime
>>>> London
>>>> United Kingdom
>>>>
>>>>
>>>>    View my LinkedIn profile:
>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>
>>>>
>>>> *Disclaimer:* The information provided is correct to the best of my
>>>> knowledge but of course cannot be guaranteed. It is essential to note
>>>> that, as with any advice, "one test result is worth one-thousand expert
>>>> opinions" (Wernher von Braun
>>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>).
>>>>
>>>>
>>>> On Thu, 25 Apr 2024 at 11:17, Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>
>>>>> +1
>>>>>
>>>>> On Thu, Apr 25, 2024 at 2:46 PM Kent Yao <y...@apache.org> wrote:
>>>>>
>>>>>> +1
>>>>>>
>>>>>> Nit: the umbrella ticket is SPARK-44111, not SPARK-44444.
>>>>>>
>>>>>> Thanks,
>>>>>> Kent Yao
>>>>>>
>>>>>> Dongjoon Hyun <dongjoon.h...@gmail.com> wrote on Thu, 25 Apr 2024 at 14:39:
>>>>>> >
>>>>>> > Hi, All.
>>>>>> >
>>>>>> > It's great to see community activities to polish 4.0.0 more and
>>>>>> more.
>>>>>> > Thank you all.
>>>>>> >
>>>>>> > I'd like to bring SPARK-46122 (another SQL topic) to you from the
>>>>>> subtasks
>>>>>> > of SPARK-44444 (Prepare Apache Spark 4.0.0),
>>>>>> >
>>>>>> > - https://issues.apache.org/jira/browse/SPARK-46122
>>>>>> >    Set `spark.sql.legacy.createHiveTableByDefault` to `false` by
>>>>>> default
>>>>>> >
>>>>>> > This legacy configuration is about `CREATE TABLE` SQL syntax without
>>>>>> > `USING` and `STORED AS`, which is currently mapped to `Hive` table.
>>>>>> > The proposal of SPARK-46122 is to switch the default value of this
>>>>>> > configuration from `true` to `false` to use Spark native tables
>>>>>> because
>>>>>> > we support better.
>>>>>> >
>>>>>> > In other words, Spark will use the value of
>>>>>> `spark.sql.sources.default`
>>>>>> > as the table provider instead of `Hive` like the other Spark APIs.
>>>>>> Of course,
>>>>>> > the users can get all the legacy behavior by setting back to `true`.
>>>>>> >
>>>>>> > Historically, this behavior change was merged once at Apache Spark
>>>>>> 3.0.0
>>>>>> > preparation via SPARK-30098 already, but reverted during the 3.0.0
>>>>>> RC period.
>>>>>> >
>>>>>> > 2019-12-06: SPARK-30098 Use default datasource as provider for
>>>>>> CREATE TABLE
>>>>>> > 2020-05-16: SPARK-31707 Revert SPARK-30098 Use default datasource as
>>>>>> >             provider for CREATE TABLE command
>>>>>> >
>>>>>> > At Apache Spark 3.1.0, we had another discussion about this and
>>>>>> defined it
>>>>>> > as one of legacy behavior via this configuration via reused ID,
>>>>>> SPARK-30098.
>>>>>> >
>>>>>> > 2020-12-01:
>>>>>> https://lists.apache.org/thread/8c8k1jk61pzlcosz3mxo4rkj5l23r204
>>>>>> > 2020-12-03: SPARK-30098 Add a configuration to use default
>>>>>> datasource as
>>>>>> >             provider for CREATE TABLE command
>>>>>> >
>>>>>> > Last year, we received two additional requests twice to switch this
>>>>>> because
>>>>>> > Apache Spark 4.0.0 is a good time to make a decision for the future
>>>>>> direction.
>>>>>> >
>>>>>> > 2023-02-27: SPARK-42603 as an independent idea.
>>>>>> > 2023-11-27: SPARK-46122 as a part of Apache Spark 4.0.0 idea
>>>>>> >
>>>>>> >
>>>>>> > WDYT? The technical scope is defined in the following PR which is
>>>>>> one line of main
>>>>>> > code, one line of migration guide, and a few lines of test code.
>>>>>> >
>>>>>> > - https://github.com/apache/spark/pull/46207
>>>>>> >
>>>>>> > Dongjoon.
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>>>>>
>>>>>>
