Well, I will be surprised, because the Derby database is single-threaded and won't be of much use here. Most Hive metastores in the commercial world use PostgreSQL or Oracle as the metastore backend, because those are battle-proven, replicated and backed up.
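For illustration, a minimal sketch of pointing a Hive-enabled Spark session at an external PostgreSQL metastore instead of embedded Derby (the host, database name and credentials are hypothetical, and the PostgreSQL JDBC driver is assumed to be on the classpath):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("external-metastore-demo")
      .enableHiveSupport()
      // Point the Hive metastore client at PostgreSQL instead of embedded Derby.
      .config("spark.hadoop.javax.jdo.option.ConnectionURL",
        "jdbc:postgresql://db-host:5432/metastore")  // hypothetical host/database
      .config("spark.hadoop.javax.jdo.option.ConnectionDriverName", "org.postgresql.Driver")
      .config("spark.hadoop.javax.jdo.option.ConnectionUserName", "hive")    // hypothetical
      .config("spark.hadoop.javax.jdo.option.ConnectionPassword", "secret")  // hypothetical
      .getOrCreate()

The same javax.jdo.option.* settings can equally live in hive-site.xml on the classpath; the spark.hadoop.* prefix simply forwards them into the Hadoop/Hive configuration.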
Mich Talebzadeh,
Technologist | Architect | Data Engineer | Generative AI | FinCrime
London
United Kingdom

view my LinkedIn profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

https://en.everybodywiki.com/Mich_Talebzadeh

*Disclaimer:* The information provided is correct to the best of my knowledge but of course cannot be guaranteed. It is essential to note that, as with any advice, quote "one test result is worth one-thousand expert opinions" (Wernher von Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>).

On Thu, 25 Apr 2024 at 15:39, Nimrod Ofek <ofek.nim...@gmail.com> wrote:

> Yes, an in-memory Hive catalog backed by a local Derby DB.
> And again, I presume that most metadata-related parts happen during planning and not during the actual run, so I don't see why it should strongly affect query performance.
>
> Thanks,
>
> On Thu, 25 Apr 2024 at 17:29, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
>> With regard to your point below:
>>
>> "The thing I'm missing is this: let's say that the output format I choose is Delta Lake or Iceberg or whatever format that uses Parquet. Where does the catalog implementation (which holds metadata afaik, the same metadata that Iceberg and Delta Lake save for their tables about their columns) come into play, and why should it affect performance?"
>>
>> The catalog implementation comes into play regardless of the output format chosen (Delta Lake, Iceberg, Parquet, etc.) because it is responsible for managing metadata about the datasets, tables, schemas and other objects stored in the aforementioned formats. Even though Delta Lake and Iceberg have their own internal metadata management mechanisms, they still rely on the catalog to provide a unified interface for accessing and manipulating metadata across different storage formats.
>>
>> "Another thing is that if I understand correctly, and I might be totally wrong here, the internal Spark catalog is a local installation of Hive metastore anyway, so I'm not sure what the catalog has to do with anything."
>>
>> I don't understand this. Do you mean a Derby database?
>>
>> HTH
>>
>> Mich
>>
>> On Thu, 25 Apr 2024 at 14:38, Nimrod Ofek <ofek.nim...@gmail.com> wrote:
>>
>>> Thanks for the detailed answer.
>>> The thing I'm missing is this: let's say that the output format I choose is Delta Lake or Iceberg or whatever format that uses Parquet. Where does the catalog implementation (which holds metadata afaik, the same metadata that Iceberg and Delta Lake save for their tables about their columns) come into play, and why should it affect performance?
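To make the catalog's role in this exchange concrete, a minimal sketch of registering a table-format catalog through Spark's catalog plugin mechanism (assuming the Apache Iceberg Spark runtime jar is on the classpath; the catalog name `demo`, namespace `db`, table `t` and warehouse path are hypothetical):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("catalog-plugin-demo")
      // Register an Iceberg catalog named "demo"; Spark consults it during
      // analysis/planning to resolve table identifiers and load schemas.
      .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
      .config("spark.sql.catalog.demo.type", "hadoop")                // file-system-based catalog
      .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg-wh")  // hypothetical path
      .getOrCreate()

    // The identifier demo.db.t is resolved through the "demo" catalog at plan
    // time; the data files underneath are still Parquet.
    spark.sql("CREATE TABLE demo.db.t (id BIGINT) USING iceberg")

This is consistent with the point above: the catalog lookup happens once per table during planning, not per row during execution.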
>>> Another thing is that, if I understand correctly, and I might be totally wrong here, the internal Spark catalog is a local installation of Hive metastore anyway, so I'm not sure what the catalog has to do with anything.
>>>
>>> Thanks!
>>>
>>> On Thu, 25 Apr 2024 at 16:14, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>
>>>> My take regarding your question is that your mileage varies, so to speak.
>>>>
>>>> 1) Hive provides a more mature and widely adopted catalog solution that integrates well with other components in the Hadoop ecosystem, such as HDFS, HBase and YARN. If you are Hadoop-centric (say, on-premise), using Hive may offer better compatibility and interoperability.
>>>> 2) Hive provides a SQL-like interface that is familiar to users who are accustomed to traditional RDBMSs. If your use case involves complex SQL queries or existing SQL-based workflows, using Hive may be advantageous.
>>>> 3) If you are looking for performance, Spark's native catalog tends to offer better performance for certain workloads, particularly those that involve iterative processing or complex data transformations (my understanding). Spark's in-memory processing capabilities and optimizations make it well suited for interactive analytics and machine learning tasks (my favourite).
>>>> 4) Integration with Spark workflows: if you primarily use Spark for data processing and analytics, using Spark's native catalog may simplify workflow management and reduce overhead. Spark's tight integration with its catalog allows for seamless interaction with Spark applications and libraries.
>>>> 5) There seems to be some similarity between the Spark catalog and the Databricks Unity Catalog, so that may favour the choice.
>>>>
>>>> HTH
>>>>
>>>> Mich
>>>>
>>>> On Thu, 25 Apr 2024 at 12:30, Nimrod Ofek <ofek.nim...@gmail.com> wrote:
>>>>
>>>>> I would also appreciate some material that describes the differences between Spark native tables and Hive tables, and why each should be used...
>>>>>
>>>>> Thanks,
>>>>> Nimrod
>>>>>
>>>>> On Thu, 25 Apr 2024 at 14:27, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>
>>>>>> I see a statement made as below, and I quote:
>>>>>>
>>>>>> "The proposal of SPARK-46122 is to switch the default value of this configuration from `true` to `false` to use Spark native tables because we support better."
>>>>>>
>>>>>> Can you please elaborate on the above, specifically with regard to the phrase ".. because we support better"?
>>>>>>
>>>>>> Are you referring to the performance of the Spark catalog (I believe it is internal) or to its integration with Spark?
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>> Mich
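As a concrete way to see the distinction between Spark native tables and Hive tables discussed here, one can compare the Provider field reported by DESCRIBE EXTENDED (a sketch, assuming a Hive-enabled spark-shell; the table names are hypothetical):

    // A Spark native (data source) table vs. a Hive SerDe table:
    spark.sql("CREATE TABLE t_native (id INT) USING parquet")
    spark.sql("CREATE TABLE t_hive (id INT) STORED AS PARQUET")

    // The Provider field should read "parquet" for the native table and "hive"
    // for the Hive table (exact output can vary by Spark version).
    spark.sql("DESCRIBE EXTENDED t_native").where("col_name = 'Provider'").show()
    spark.sql("DESCRIBE EXTENDED t_hive").where("col_name = 'Provider'").show()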
>>>>>> On Thu, 25 Apr 2024 at 11:17, Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>>>
>>>>>>> +1
>>>>>>>
>>>>>>> On Thu, Apr 25, 2024 at 2:46 PM Kent Yao <y...@apache.org> wrote:
>>>>>>>
>>>>>>>> +1
>>>>>>>>
>>>>>>>> Nit: the umbrella ticket is SPARK-44111, not SPARK-44444.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Kent Yao
>>>>>>>>
>>>>>>>> On Thu, 25 Apr 2024 at 14:39, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>>>>>> >
>>>>>>>> > Hi, All.
>>>>>>>> >
>>>>>>>> > It's great to see community activities to polish 4.0.0 more and more. Thank you all.
>>>>>>>> >
>>>>>>>> > I'd like to bring SPARK-46122 (another SQL topic) to you from the subtasks of SPARK-44444 (Prepare Apache Spark 4.0.0):
>>>>>>>> >
>>>>>>>> > - https://issues.apache.org/jira/browse/SPARK-46122
>>>>>>>> >   Set `spark.sql.legacy.createHiveTableByDefault` to `false` by default
>>>>>>>> >
>>>>>>>> > This legacy configuration is about the `CREATE TABLE` SQL syntax without `USING` and `STORED AS`, which is currently mapped to a `Hive` table. The proposal of SPARK-46122 is to switch the default value of this configuration from `true` to `false` to use Spark native tables because we support better.
>>>>>>>> >
>>>>>>>> > In other words, Spark will use the value of `spark.sql.sources.default` as the table provider instead of `Hive`, like the other Spark APIs. Of course, users can get all the legacy behavior back by setting it to `true`.
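For illustration, a minimal sketch of this behavior difference (assuming a Hive-enabled spark-shell and the default `spark.sql.sources.default` of `parquet`; the table names are hypothetical):

    // Legacy default: a bare CREATE TABLE (no USING / STORED AS) makes a Hive table.
    spark.sql("SET spark.sql.legacy.createHiveTableByDefault=true")
    spark.sql("CREATE TABLE t_legacy (id INT)")   // Hive text-format table

    // Proposed 4.0.0 default: the provider comes from spark.sql.sources.default.
    spark.sql("SET spark.sql.legacy.createHiveTableByDefault=false")
    spark.sql("CREATE TABLE t_default (id INT)")  // Parquet data source table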
>>>>>>>> > Historically, this behavior change was merged once during Apache Spark 3.0.0 preparation via SPARK-30098, but was reverted during the 3.0.0 RC period.
>>>>>>>> >
>>>>>>>> > 2019-12-06: SPARK-30098 Use default datasource as provider for CREATE TABLE
>>>>>>>> > 2020-05-16: SPARK-31707 Revert SPARK-30098 Use default datasource as provider for CREATE TABLE command
>>>>>>>> >
>>>>>>>> > At Apache Spark 3.1.0, we had another discussion about this and defined it as a legacy behavior behind this configuration, reusing the ID SPARK-30098.
>>>>>>>> >
>>>>>>>> > 2020-12-01: https://lists.apache.org/thread/8c8k1jk61pzlcosz3mxo4rkj5l23r204
>>>>>>>> > 2020-12-03: SPARK-30098 Add a configuration to use default datasource as provider for CREATE TABLE command
>>>>>>>> >
>>>>>>>> > Last year, we received two additional requests to switch this, because Apache Spark 4.0.0 is a good time to make a decision on the future direction.
>>>>>>>> >
>>>>>>>> > 2023-02-27: SPARK-42603 as an independent idea.
>>>>>>>> > 2023-11-27: SPARK-46122 as a part of the Apache Spark 4.0.0 idea.
>>>>>>>> >
>>>>>>>> > WDYT? The technical scope is defined in the following PR, which is one line of main code, one line of migration guide, and a few lines of test code.
>>>>>>>> >
>>>>>>>> > - https://github.com/apache/spark/pull/46207
>>>>>>>> >
>>>>>>>> > Dongjoon.