Thank you, Kent, Wenchen, Mich, Nimrod, Yuming, and Liang-Chi. I'll start a vote.
To Mich, for your question: Apache Spark has a long history of converting Hive-provider tables into Spark's datasource tables in order to handle them better, in a Spark-native way.

> Can you please elaborate on the above specifically with regard to the phrase
> ".. because we support better."

Here is the subset of configurations you can take a look at:

- spark.sql.hive.convertMetastoreParquet (`true` since Spark 1.3.0)
- spark.sql.hive.convertMetastoreOrc (`true` since Spark 2.4.0)
- spark.sql.hive.convertInsertingPartitionedTable (`true` since Spark 3.0.0)
- spark.sql.hive.convertMetastoreInsertDir (`true` since Spark 3.3.0)
- spark.sql.hive.convertInsertingUnpartitionedTable (`true` since Spark 4.0.0)
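For illustration, here is a minimal spark-shell sketch (assuming a session with Hive support enabled; the table name is hypothetical) of what the first of these configurations controls:

  // Create a Hive-provider Parquet table, then compare the physical plans
  // with the metastore conversion switched on and off.
  spark.sql("CREATE TABLE demo_hive_parquet (id INT) STORED AS PARQUET")

  spark.conf.set("spark.sql.hive.convertMetastoreParquet", "true")
  spark.sql("SELECT * FROM demo_hive_parquet").explain()
  // expected: Spark's native datasource reader, e.g. a "FileScan parquet" node

  spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")
  spark.sql("SELECT * FROM demo_hive_parquet").explain()
  // expected: the Hive SerDe read path, e.g. a "Scan hive" node

The other configurations extend the same conversion idea to ORC and to the various INSERT paths.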
Dongjoon.

On Fri, Apr 26, 2024 at 12:24 AM L. C. Hsieh <vii...@gmail.com> wrote:

> +1
>
> On Thu, Apr 25, 2024 at 8:16 PM Yuming Wang <yumw...@apache.org> wrote:
>
>> +1
>>
>> On Fri, Apr 26, 2024 at 8:25 AM Nimrod Ofek <ofek.nim...@gmail.com> wrote:
>>
>>> Of course. I can't think of a scenario with thousands of tables on a single in-memory Spark cluster with an in-memory catalog.
>>> Thanks for the help!
>>>
>>> On Thu, Apr 25, 2024 at 23:56, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>
>>>> Agreed. In scenarios where most of the interactions with the catalog are related to query planning, saving, and metadata management, the choice of catalog implementation may have less impact on query runtime performance. This is because the time spent on metadata operations is generally minimal compared to the time spent on actual data fetching, processing, and computation.
>>>>
>>>> However, we should also consider scalability and reliability concerns, especially as the size and complexity of the data and the query workload grow. While an in-memory catalog may offer excellent performance for smaller workloads, it will face limitations in handling larger-scale deployments with thousands of tables, partitions, and users. Additionally, durability and persistence are crucial considerations, particularly in production environments where data integrity and availability matter. In-memory catalog implementations may lack durability, meaning that metadata changes could be lost in the event of a system failure or restart. Therefore, while in-memory catalog implementations can provide speed and efficiency for certain use cases, we ought to consider the requirements for scalability, reliability, and data durability when choosing a catalog solution for production deployments. In many cases, a combination of in-memory and disk-based catalog solutions may offer the best balance of performance and resilience for demanding large-scale workloads.
>>>>
>>>> HTH
>>>>
>>>> Mich Talebzadeh,
>>>> Technologist | Architect | Data Engineer | Generative AI | FinCrime
>>>> London, United Kingdom
>>>>
>>>> View my LinkedIn profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>> https://en.everybodywiki.com/Mich_Talebzadeh
>>>>
>>>> *Disclaimer:* The information provided is correct to the best of my knowledge but of course cannot be guaranteed. It is essential to note that, as with any advice, "one test result is worth one-thousand expert opinions" (Wernher von Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>).
>>>>
>>>> On Thu, 25 Apr 2024 at 16:32, Nimrod Ofek <ofek.nim...@gmail.com> wrote:
>>>>
>>>>> Of course, but it is in memory and not persisted, which is much faster. And, as I said, I believe most of the interaction with it happens during planning and save rather than during the actual query run, and those parts are short and minimal compared to data fetching and manipulation, so I don't believe it will have a big impact on query runtime...
>>>>>
>>>>> On Thu, Apr 25, 2024 at 17:52, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>
>>>>>> Well, I would be surprised, because the Derby database is single-threaded and won't be of much use here.
>>>>>>
>>>>>> Most Hive metastores in the commercial world utilise PostgreSQL or Oracle as the backing store; these are battle-proven, replicated, and backed up.
>>>>>>
>>>>>> Mich Talebzadeh,
>>>>>> Technologist | Architect | Data Engineer | Generative AI | FinCrime
>>>>>> London, United Kingdom
>>>>>>
>>>>>> On Thu, 25 Apr 2024 at 15:39, Nimrod Ofek <ofek.nim...@gmail.com> wrote:
>>>>>>
>>>>>>> Yes, an in-memory Hive catalog backed by a local Derby DB.
>>>>>>> And again, I presume that most metadata-related work happens during planning and not during the actual run, so I don't see why it should strongly affect query performance.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> On Thu, Apr 25, 2024 at 17:29, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>>
>>>>>>>> With regard to your point below:
>>>>>>>>
>>>>>>>> "The thing I'm missing is this: let's say that the output format I choose is Delta Lake or Iceberg or whatever format that uses Parquet. Where does the catalog implementation (which holds metadata afaik, the same metadata that Iceberg and Delta Lake save for their tables about their columns) come into play, and why should it affect performance?"
>>>>>>>>
>>>>>>>> The catalog implementation comes into play regardless of the output format chosen (Delta Lake, Iceberg, Parquet, etc.) because it is responsible for managing metadata about the datasets, tables, schemas, and other objects stored in those formats. Even though Delta Lake and Iceberg have their own internal metadata management mechanisms, they still rely on the catalog to provide a unified interface for accessing and manipulating metadata across different storage formats.
>>>>>>>>
>>>>>>>> "Another thing is that if I understand correctly, and I might be totally wrong here, the internal Spark catalog is a local installation of Hive metastore anyway, so I'm not sure what the catalog has to do with anything."
>>>>>>>>
>>>>>>>> I don't understand this. Do you mean a Derby database?
>>>>>>>>
>>>>>>>> HTH
>>>>>>>>
>>>>>>>> Mich Talebzadeh,
>>>>>>>> Technologist | Architect | Data Engineer | Generative AI | FinCrime
>>>>>>>> London, United Kingdom
>>>>>>>>
>>>>>>>> On Thu, 25 Apr 2024 at 14:38, Nimrod Ofek <ofek.nim...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Thanks for the detailed answer.
>>>>>>>>> The thing I'm missing is this: let's say that the output format I choose is Delta Lake or Iceberg or whatever format that uses Parquet. Where does the catalog implementation (which holds metadata afaik, the same metadata that Iceberg and Delta Lake save for their tables about their columns) come into play, and why should it affect performance?
>>>>>>>>> Another thing is that if I understand correctly, and I might be totally wrong here, the internal Spark catalog is a local installation of Hive metastore anyway, so I'm not sure what the catalog has to do with anything.
>>>>>>>>>
>>>>>>>>> Thanks!
>>>>>>>>>
>>>>>>>>> On Thu, Apr 25, 2024 at 16:14, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> My take regarding your question is that your mileage varies, so to speak.
>>>>>>>>>>
>>>>>>>>>> 1) Hive provides a more mature and widely adopted catalog solution that integrates well with other components in the Hadoop ecosystem, such as HDFS, HBase, and YARN. If you are Hadoop-centric (say, on-premise), using Hive may offer better compatibility and interoperability.
>>>>>>>>>> 2) Hive provides a SQL-like interface that is familiar to users who are accustomed to traditional RDBMSs. If your use case involves complex SQL queries or existing SQL-based workflows, using Hive may be advantageous.
>>>>>>>>>> 3) If you are looking for performance, Spark's native catalog tends to offer better performance for certain workloads, particularly those that involve iterative processing or complex data transformations (my understanding). Spark's in-memory processing capabilities and optimizations make it well-suited for interactive analytics and machine learning tasks (my favourite).
>>>>>>>>>> 4) Integration with Spark workflows: if you primarily use Spark for data processing and analytics, using Spark's native catalog may simplify workflow management and reduce overhead. Spark's tight integration with its catalog allows for seamless interaction with Spark applications and libraries (see the sketch below).
>>>>>>>>>> 5) There seems to be some similarity between the Spark catalog and the Databricks Unity Catalog, so that may favour the choice.
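>>>>>>>>>>
>>>>>>>>>> As a small illustration of point 4, the session-level Catalog API can be used directly (a minimal sketch; the database and table names are hypothetical):
>>>>>>>>>>
>>>>>>>>>>   // Inspect metadata through Spark's built-in Catalog API.
>>>>>>>>>>   spark.catalog.listDatabases().show()
>>>>>>>>>>   spark.catalog.listTables("default").show()
>>>>>>>>>>   spark.catalog.listColumns("default", "demo_hive_parquet").show()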
>>>>>>>>>> HTH
>>>>>>>>>>
>>>>>>>>>> Mich Talebzadeh,
>>>>>>>>>> Technologist | Architect | Data Engineer | Generative AI | FinCrime
>>>>>>>>>> London, United Kingdom
>>>>>>>>>>
>>>>>>>>>> On Thu, 25 Apr 2024 at 12:30, Nimrod Ofek <ofek.nim...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> I would also appreciate some material that describes the differences between Spark native tables and Hive tables and why each should be used...
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Nimrod
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Apr 25, 2024 at 14:27, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I see a statement made as below, and I quote:
>>>>>>>>>>>>
>>>>>>>>>>>> "The proposal of SPARK-46122 is to switch the default value of this configuration from `true` to `false` to use Spark native tables because we support better."
>>>>>>>>>>>>
>>>>>>>>>>>> Can you please elaborate on the above specifically with regard to the phrase ".. because we support better."
>>>>>>>>>>>>
>>>>>>>>>>>> Are you referring to the performance of the Spark catalog (I believe it is internal) or to integration with Spark?
>>>>>>>>>>>>
>>>>>>>>>>>> HTH
>>>>>>>>>>>>
>>>>>>>>>>>> Mich Talebzadeh,
>>>>>>>>>>>> Technologist | Architect | Data Engineer | Generative AI | FinCrime
>>>>>>>>>>>> London, United Kingdom
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, 25 Apr 2024 at 11:17, Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> +1
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Apr 25, 2024 at 2:46 PM Kent Yao <y...@apache.org> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> +1
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Nit: the umbrella ticket is SPARK-44111, not SPARK-44444.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Kent Yao
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Apr 25, 2024 at 14:39, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > Hi, All.
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > It's great to see community activities to polish 4.0.0 more and more. Thank you all.
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > I'd like to bring SPARK-46122 (another SQL topic) to you from the subtasks of SPARK-44444 (Prepare Apache Spark 4.0.0).
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > - https://issues.apache.org/jira/browse/SPARK-46122
>>>>>>>>>>>>>> >   Set `spark.sql.legacy.createHiveTableByDefault` to `false` by default
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > This legacy configuration is about the `CREATE TABLE` SQL syntax without `USING` and `STORED AS`, which is currently mapped to a `Hive` table. The proposal of SPARK-46122 is to switch the default value of this configuration from `true` to `false` to use Spark native tables because we support better.
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > In other words, Spark will use the value of `spark.sql.sources.default` as the table provider instead of `Hive`, like the other Spark APIs. Of course, users can get all the legacy behavior back by setting it to `true`.
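>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > For example, a minimal sketch of the behavior change (illustrative table names; this assumes the configuration can be toggled at the session level):
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> >   spark.conf.set("spark.sql.legacy.createHiveTableByDefault", "true")
>>>>>>>>>>>>>> >   spark.sql("CREATE TABLE t1 (id INT)")  // today's default: a Hive-provider table
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> >   spark.conf.set("spark.sql.legacy.createHiveTableByDefault", "false")
>>>>>>>>>>>>>> >   spark.sql("CREATE TABLE t2 (id INT)")  // provider follows spark.sql.sources.default
>>>>>>>>>>>>>> >   spark.sql("DESCRIBE TABLE EXTENDED t2").show(100, false)  // Provider: parquet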
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > Historically, this behavior change was merged once already during the Apache Spark 3.0.0 preparation via SPARK-30098, but it was reverted during the 3.0.0 RC period.
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > 2019-12-06: SPARK-30098 Use default datasource as provider for CREATE TABLE
>>>>>>>>>>>>>> > 2020-05-16: SPARK-31707 Revert SPARK-30098 Use default datasource as provider for CREATE TABLE command
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > At Apache Spark 3.1.0, we had another discussion about this and defined it as a legacy behavior behind this configuration, via the reused ID SPARK-30098.
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > 2020-12-01: https://lists.apache.org/thread/8c8k1jk61pzlcosz3mxo4rkj5l23r204
>>>>>>>>>>>>>> > 2020-12-03: SPARK-30098 Add a configuration to use default datasource as provider for CREATE TABLE command
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > Last year, we received two additional requests to switch this, because Apache Spark 4.0.0 is a good time to make a decision for the future direction.
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > 2023-02-27: SPARK-42603 as an independent idea.
>>>>>>>>>>>>>> > 2023-11-27: SPARK-46122 as a part of the Apache Spark 4.0.0 idea.
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > WDYT? The technical scope is defined in the following PR, which is one line of main code, one line of migration guide, and a few lines of test code.
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > - https://github.com/apache/spark/pull/46207
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > Dongjoon.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org