+1

On Thu, Apr 25, 2024 at 8:16 PM Yuming Wang <yumw...@apache.org> wrote:
> +1
>
> On Fri, Apr 26, 2024 at 8:25 AM Nimrod Ofek <ofek.nim...@gmail.com> wrote:
>
>> Of course, I can't think of a scenario with thousands of tables on a
>> single in-memory Spark cluster with an in-memory catalog.
>> Thanks for the help!
>>
>> On Thu, Apr 25, 2024 at 23:56, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>>> Agreed. In scenarios where most of the interactions with the catalog are
>>> related to query planning, saving, and metadata management, the choice of
>>> catalog implementation may have less impact on query runtime performance,
>>> because the time spent on metadata operations is generally minimal
>>> compared to the time spent on actual data fetching, processing, and
>>> computation.
>>>
>>> However, scalability and reliability become real concerns as the size
>>> and complexity of the data and query workload grow. While an in-memory
>>> catalog may offer excellent performance for smaller workloads, it will
>>> face limitations in handling larger-scale deployments with thousands of
>>> tables, partitions, and users. Durability and persistence are also
>>> crucial considerations, particularly in production environments where
>>> data integrity and availability matter. In-memory catalog
>>> implementations may lack durability, meaning that metadata changes could
>>> be lost in the event of a system failure or restart. Therefore, while
>>> in-memory catalog implementations can provide speed and efficiency for
>>> certain use cases, we ought to consider the requirements for
>>> scalability, reliability, and data durability when choosing a catalog
>>> solution for production deployments. In many cases, a combination of
>>> in-memory and disk-based catalog solutions may offer the best balance of
>>> performance and resilience for demanding large-scale workloads.
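[Editor's note on the durability point above: the usual way to make the catalog persistent is to back the Hive metastore with an external RDBMS. A minimal hive-site.xml sketch, assuming a PostgreSQL instance at a placeholder host `metastore-db`; the property names are standard Hive metastore settings, while the URL and credentials are illustrative.]

```xml
<configuration>
  <!-- JDBC connection for a persistent, PostgreSQL-backed metastore -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:postgresql://metastore-db:5432/hive_metastore</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.postgresql.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>changeme</value>
  </property>
</configuration>
```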
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Technologist | Architect | Data Engineer | Generative AI | FinCrime
>>> London
>>> United Kingdom
>>>
>>> view my LinkedIn profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>> https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>> *Disclaimer:* The information provided is correct to the best of my
>>> knowledge but of course cannot be guaranteed. It is essential to note
>>> that, as with any advice, "one test result is worth one-thousand expert
>>> opinions" (Wernher von Braun
>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>).
>>>
>>> On Thu, 25 Apr 2024 at 16:32, Nimrod Ofek <ofek.nim...@gmail.com> wrote:
>>>
>>>> Of course, but it's in memory and not persisted, which is much faster.
>>>> And as I said, I believe that most of the interaction with it happens
>>>> during planning and save rather than during the actual query run, and
>>>> those operations are short and minimal compared to data fetching and
>>>> manipulation, so I don't believe it will have a big impact on query
>>>> run...
>>>>
>>>> On Thu, Apr 25, 2024 at 17:52, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>
>>>>> Well, I would be surprised, because the Derby database is
>>>>> single-threaded and won't be of much use here.
>>>>>
>>>>> Most Hive metastores in the commercial world use PostgreSQL or Oracle,
>>>>> which are battle-proven, replicated, and backed up.
>>>>>
>>>>> On Thu, 25 Apr 2024 at 15:39, Nimrod Ofek <ofek.nim...@gmail.com> wrote:
>>>>>
>>>>>> Yes, an in-memory Hive catalog backed by a local Derby DB.
>>>>>> And again, I presume that most metadata-related work happens during
>>>>>> planning and not the actual run, so I don't see why it should
>>>>>> strongly affect query performance.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> On Thu, Apr 25, 2024 at 17:29, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>
>>>>>>> With regard to your point below:
>>>>>>>
>>>>>>> "The thing I'm missing is this: let's say that the output format I
>>>>>>> choose is delta lake or iceberg or whatever format that uses
>>>>>>> parquet. Where does the catalog implementation (which holds metadata
>>>>>>> afaik, same metadata that iceberg and delta lake save for their
>>>>>>> tables about their columns) comes into play and why should it affect
>>>>>>> performance?"
>>>>>>>
>>>>>>> The catalog implementation comes into play regardless of the output
>>>>>>> format chosen (Delta Lake, Iceberg, Parquet, etc.) because it is
>>>>>>> responsible for managing metadata about the datasets, tables,
>>>>>>> schemas, and other objects stored in the aforementioned formats.
>>>>>>> Even though Delta Lake and Iceberg have their own internal metadata
>>>>>>> management mechanisms, they still rely on the catalog to provide a
>>>>>>> unified interface for accessing and manipulating metadata across
>>>>>>> different storage formats.
>>>>>>>
>>>>>>> "Another thing is that if I understand correctly, and I might be
>>>>>>> totally wrong here, the internal spark catalog is a local
>>>>>>> installation of hive metastore anyway, so I'm not sure what the
>>>>>>> catalog has to do with anything"
>>>>>>>
>>>>>>> I don't understand this. Do you mean a Derby database?
>>>>>>>
>>>>>>> HTH
>>>>>>>
>>>>>>> On Thu, 25 Apr 2024 at 14:38, Nimrod Ofek <ofek.nim...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Thanks for the detailed answer.
>>>>>>>> The thing I'm missing is this: let's say that the output format I
>>>>>>>> choose is delta lake or iceberg or whatever format that uses
>>>>>>>> parquet. Where does the catalog implementation (which holds
>>>>>>>> metadata afaik, same metadata that iceberg and delta lake save for
>>>>>>>> their tables about their columns) come into play, and why should it
>>>>>>>> affect performance?
>>>>>>>> Another thing is that, if I understand correctly, and I might be
>>>>>>>> totally wrong here, the internal Spark catalog is a local
>>>>>>>> installation of the Hive metastore anyway, so I'm not sure what the
>>>>>>>> catalog has to do with anything.
>>>>>>>>
>>>>>>>> Thanks!
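[Editor's note for the exchange above: which catalog implementation a session uses is controlled by Spark's `spark.sql.catalogImplementation` static setting (`in-memory` or `hive`). A sketch of the two launch modes, assuming a local Spark installation; the warehouse path is illustrative.]

```
# Embedded in-memory catalog: nothing persisted across restarts
spark-shell --conf spark.sql.catalogImplementation=in-memory

# Hive catalog: with no hive-site.xml on the classpath, this falls back
# to a local Derby-backed metastore created in the working directory
spark-shell --conf spark.sql.catalogImplementation=hive \
  --conf spark.sql.warehouse.dir=/tmp/spark-warehouse
```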
>>>>>>>>
>>>>>>>> On Thu, Apr 25, 2024 at 16:14, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> My take regarding your question is that your mileage varies, so to
>>>>>>>>> speak.
>>>>>>>>>
>>>>>>>>> 1) Hive provides a more mature and widely adopted catalog solution
>>>>>>>>> that integrates well with other components in the Hadoop
>>>>>>>>> ecosystem, such as HDFS, HBase, and YARN. If you are Hadoop-centric
>>>>>>>>> (say on-premise), using Hive may offer better compatibility and
>>>>>>>>> interoperability.
>>>>>>>>> 2) Hive provides a SQL-like interface that is familiar to users
>>>>>>>>> who are accustomed to traditional RDBMSs. If your use case
>>>>>>>>> involves complex SQL queries or existing SQL-based workflows,
>>>>>>>>> using Hive may be advantageous.
>>>>>>>>> 3) If you are looking for performance, Spark's native catalog
>>>>>>>>> tends to offer better performance for certain workloads,
>>>>>>>>> particularly those that involve iterative processing or complex
>>>>>>>>> data transformations (my understanding). Spark's in-memory
>>>>>>>>> processing capabilities and optimizations make it well suited for
>>>>>>>>> interactive analytics and machine learning tasks (my favourite).
>>>>>>>>> 4) Integration with Spark workflows: if you primarily use Spark
>>>>>>>>> for data processing and analytics, using Spark's native catalog
>>>>>>>>> may simplify workflow management and reduce overhead. Spark's
>>>>>>>>> tight integration with its catalog allows for seamless interaction
>>>>>>>>> with Spark applications and libraries.
>>>>>>>>> 5) There seems to be some similarity between the Spark catalog and
>>>>>>>>> the Databricks Unity Catalog, so that may favour the choice.
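[Editor's note, to make the table-flavour distinction in the list above concrete: the provider can always be chosen explicitly in Spark SQL, independent of any default. A sketch with illustrative table names.]

```sql
-- Spark native (data source) table: provider named with USING
CREATE TABLE sales_native (id INT, amount DOUBLE) USING parquet;

-- Hive table: the Hive SerDe path, format named with STORED AS
CREATE TABLE sales_hive (id INT, amount DOUBLE) STORED AS PARQUET;
```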
>>>>>>>>>
>>>>>>>>> HTH
>>>>>>>>>
>>>>>>>>> On Thu, 25 Apr 2024 at 12:30, Nimrod Ofek <ofek.nim...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> I would also appreciate some material that describes the
>>>>>>>>>> differences between Spark native tables and Hive tables and why
>>>>>>>>>> each should be used...
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Nimrod
>>>>>>>>>>
>>>>>>>>>> On Thu, Apr 25, 2024 at 14:27, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> I see a statement made as below, and I quote:
>>>>>>>>>>>
>>>>>>>>>>> "The proposal of SPARK-46122 is to switch the default value of
>>>>>>>>>>> this configuration from `true` to `false` to use Spark native
>>>>>>>>>>> tables because we support better."
>>>>>>>>>>>
>>>>>>>>>>> Can you please elaborate on the above, specifically with regard
>>>>>>>>>>> to the phrase "... because we support better"?
>>>>>>>>>>>
>>>>>>>>>>> Are you referring to the performance of the Spark catalog (I
>>>>>>>>>>> believe it is internal) or integration with Spark?
>>>>>>>>>>>
>>>>>>>>>>> HTH
>>>>>>>>>>>
>>>>>>>>>>> On Thu, 25 Apr 2024 at 11:17, Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> +1
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Apr 25, 2024 at 2:46 PM Kent Yao <y...@apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> +1
>>>>>>>>>>>>>
>>>>>>>>>>>>> Nit: the umbrella ticket is SPARK-44111, not SPARK-44444.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Kent Yao
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Apr 25, 2024 at 14:39, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > Hi, All.
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > It's great to see community activities to polish 4.0.0 more
>>>>>>>>>>>>> > and more. Thank you all.
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > I'd like to bring SPARK-46122 (another SQL topic) to you
>>>>>>>>>>>>> > from the subtasks of SPARK-44444 (Prepare Apache Spark
>>>>>>>>>>>>> > 4.0.0):
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > - https://issues.apache.org/jira/browse/SPARK-46122
>>>>>>>>>>>>> >   Set `spark.sql.legacy.createHiveTableByDefault` to
>>>>>>>>>>>>> >   `false` by default
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > This legacy configuration is about the `CREATE TABLE` SQL
>>>>>>>>>>>>> > syntax without `USING` and `STORED AS`, which is currently
>>>>>>>>>>>>> > mapped to a `Hive` table. The proposal of SPARK-46122 is to
>>>>>>>>>>>>> > switch the default value of this configuration from `true`
>>>>>>>>>>>>> > to `false` to use Spark native tables because we support
>>>>>>>>>>>>> > better.
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > In other words, Spark will use the value of
>>>>>>>>>>>>> > `spark.sql.sources.default` as the table provider instead
>>>>>>>>>>>>> > of `Hive`, like the other Spark APIs. Of course, users can
>>>>>>>>>>>>> > get all the legacy behavior back by setting it to `true`.
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > Historically, this behavior change was merged once during
>>>>>>>>>>>>> > Apache Spark 3.0.0 preparation via SPARK-30098, but was
>>>>>>>>>>>>> > reverted during the 3.0.0 RC period.
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > 2019-12-06: SPARK-30098 Use default datasource as provider
>>>>>>>>>>>>> > for CREATE TABLE
>>>>>>>>>>>>> > 2020-05-16: SPARK-31707 Revert SPARK-30098 Use default
>>>>>>>>>>>>> > datasource as provider for CREATE TABLE command
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > At Apache Spark 3.1.0, we had another discussion about this
>>>>>>>>>>>>> > and defined it as one of the legacy behaviors via this
>>>>>>>>>>>>> > configuration, under the reused ID SPARK-30098.
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > 2020-12-01: https://lists.apache.org/thread/8c8k1jk61pzlcosz3mxo4rkj5l23r204
>>>>>>>>>>>>> > 2020-12-03: SPARK-30098 Add a configuration to use default
>>>>>>>>>>>>> > datasource as provider for CREATE TABLE command
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > Last year, we received two additional requests to switch
>>>>>>>>>>>>> > this, because Apache Spark 4.0.0 is a good time to make a
>>>>>>>>>>>>> > decision for the future direction.
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > 2023-02-27: SPARK-42603 as an independent idea.
>>>>>>>>>>>>> > 2023-11-27: SPARK-46122 as a part of the Apache Spark 4.0.0
>>>>>>>>>>>>> > idea.
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > WDYT? The technical scope is defined in the following PR,
>>>>>>>>>>>>> > which is one line of main code, one line of migration
>>>>>>>>>>>>> > guide, and a few lines of test code.
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > - https://github.com/apache/spark/pull/46207
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > Dongjoon.
>>>>>>>>>>>>>
>>>>>>>>>>>>> ---------------------------------------------------------------------
>>>>>>>>>>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
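[Editor's note, sketching the `CREATE TABLE` behaviour change described in the proposal above; the table name is illustrative, and the native-table provider comes from `spark.sql.sources.default`, which is `parquet` out of the box.]

```sql
-- No USING and no STORED AS clause:
CREATE TABLE t (id INT);
--   spark.sql.legacy.createHiveTableByDefault=true  -> Hive table (old default)
--   spark.sql.legacy.createHiveTableByDefault=false -> Spark native table,
--     with the provider taken from spark.sql.sources.default

-- Users who want the legacy behaviour can set it back explicitly:
SET spark.sql.legacy.createHiveTableByDefault=true;
```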