Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread L. C. Hsieh
+1 On Thu, Apr 25, 2024 at 8:16 PM Yuming Wang wrote: > +1 > > On Fri, Apr 26, 2024 at 8:25 AM Nimrod Ofek wrote: > >> Of course, I can't think of a scenario of thousands of tables with single >> in memory Spark cluster with in memory catalog. >> Thanks for the help! >> >> On Thu, 25

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Yuming Wang
+1 On Fri, Apr 26, 2024 at 8:25 AM Nimrod Ofek wrote: > Of course, I can't think of a scenario of thousands of tables with single > in memory Spark cluster with in memory catalog. > Thanks for the help! > > On Thu, Apr 25, 2024 at 23:56, Mich Talebzadeh < >

Re: [FYI] SPARK-47993: Drop Python 3.8

2024-04-25 Thread Denny Lee
+1 (non-binding) On Thu, Apr 25, 2024 at 19:26 Xinrong Meng wrote: > +1 > > On Thu, Apr 25, 2024 at 2:08 PM Holden Karau > wrote: > >> +1 >> >> Twitter: https://twitter.com/holdenkarau >> Books (Learning Spark, High Performance Spark, etc.): >> https://amzn.to/2MaRAG9

Re: [FYI] SPARK-47993: Drop Python 3.8

2024-04-25 Thread Xinrong Meng
+1 On Thu, Apr 25, 2024 at 2:08 PM Holden Karau wrote: > +1 > > Twitter: https://twitter.com/holdenkarau > Books (Learning Spark, High Performance Spark, etc.): > https://amzn.to/2MaRAG9 > YouTube Live Streams: https://www.youtube.com/user/holdenkarau > > > On Thu,

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Nimrod Ofek
Of course, I can't think of a scenario of thousands of tables with single in memory Spark cluster with in memory catalog. Thanks for the help! On Thu, Apr 25, 2024 at 23:56, Mich Talebzadeh <mich.talebza...@gmail.com> wrote: > > > Agreed. In scenarios where most of the interactions with

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Mich Talebzadeh
OK, thanks, got it.

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Mich Talebzadeh
Agreed. In scenarios where most of the interactions with the catalog are related to query planning, saving and metadata management, the choice of catalog implementation may have less impact on query runtime performance. This is because the time spent on metadata operations is generally minimal

Re: [FYI] SPARK-47993: Drop Python 3.8

2024-04-25 Thread L. C. Hsieh
+1 On Thu, Apr 25, 2024 at 11:19 AM Maciej wrote: > > +1 > > Best regards, > Maciej Szymkiewicz > > Web: https://zero323.net > PGP: A30CEF0C31A501EC > > On 4/25/24 6:21 PM, Reynold Xin wrote: > > +1 > > On Thu, Apr 25, 2024 at 9:01 AM Santosh Pingale > wrote: >> >> +1 >> >> On Thu, Apr 25,

Re: [FYI] SPARK-47993: Drop Python 3.8

2024-04-25 Thread Holden Karau
+1 Twitter: https://twitter.com/holdenkarau Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9 YouTube Live Streams: https://www.youtube.com/user/holdenkarau On Thu, Apr 25, 2024 at 11:18 AM Maciej wrote: > +1 > > Best regards, > Maciej

Re: [FYI] SPARK-47993: Drop Python 3.8

2024-04-25 Thread Maciej
+1 Best regards, Maciej Szymkiewicz Web: https://zero323.net PGP: A30CEF0C31A501EC On 4/25/24 6:21 PM, Reynold Xin wrote: +1 On Thu, Apr 25, 2024 at 9:01 AM Santosh Pingale wrote: +1 On Thu, Apr 25, 2024, 5:41 PM Dongjoon Hyun wrote: FYI, there is a proposal to drop

Re: [FYI] SPARK-47993: Drop Python 3.8

2024-04-25 Thread Reynold Xin
+1 On Thu, Apr 25, 2024 at 9:01 AM Santosh Pingale wrote: > +1 > > On Thu, Apr 25, 2024, 5:41 PM Dongjoon Hyun > wrote: > >> FYI, there is a proposal to drop Python 3.8 because its EOL is October >> 2024. >> >> https://github.com/apache/spark/pull/46228 >> [SPARK-47993][PYTHON] Drop Python 3.8

Re: [FYI] SPARK-47993: Drop Python 3.8

2024-04-25 Thread Santosh Pingale
+1 On Thu, Apr 25, 2024, 5:41 PM Dongjoon Hyun wrote: > FYI, there is a proposal to drop Python 3.8 because its EOL is October > 2024. > > https://github.com/apache/spark/pull/46228 > [SPARK-47993][PYTHON] Drop Python 3.8 > > Since it's still alive and there will be an overlap between the

[FYI] SPARK-47993: Drop Python 3.8

2024-04-25 Thread Dongjoon Hyun
FYI, there is a proposal to drop Python 3.8 because its EOL is October 2024. https://github.com/apache/spark/pull/46228 [SPARK-47993][PYTHON] Drop Python 3.8 Since it's still alive and there will be an overlap between the lifecycle of Python 3.8 and Apache Spark 4.0.0, please give us your

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Nimrod Ofek
Of course, but it's in memory and not persisted, which is much faster. As I said, I believe that most of the interaction with it happens during planning and saving rather than during actual query execution, and those operations are short and minimal compared to data fetching and manipulation, so I don't believe it

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Mich Talebzadeh
Well, I would be surprised, because the Derby database is single-threaded and won't be of much use here. Most Hive metastores in the commercial world use PostgreSQL or Oracle as the backing database, which are battle-proven, replicated and backed up.

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Nimrod Ofek
Yes, an in-memory Hive catalog backed by a local Derby DB. And again, I presume that most metadata-related work happens during planning and not during the actual run, so I don't see why it should strongly affect query performance. Thanks, On Thu, Apr 25, 2024 at 17:29, Mich Talebzadeh <

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Mich Talebzadeh
With regard to your point below "The thing I'm missing is this: let's say that the output format I choose is delta lake or iceberg or whatever format that uses parquet. Where does the catalog implementation (which holds metadata afaik, same metadata that iceberg and delta lake save for their

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Wenchen Fan
It's for the data source. For example, Spark's built-in Parquet reader/writer is faster than the Hive serde Parquet reader/writer. On Thu, Apr 25, 2024 at 9:55 PM Mich Talebzadeh wrote: > I see a statement made as below and I quote > > "The proposal of SPARK-46122 is to switch the default
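To make the distinction in this reply concrete, here is a small sketch of the three `CREATE TABLE` forms involved. The helper `create_table_sql` and the table names are hypothetical; the statements are only built as strings, so actually running them would need a Spark session. The assumption, taken from the proposal quoted in this thread, is that with `spark.sql.legacy.createHiveTableByDefault` set to `false` the form without `USING` or `STORED AS` creates a Spark native table.

```python
CONF_KEY = "spark.sql.legacy.createHiveTableByDefault"


def create_table_sql(name, columns, using=None, stored_as=None):
    """Build a CREATE TABLE statement (hypothetical helper, strings only)."""
    if using and stored_as:
        raise ValueError("choose either USING or STORED AS, not both")
    cols = ", ".join(f"{col} {dtype}" for col, dtype in columns)
    sql = f"CREATE TABLE {name} ({cols})"
    if using:
        sql += f" USING {using}"          # Spark native data source table
    elif stored_as:
        sql += f" STORED AS {stored_as}"  # Hive serde table
    return sql  # with neither clause, the config above decides the table kind


columns = [("id", "INT"), ("name", "STRING")]
native = create_table_sql("t_native", columns, using="parquet")
hive = create_table_sql("t_hive", columns, stored_as="parquet")
default = create_table_sql("t_default", columns)
set_conf = f"SET {CONF_KEY}=false"
```

Only the third statement is affected by the proposed default change; the first two name their table kind explicitly.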

Re: Which version of spark version supports parquet version 2 ?

2024-04-25 Thread Prem Sahoo
Hello Spark, After discussing with the Parquet and PyArrow communities, we understand that the config below lets Spark write Parquet V2 files: *hadoopConfiguration.set("parquet.writer.version", "v2")*. *If this is set while creating Parquet files, then those are V2 Parquet.* *Could you please confirm?* >
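As a rough illustration of the setting quoted above, here is a minimal PySpark sketch. It assumes (this is not stated in the thread) that the Hadoop property is passed through Spark's `spark.hadoop.` configuration prefix instead of calling `hadoopConfiguration.set` directly; `build_session`, `hadoop_property` and the app name are hypothetical. Whether the resulting files are what the thread means by Parquet V2 is the question still awaiting confirmation.

```python
# Hadoop-level properties can be handed to Spark by prefixing them with
# "spark.hadoop."; Spark copies them into the Hadoop Configuration it uses.
PARQUET_V2_CONF = {"spark.hadoop.parquet.writer.version": "v2"}


def build_session(app_name="parquet-v2-sketch"):
    """Create a SparkSession carrying the writer-version setting (hypothetical helper)."""
    # Imported lazily so that this module can be loaded without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in PARQUET_V2_CONF.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


def hadoop_property(conf_key):
    """Strip the "spark.hadoop." prefix to recover the underlying Hadoop key."""
    prefix = "spark.hadoop."
    return conf_key[len(prefix):] if conf_key.startswith(prefix) else conf_key
```

With pyspark installed, a call such as `build_session().range(10).write.parquet(path)` would write with that setting applied; `path` is a placeholder.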

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Nimrod Ofek
Thanks for the detailed answer. The thing I'm missing is this: let's say that the output format I choose is delta lake or iceberg or whatever format that uses parquet. Where does the catalog implementation (which holds metadata afaik, same metadata that iceberg and delta lake save for their tables

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Mich Talebzadeh
My take regarding your question is that your mileage varies, so to speak. 1) Hive provides a more mature and widely adopted catalog solution that integrates well with other components in the Hadoop ecosystem, such as HDFS, HBase, and YARN. If you are Hadoop-centric (say on-premise), using Hive

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Nimrod Ofek
I would also appreciate some material that describes the differences between Spark native tables and Hive tables, and why each should be used... Thanks Nimrod On Thu, Apr 25, 2024 at 14:27, Mich Talebzadeh <mich.talebza...@gmail.com> wrote: > I see a statement made as below and I quote > >

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Mich Talebzadeh
I see a statement made as below and I quote "The proposal of SPARK-46122 is to switch the default value of this configuration from `true` to `false` to use Spark native tables because we support better." Can you please elaborate on the above specifically with regard to the phrase ".. because we

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Wenchen Fan
+1 On Thu, Apr 25, 2024 at 2:46 PM Kent Yao wrote: > +1 > > Nit: the umbrella ticket is SPARK-44111, not SPARK-4. > > Thanks, > Kent Yao > > On Thu, Apr 25, 2024 at 14:39, Dongjoon Hyun wrote: > > > > Hi, All. > > > > It's great to see community activities to polish 4.0.0 more and more. > > Thank you

Re: [DISCUSS] SPARK-46122: Set spark.sql.legacy.createHiveTableByDefault to false

2024-04-25 Thread Kent Yao
+1 Nit: the umbrella ticket is SPARK-44111, not SPARK-4. Thanks, Kent Yao On Thu, Apr 25, 2024 at 14:39, Dongjoon Hyun wrote: > > Hi, All. > > It's great to see community activities to polish 4.0.0 more and more. > Thank you all. > > I'd like to bring SPARK-46122 (another SQL topic) to you from the