Re: Which version of spark version supports parquet version 2 ?

2024-04-26 Thread Prem Sahoo
Confirmed, closing this. Thanks, everyone, for the valuable information. On Apr 25, 2024, at 9:55 AM, Prem Sahoo wrote: > Hello Spark, after discussing with the Parquet and PyArrow communities, we can use the config below so that Spark can write Parquet V2 files.

Re: Which version of spark version supports parquet version 2 ?

2024-04-25 Thread Prem Sahoo
Hello Spark, After discussing with the Parquet and PyArrow communities, we can use the config below so that Spark can write Parquet V2 files: hadoopConfiguration.set("parquet.writer.version", "v2"). Parquet files created with this setting are then V2. Could you please confirm?
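The setting quoted above can also be supplied when the session is built. What follows is a minimal Python sketch, not a confirmed recipe from the thread: it relies on Spark copying any option prefixed with spark.hadoop. into the Hadoop configuration, which is where parquet-mr reads parquet.writer.version. The helper names parquet_writer_conf and build_session are hypothetical, and build_session assumes PySpark is installed. Other replies in this thread advise against v2 encodings because the spec is not finalized, so this is probably best limited to temporary data.

```python
from typing import Dict

# Hadoop key quoted in the thread:
# hadoopConfiguration.set("parquet.writer.version", "v2")
PARQUET_WRITER_VERSION_KEY = "parquet.writer.version"


def parquet_writer_conf(version: str = "v2") -> Dict[str, str]:
    """Return Spark options that forward the writer version to Hadoop.

    Spark copies options prefixed with "spark.hadoop." into the Hadoop
    configuration, so this mirrors the hadoopConfiguration.set call above.
    (Hypothetical helper, for illustration only.)
    """
    if version not in ("v1", "v2"):
        raise ValueError(f"unsupported Parquet writer version: {version!r}")
    return {f"spark.hadoop.{PARQUET_WRITER_VERSION_KEY}": version}


def build_session(app_name: str = "ParquetV2Example"):
    """Create a SparkSession with the v2 writer option applied.

    Assumes PySpark is installed; it is imported lazily so the helper
    above can be used and tested without it.
    """
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in parquet_writer_conf("v2").items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

With a session from build_session(), an ordinary df.write.parquet(...) call would pick up the writer version from the Hadoop configuration; whether the resulting files use v2 data pages should still be checked with a Parquet inspection tool.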

Re: Which version of spark version supports parquet version 2 ?

2024-04-19 Thread Steve Loughran
Those are some quite good improvements, but committing to storing all your data in an unstable format is, well, "bold". For temporary data as part of a workflow, though, it could be appealing. Now, assuming you are going to be working with S3, you might want to start with merging PARQUET-2117 into

Re: Which version of spark version supports parquet version 2 ?

2024-04-18 Thread Prem Sahoo
Thanks for the information below. On Apr 18, 2024, at 3:31 AM, Bjørn Jørgensen wrote: "Release 24.3 of Dremio will continue to write Parquet V1, since an average performance degradation of 1.5% was observed in writes and 6.5% was observed in queries when TPC-DS data was written

Re: Which version of spark version supports parquet version 2 ?

2024-04-18 Thread Bjørn Jørgensen
" *Release 24.3 of Dremio will continue to write Parquet V1, since an average performance degradation of 1.5% was observed in writes and 6.5% was observed in queries when TPC-DS data was written using Parquet V2 instead of Parquet V1. The aforementioned query performance tests utilized the C3

Re: Which version of spark version supports parquet version 2 ?

2024-04-17 Thread Mich Talebzadeh
Hi Prem, Regarding your question about writing Parquet v2 with Spark 3.2.0. Spark 3.2.0 limitations: Spark 3.2.0 doesn't have a built-in way to explicitly force Parquet v2 encoding. As we saw previously, even Spark 3.4 created a file with the parquet-mr version, indicating v1 encoding. Dremio v2 support: As

Re: Which version of spark version supports parquet version 2 ?

2024-04-17 Thread Prem Sahoo
Hello Ryan, May I know how to write Parquet V2 encodings from Spark 3.2.0? As far as I know, Dremio creates and reads Parquet V2. "Apache Parquet-MR Writer version PARQUET_2_0, which is widely adopted by engines that write Parquet data, supports delta encodings. However, these

Re: Which version of spark version supports parquet version 2 ?

2024-04-17 Thread Ryan Blue
Prem, as I said earlier, v2 is not a finalized spec so you should not use it. That's why it is not the default. You can get Spark to write v2 files, but it isn't recommended by the Parquet community. On Wed, Apr 17, 2024 at 11:05 AM Prem Sahoo wrote: > Hello Community, > Could anyone shed more

Re: Which version of spark version supports parquet version 2 ?

2024-04-17 Thread Prem Sahoo
Hello Community, Could anyone shed more light on this (Spark supporting Parquet V2)? On Tue, Apr 16, 2024 at 3:42 PM Mich Talebzadeh wrote: > Hi Prem, Regrettably, this is not my area of speciality. I trust another colleague will have a more informed idea. Alternatively, you may raise an

Re: Which version of spark version supports parquet version 2 ?

2024-04-16 Thread Mich Talebzadeh
Hi Prem, Regrettably, this is not my area of speciality. I trust another colleague will have a more informed idea. Alternatively, you may raise an SPIP (Spark Project Improvement Proposal) for it. HTH Mich Talebzadeh,

Re: Which version of spark version supports parquet version 2 ?

2024-04-16 Thread Prem Sahoo
Hello Mich, Thanks for the example. I have the same parquet-mr version, which creates Parquet version 1. We need to create V2 as it is more optimized. We have Dremio, where Parquet V2 is 75% better than Parquet V1 for reads and 25% better for writes, so we are inclined towards

Re: Which version of spark version supports parquet version 2 ?

2024-04-16 Thread Mich Talebzadeh
Well, let us do a test in PySpark. Take this code and create a default Parquet file. My Spark is 3.4.
cat parquet_checxk.py
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("ParquetVersionExample").getOrCreate()
data = [("London", 8974432), ("New York City", 8804348),

Re: Which version of spark version supports parquet version 2 ?

2024-04-16 Thread Prem Sahoo
Hello Community, Could any of you shed some light on the questions below, please? On Apr 15, 2024, at 9:02 PM, Prem Sahoo wrote: Is there any specific reason Spark does not support Parquet V2, or the community doesn't want to move to it? It is more optimized, and reads and writes are much faster

Re: Which version of spark version supports parquet version 2 ?

2024-04-15 Thread Prem Sahoo
Is there any specific reason Spark does not support Parquet V2, or the community doesn't want to move to it? It is more optimized, and reads and writes are much faster (from another component which I am using). On Mon, Apr 15, 2024 at 7:55 PM Ryan Blue wrote: > Spark will read data written with v2 encodings just

Re: Which version of spark version supports parquet version 2 ?

2024-04-15 Thread Ryan Blue
Spark will read data written with v2 encodings just fine. You just don't need to worry about making Spark produce v2. And you should probably also not produce v2 encodings from other systems. On Mon, Apr 15, 2024 at 4:37 PM Prem Sahoo wrote: > oops but so spark does not support parquet V2 atm

Re: Which version of spark version supports parquet version 2 ?

2024-04-15 Thread Prem Sahoo
Oops, so Spark does not support Parquet V2 at the moment? We have a use case where we need Parquet V2, as one of our components uses it. On Mon, Apr 15, 2024 at 7:09 PM Ryan Blue wrote: > Hi Prem, Parquet v1 is the default because v2 has not been finalized and adopted by the

Re: Which version of spark version supports parquet version 2 ?

2024-04-15 Thread Ryan Blue
Hi Prem, Parquet v1 is the default because v2 has not been finalized and adopted by the community. I highly recommend not using v2 encodings at this time. Ryan On Mon, Apr 15, 2024 at 3:05 PM Prem Sahoo wrote: > I am using spark 3.2.0 . but my spark package comes with parquet-mr 1.2.1 > which

Re: Which version of spark version supports parquet version 2 ?

2024-04-15 Thread Prem Sahoo
I am using Spark 3.2.0, but my Spark package comes with parquet-mr 1.2.1, which writes Parquet version 1, not version 2 :(. So I was looking for how to write Parquet version 2. On Mon, Apr 15, 2024 at 5:05 PM Mich Talebzadeh wrote: > Sorry you have a point there. It was released in

Re: Which version of spark version supports parquet version 2 ?

2024-04-15 Thread Mich Talebzadeh
Sorry, you have a point there. It was released in version 3.00. What version of Spark are you using?

Re: Which version of spark version supports parquet version 2 ?

2024-04-15 Thread Prem Sahoo
Thank you so much for the info! But do we have any release notes that say Spark 2.4.0 onwards supports Parquet version 2? I was under the impression that support started with Spark 3.0. On Mon, Apr 15, 2024 at 4:28 PM Mich Talebzadeh wrote: > Well if I am correct, Parquet version 2

Re: Which version of spark version supports parquet version 2 ?

2024-04-15 Thread Mich Talebzadeh
Well if I am correct, Parquet version 2 support was introduced in Spark version 2.4.0. Therefore, any version of Spark starting from 2.4.0 supports Parquet version 2. Assuming that you are using Spark version 2.4.0 or later, you should be able to take advantage of Parquet version 2 features. HTH

Re: Which version of spark version supports parquet version 2 ?

2024-04-15 Thread Prem Sahoo
Thank you for the information! I can use any version of parquet-mr to produce a Parquet file. Regarding the second question: which version of Spark supports Parquet version 2? May I get the release notes where Parquet versions are mentioned? On Mon, Apr 15, 2024 at 2:34 PM Mich Talebzadeh wrote:

Re: Which version of spark version supports parquet version 2 ?

2024-04-15 Thread Mich Talebzadeh
Parquet-mr is a Java library that provides functionality for working with Parquet files on Hadoop. It is therefore more geared towards working with Parquet files within the Hadoop ecosystem, particularly using MapReduce jobs. There is no definitive way to check exact compatible versions within

Which version of spark version supports parquet version 2 ?

2024-04-15 Thread Prem Sahoo
Hello Team, May I know how to check which version of Parquet is supported by parquet-mr 1.2.1? Which version of parquet-mr supports Parquet version 2 (V2)? Which version of Spark supports Parquet version 2? May I get the release notes where Parquet versions are mentioned?