Hi Prem,

Regrettably, this is not my area of speciality. I trust another colleague will have a more informed view. Alternatively, you may raise an SPIP for it:
Spark Project Improvement Proposals (SPIP) | Apache Spark <https://spark.apache.org/improvement-proposals.html>

HTH

Mich Talebzadeh,
Technologist | Solutions Architect | Data Engineer | Generative AI
London
United Kingdom

view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

https://en.everybodywiki.com/Mich_Talebzadeh

*Disclaimer:* The information provided is correct to the best of my knowledge but of course cannot be guaranteed. It is essential to note that, as with any advice, "one test result is worth one-thousand expert opinions" (Wernher von Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>).

On Tue, 16 Apr 2024 at 18:17, Prem Sahoo <prem.re...@gmail.com> wrote:

> Hello Mich,
> Thanks for the example.
> I have the same parquet-mr version, which creates Parquet version 1. We need to create V2 as it is more optimized: with Dremio, Parquet V2 is about 75% better than Parquet V1 for reads and 25% better for writes, so we are inclined to go this way. Please let us know why Spark is not moving towards Parquet V2?
> Sent from my iPhone
>
> On Apr 16, 2024, at 1:04 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
> Well, let us do a test in PySpark. Take this code and create a default Parquet file. My Spark is 3.4.
>
> cat parquet_check.py
>
> from pyspark.sql import SparkSession
>
> spark = SparkSession.builder.appName("ParquetVersionExample").getOrCreate()
>
> data = [("London", 8974432), ("New York City", 8804348), ("Beijing", 21893000)]
> df = spark.createDataFrame(data, ["city", "population"])
>
> df.write.mode("overwrite").parquet("parquet_example")  # creates files in an HDFS directory
>
> Use a tool called parquet-tools (installable with pip from https://pypi.org/project/parquet-tools/).
>
> Get the Parquet files from HDFS to the current directory, say:
>
> hdfs dfs -get /user/hduser/parquet_example .
> cd ./parquet_example
>
> Do an ls and pick a file like the one below to inspect:
>
> parquet-tools inspect part-00003-c33854c8-a8b6-4315-bf51-20198ce0ba62-c000.snappy.parquet
>
> Now this is the output:
>
> ############ file meta data ############
> created_by: parquet-mr version 1.12.3 (build f8dced182c4c1fbdec6ccb3185537b5a01e6ed6b)
> num_columns: 2
> num_rows: 1
> num_row_groups: 1
> format_version: 1.0
> serialized_size: 563
>
> ############ Columns ############
> city
> population
>
> ############ Column(city) ############
> name: city
> path: city
> max_definition_level: 1
> max_repetition_level: 0
> physical_type: BYTE_ARRAY
> logical_type: String
> converted_type (legacy): UTF8
> compression: SNAPPY (space_saved: -5%)
>
> ############ Column(population) ############
> name: population
> path: population
> max_definition_level: 1
> max_repetition_level: 0
> physical_type: INT64
> logical_type: None
> converted_type (legacy): NONE
> compression: SNAPPY (space_saved: -5%)
>
> File Information:
>
> - format_version: 1.0: This line explicitly states that the format version of the Parquet file is 1.0, which corresponds to Parquet version 1.
> - created_by: parquet-mr version 1.12.3: While this doesn't directly specify the format version, it is accepted that versions of parquet-mr such as 1.12.3 write Parquet version 1 files by default.
>
> Since in this case Spark 3.4 is capable of reading both versions (1 and 2), you don't necessarily need to modify your Spark code to access this file. However, if you want to create Parquet files in version 2 using Spark, you might need to consider additional changes such as excluding parquet-mr, or upgrading the Parquet libraries and doing a custom build of Spark. However, given the law of diminishing returns, I would not advise that either. You can of course use gzip compression, which may be more suitable for your needs.
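A full custom build may not be the only way to experiment: parquet-mr's own ParquetOutputFormat reads a writer property, parquet.writer.version, from the Hadoop configuration, and Spark passes its Hadoop configuration through to the Parquet writer. The following is a configuration sketch under that assumption; the application name and output path are illustrative, and whether a given Spark build forwards the property end to end should be verified afterwards with parquet-tools inspect.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ParquetV2Example").getOrCreate()

# parquet-mr's ParquetOutputFormat reads "parquet.writer.version" from the
# Hadoop configuration; PARQUET_2_0 requests v2 data page encodings
# (assumption: this Spark build forwards the property to parquet-mr).
spark.sparkContext._jsc.hadoopConfiguration().set(
    "parquet.writer.version", "PARQUET_2_0")

data = [("London", 8974432), ("New York City", 8804348), ("Beijing", 21893000)]
df = spark.createDataFrame(data, ["city", "population"])
df.write.mode("overwrite").parquet("parquet_v2_example")

# Afterwards, re-run parquet-tools inspect on a part file and check
# whether format_version has changed from 1.0.
```

Note this only changes the encodings parquet-mr emits; it does not affect the caution elsewhere in the thread about v2 encodings not being finalized.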
>
> HTH
>
> Mich Talebzadeh
>
> On Tue, 16 Apr 2024 at 15:00, Prem Sahoo <prem.re...@gmail.com> wrote:
>
>> Hello Community,
>> Could any of you shed some light on the questions below, please?
>> Sent from my iPhone
>>
>> On Apr 15, 2024, at 9:02 PM, Prem Sahoo <prem.re...@gmail.com> wrote:
>>
>> Is there any specific reason Spark does not support, or the community doesn't want to move to, Parquet V2, which is more optimized and much faster at reads and writes (per another component which I am using)?
>>
>> On Mon, Apr 15, 2024 at 7:55 PM Ryan Blue <b...@tabular.io> wrote:
>>
>>> Spark will read data written with v2 encodings just fine. You just don't need to worry about making Spark produce v2. And you should probably also not produce v2 encodings from other systems.
>>>
>>> On Mon, Apr 15, 2024 at 4:37 PM Prem Sahoo <prem.re...@gmail.com> wrote:
>>>
>>>> Oops, so Spark does not support Parquet V2 at the moment? We have a use case where we need Parquet V2, as one of our components uses Parquet V2.
>>>>
>>>> On Mon, Apr 15, 2024 at 7:09 PM Ryan Blue <b...@tabular.io> wrote:
>>>>
>>>>> Hi Prem,
>>>>>
>>>>> Parquet v1 is the default because v2 has not been finalized and adopted by the community. I highly recommend not using v2 encodings at this time.
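The claim that readers handle v2 encodings fine can be checked with a round trip: write a file with v2 data pages and read it back. A minimal sketch using pyarrow's data_page_version parameter (an assumption here is that pyarrow is installed; if it is not, the helper simply reports None):

```python
import os
import tempfile

try:
    import pyarrow as pa
    import pyarrow.parquet as pq
    HAVE_PYARROW = True
except ImportError:
    HAVE_PYARROW = False


def v2_round_trip_ok():
    """Write a table with v2 data pages, read it back, and compare.

    Returns True on a faithful round trip, None if pyarrow is absent.
    """
    if not HAVE_PYARROW:
        return None
    table = pa.table({"city": ["London", "Beijing"],
                      "population": [8974432, 21893000]})
    path = os.path.join(tempfile.mkdtemp(), "v2_example.parquet")
    # data_page_version="2.0" asks the writer for v2 data page headers.
    pq.write_table(table, path, data_page_version="2.0")
    return pq.read_table(path).equals(table)
```

This exercises the data page format only; it says nothing about whether producing v2 files for other consumers is advisable.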
>>>>>
>>>>> Ryan
>>>>>
>>>>> On Mon, Apr 15, 2024 at 3:05 PM Prem Sahoo <prem.re...@gmail.com> wrote:
>>>>>
>>>>>> I am using Spark 3.2.0, but my Spark package comes with parquet-mr 1.2.1, which writes Parquet version 1, not version 2 :(. So I was looking at how to write in Parquet version 2?
>>>>>>
>>>>>> On Mon, Apr 15, 2024 at 5:05 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>
>>>>>>> Sorry, you have a point there. It was released in version 3.0. What version of Spark are you using?
>>>>>>>
>>>>>>> On Mon, 15 Apr 2024 at 21:33, Prem Sahoo <prem.re...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Thank you so much for the info! But do we have any release notes where it says Spark 2.4.0 onwards supports Parquet version 2? I was under the impression it started being supported from Spark 3.0 onwards.
>>>>>>>>
>>>>>>>> On Mon, Apr 15, 2024 at 4:28 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Well, if I am correct, Parquet version 2 support was introduced in Spark version 2.4.0. Therefore, any version of Spark starting from 2.4.0 supports Parquet version 2.
>>>>>>>>> Assuming that you are using Spark version 2.4.0 or later, you should be able to take advantage of Parquet version 2 features.
>>>>>>>>>
>>>>>>>>> HTH
>>>>>>>>>
>>>>>>>>> Mich Talebzadeh
>>>>>>>>>
>>>>>>>>> On Mon, 15 Apr 2024 at 20:53, Prem Sahoo <prem.re...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Thank you for the information!
>>>>>>>>>> I can use any version of parquet-mr to produce Parquet files.
>>>>>>>>>>
>>>>>>>>>> Regarding the 2nd question:
>>>>>>>>>> Which version of Spark supports Parquet version 2?
>>>>>>>>>> May I get the release notes where Parquet versions are mentioned?
>>>>>>>>>>
>>>>>>>>>> On Mon, Apr 15, 2024 at 2:34 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Parquet-mr is a Java library that provides functionality for working with Parquet files with Hadoop. It is therefore more geared towards working with Parquet files within the Hadoop ecosystem, particularly using MapReduce jobs. There is no definitive way to check exact compatible versions within the library itself.
>>>>>>>>>>> However, you can have a look at this:
>>>>>>>>>>>
>>>>>>>>>>> https://github.com/apache/parquet-mr/blob/master/CHANGES.md
>>>>>>>>>>>
>>>>>>>>>>> HTH
>>>>>>>>>>>
>>>>>>>>>>> Mich Talebzadeh
>>>>>>>>>>>
>>>>>>>>>>> On Mon, 15 Apr 2024 at 18:59, Prem Sahoo <prem.re...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hello Team,
>>>>>>>>>>>> May I know how to check which version of Parquet is supported by parquet-mr 1.2.1?
>>>>>>>>>>>>
>>>>>>>>>>>> Which version of parquet-mr supports Parquet version 2 (V2)?
>>>>>>>>>>>>
>>>>>>>>>>>> Which version of Spark supports Parquet version 2?
>>>>>>>>>>>> May I get the release notes where Parquet versions are mentioned?
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Tabular
>>>
>>> --
>>> Ryan Blue
>>> Tabular