Hello Community,
Could anyone shed more light on this (Spark supporting Parquet V2)?
On Tue, Apr 16, 2024 at 3:42 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Hi Prem,
>
> Regrettably this is not my area of speciality. I trust another colleague
> will have a more informed idea. Alternatively, you may raise an SPIP for it.
>
> Spark Project Improvement Proposals (SPIP) | Apache Spark
> <https://spark.apache.org/improvement-proposals.html>
>
> HTH
>
> Mich Talebzadeh,
> Technologist | Solutions Architect | Data Engineer | Generative AI
> London
> United Kingdom
>
> view my LinkedIn profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
> https://en.everybodywiki.com/Mich_Talebzadeh
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed. It is essential to note
> that, as with any advice: "one test result is worth one-thousand
> expert opinions" (Wernher von Braun
> <https://en.wikipedia.org/wiki/Wernher_von_Braun>).
>
> On Tue, 16 Apr 2024 at 18:17, Prem Sahoo <prem.re...@gmail.com> wrote:
>
>> Hello Mich,
>> Thanks for the example.
>> I have the same parquet-mr version, which creates Parquet version 1. We
>> need to create V2 as it is more optimized. With Dremio, Parquet V2 is
>> 75% faster than Parquet V1 for reads and 25% faster for writes, so we
>> are inclined to go this way. Please let us know why Spark is not moving
>> towards Parquet V2.
>> Sent from my iPhone
>>
>> On Apr 16, 2024, at 1:04 PM, Mich Talebzadeh <mich.talebza...@gmail.com>
>> wrote:
>>
>> Well, let us do a test in PySpark.
>>
>> Take this code and create a default parquet file.
>> My Spark is 3.4.
>>
>> cat parquet_check.py
>>
>> from pyspark.sql import SparkSession
>>
>> spark = SparkSession.builder.appName("ParquetVersionExample").getOrCreate()
>>
>> data = [("London", 8974432), ("New York City", 8804348), ("Beijing", 21893000)]
>> df = spark.createDataFrame(data, ["city", "population"])
>>
>> df.write.mode("overwrite").parquet("parquet_example")  # creates files in an HDFS directory
>>
>> Use a tool called parquet-tools (installable with pip from
>> https://pypi.org/project/parquet-tools/).
>>
>> Get the parquet files from HDFS into the current directory, say:
>>
>> hdfs dfs -get /user/hduser/parquet_example .
>> cd ./parquet_example
>>
>> Do an ls and pick a part file like the one below to inspect:
>>
>> parquet-tools inspect part-00003-c33854c8-a8b6-4315-bf51-20198ce0ba62-c000.snappy.parquet
>>
>> Now this is the output:
>>
>> ############ file meta data ############
>> created_by: parquet-mr version 1.12.3 (build f8dced182c4c1fbdec6ccb3185537b5a01e6ed6b)
>> num_columns: 2
>> num_rows: 1
>> num_row_groups: 1
>> format_version: 1.0
>> serialized_size: 563
>>
>> ############ Columns ############
>> city
>> population
>>
>> ############ Column(city) ############
>> name: city
>> path: city
>> max_definition_level: 1
>> max_repetition_level: 0
>> physical_type: BYTE_ARRAY
>> logical_type: String
>> converted_type (legacy): UTF8
>> compression: SNAPPY (space_saved: -5%)
>>
>> ############ Column(population) ############
>> name: population
>> path: population
>> max_definition_level: 1
>> max_repetition_level: 0
>> physical_type: INT64
>> logical_type: None
>> converted_type (legacy): NONE
>> compression: SNAPPY (space_saved: -5%)
>>
>> File information:
>>
>> - format_version: 1.0: This line explicitly states that the format
>> version of the Parquet file is 1.0, which corresponds to Parquet
>> version 1.
>> - created_by: parquet-mr version 1.12.3: While this doesn't directly
>> specify the format version, it is accepted that older versions of
>> parquet-mr like 1.12.3 typically write Parquet version 1 files.
>>
>> Since in this case Spark 3.4 is capable of reading both versions (1 and
>> 2), you don't necessarily need to modify your Spark code to access this
>> file. However, if you want to create Parquet files in version 2 using
>> Spark, you might need to consider additional changes like excluding
>> parquet-mr or upgrading the Parquet libraries and doing a custom build
>> of Spark. However, given the law of diminishing returns, I would not
>> advise that either. You can of course use gzip compression, which may
>> be more suitable for your needs.
>>
>> HTH
>>
>> Mich Talebzadeh
>>
>> On Tue, 16 Apr 2024 at 15:00, Prem Sahoo <prem.re...@gmail.com> wrote:
>>
>>> Hello Community,
>>> Could any of you shed some light on the questions below, please?
>>> Sent from my iPhone
>>>
>>> On Apr 15, 2024, at 9:02 PM, Prem Sahoo <prem.re...@gmail.com> wrote:
>>>
>>> Is there any specific reason Spark does not support, or the community
>>> doesn't want to move to, Parquet V2, which is more optimized and much
>>> faster for both reads and writes (from the other component I am using)?
>>>
>>> On Mon, Apr 15, 2024 at 7:55 PM Ryan Blue <b...@tabular.io> wrote:
>>>
>>>> Spark will read data written with v2 encodings just fine. You just
>>>> don't need to worry about making Spark produce v2. And you should
>>>> probably also not produce v2 encodings from other systems.
>>>>
>>>> On Mon, Apr 15, 2024 at 4:37 PM Prem Sahoo <prem.re...@gmail.com>
>>>> wrote:
>>>>
>>>>> Oops, so Spark does not support Parquet V2 at the moment? We have a
>>>>> use case where we need Parquet V2, as one of our components uses
>>>>> Parquet V2.
>>>>>
>>>>> On Mon, Apr 15, 2024 at 7:09 PM Ryan Blue <b...@tabular.io> wrote:
>>>>>
>>>>>> Hi Prem,
>>>>>>
>>>>>> Parquet v1 is the default because v2 has not been finalized and
>>>>>> adopted by the community. I highly recommend not using v2 encodings
>>>>>> at this time.
>>>>>>
>>>>>> Ryan
>>>>>>
>>>>>> On Mon, Apr 15, 2024 at 3:05 PM Prem Sahoo <prem.re...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I am using Spark 3.2.0, but my Spark package comes with parquet-mr
>>>>>>> 1.2.1, which writes Parquet version 1, not version 2 :(. So I was
>>>>>>> looking at how to write in Parquet version 2.
>>>>>>>
>>>>>>> On Mon, Apr 15, 2024 at 5:05 PM Mich Talebzadeh <
>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Sorry, you have a point there. It was released in version 3.0.0.
>>>>>>>> What version of Spark are you using?
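[Editor's note: on the "how to write version 2" question above, parquet-mr itself selects its writer version from the Hadoop configuration property parquet.writer.version, which Spark can pass through with the spark.hadoop. prefix. This is an unofficial, unsupported knob and is not verified here; treat it as a sketch only, and note Ryan's advice in this thread against producing v2 encodings. The script name is a placeholder.]

```shell
# Sketch: ask Spark's bundled parquet-mr for V2 data pages.
# parquet.writer.version accepts PARQUET_1_0 (default) or PARQUET_2_0.
spark-submit \
  --conf spark.hadoop.parquet.writer.version=PARQUET_2_0 \
  your_job.py   # placeholder script name
```

Whether a given Spark build honors this end to end should be confirmed by inspecting the resulting file's footer, as done elsewhere in this thread.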
>>>>>>>>
>>>>>>>> Mich Talebzadeh
>>>>>>>>
>>>>>>>> On Mon, 15 Apr 2024 at 21:33, Prem Sahoo <prem.re...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Thank you so much for the info! But do we have any release notes
>>>>>>>>> where it says Spark 2.4.0 onwards supports Parquet version 2? I
>>>>>>>>> was under the impression it started being supported from Spark
>>>>>>>>> 3.0 onwards.
>>>>>>>>>
>>>>>>>>> On Mon, Apr 15, 2024 at 4:28 PM Mich Talebzadeh <
>>>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Well, if I am correct, Parquet version 2 support was introduced
>>>>>>>>>> in Spark version 2.4.0. Therefore, any version of Spark starting
>>>>>>>>>> from 2.4.0 supports Parquet version 2. Assuming that you are
>>>>>>>>>> using Spark version 2.4.0 or later, you should be able to take
>>>>>>>>>> advantage of Parquet version 2 features.
>>>>>>>>>>
>>>>>>>>>> HTH
>>>>>>>>>>
>>>>>>>>>> Mich Talebzadeh
>>>>>>>>>>
>>>>>>>>>> On Mon, 15 Apr 2024 at 20:53, Prem Sahoo <prem.re...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thank you for the information!
>>>>>>>>>>> I can use any version of parquet-mr to produce parquet files.
>>>>>>>>>>>
>>>>>>>>>>> Regarding the 2nd question:
>>>>>>>>>>> Which version of Spark supports Parquet version 2?
>>>>>>>>>>> May I get the release notes where parquet versions are
>>>>>>>>>>> mentioned?
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Apr 15, 2024 at 2:34 PM Mich Talebzadeh <
>>>>>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> parquet-mr is a Java library that provides functionality for
>>>>>>>>>>>> working with Parquet files in Hadoop. It is therefore geared
>>>>>>>>>>>> towards working with Parquet files within the Hadoop
>>>>>>>>>>>> ecosystem, particularly using MapReduce jobs. There is no
>>>>>>>>>>>> definitive way to check exact compatible versions within the
>>>>>>>>>>>> library itself.
>>>>>>>>>>>> However, you can have a look at this:
>>>>>>>>>>>>
>>>>>>>>>>>> https://github.com/apache/parquet-mr/blob/master/CHANGES.md
>>>>>>>>>>>>
>>>>>>>>>>>> HTH
>>>>>>>>>>>>
>>>>>>>>>>>> Mich Talebzadeh
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, 15 Apr 2024 at 18:59, Prem Sahoo <prem.re...@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hello Team,
>>>>>>>>>>>>> May I know how to check which version of Parquet is supported
>>>>>>>>>>>>> by parquet-mr 1.2.1?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Which version of parquet-mr supports Parquet version 2 (V2)?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Which version of Spark supports Parquet version 2?
>>>>>>>>>>>>> May I get the release notes where parquet versions are
>>>>>>>>>>>>> mentioned?
>>>>>>
>>>>>> --
>>>>>> Ryan Blue
>>>>>> Tabular