Hi Prem,

Regrettably, this is not my area of speciality. I trust another colleague will have a more informed view. Alternatively, you may raise an SPIP for it:
Spark Project Improvement Proposals (SPIP) | Apache Spark <https://spark.apache.org/improvement-proposals.html>

HTH

Mich Talebzadeh,
Technologist | Solutions Architect | Data Engineer | Generative AI
London
United Kingdom

view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

https://en.everybodywiki.com/Mich_Talebzadeh

*Disclaimer:* The information provided is correct to the best of my knowledge but of course cannot be guaranteed. It is essential to note that, as with any advice, "one test result is worth one-thousand expert opinions" (Wernher von Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>).

On Tue, 16 Apr 2024 at 18:17, Prem Sahoo <prem.re...@gmail.com> wrote:

> Hello Mich,
> Thanks for the example.
> I have the same parquet-mr version, which creates Parquet version 1. We need to create V2 as it is more optimized: with Dremio, Parquet V2 is about 75% better than Parquet V1 for reads and 25% better for writes, so we are inclined to go this way. Please let us know why Spark is not moving towards Parquet V2?
> Sent from my iPhone
>
> On Apr 16, 2024, at 1:04 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
> Well, let us do a test in PySpark. Take this code and create a default Parquet file. My Spark is 3.4.
>
> cat parquet_check.py
>
> from pyspark.sql import SparkSession
>
> spark = SparkSession.builder.appName("ParquetVersionExample").getOrCreate()
>
> data = [("London", 8974432), ("New York City", 8804348), ("Beijing", 21893000)]
> df = spark.createDataFrame(data, ["city", "population"])
>
> df.write.mode("overwrite").parquet("parquet_example")  # creates files in an HDFS directory
>
> Use a tool called parquet-tools (installable with pip from https://pypi.org/project/parquet-tools/).
>
> Get the Parquet files from HDFS to the current directory, say:
>
> hdfs dfs -get /user/hduser/parquet_example .
> cd ./parquet_example
>
> Do an ls and pick a file like the one below to inspect:
>
> parquet-tools inspect part-00003-c33854c8-a8b6-4315-bf51-20198ce0ba62-c000.snappy.parquet
>
> Now this is the output:
>
> ############ file meta data ############
> created_by: parquet-mr version 1.12.3 (build f8dced182c4c1fbdec6ccb3185537b5a01e6ed6b)
> num_columns: 2
> num_rows: 1
> num_row_groups: 1
> format_version: 1.0
> serialized_size: 563
>
> ############ Columns ############
> city
> population
>
> ############ Column(city) ############
> name: city
> path: city
> max_definition_level: 1
> max_repetition_level: 0
> physical_type: BYTE_ARRAY
> logical_type: String
> converted_type (legacy): UTF8
> compression: SNAPPY (space_saved: -5%)
>
> ############ Column(population) ############
> name: population
> path: population
> max_definition_level: 1
> max_repetition_level: 0
> physical_type: INT64
> logical_type: None
> converted_type (legacy): NONE
> compression: SNAPPY (space_saved: -5%)
>
> File Information:
>
> - format_version: 1.0: This line explicitly states that the format version of the Parquet file is 1.0, which corresponds to Parquet version 1.
> - created_by: parquet-mr version 1.12.3: While this doesn't directly specify the format version, it is accepted that versions of parquet-mr such as 1.12.3 write Parquet version 1 files by default.
>
> Since in this case Spark 3.4 is capable of reading both versions (1 and 2), you don't necessarily need to modify your Spark code to access this file. However, if you want to create Parquet files in version 2 using Spark, you might need to consider additional changes such as excluding parquet-mr, or upgrading the Parquet libraries and doing a custom build of Spark. However, given the law of diminishing returns, I would not advise that either. You can of course use gzip compression, which may be more suitable for your needs.
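A full custom build may not be the only way to experiment: parquet-mr's own ParquetOutputFormat reads a writer property, parquet.writer.version, from the Hadoop configuration, and Spark passes its Hadoop configuration through to the Parquet writer. The following is a configuration sketch under that assumption; the application name and output path are illustrative, and whether a given Spark build forwards the property end to end should be verified afterwards with parquet-tools inspect.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ParquetV2Example").getOrCreate()

# parquet-mr's ParquetOutputFormat reads "parquet.writer.version" from the
# Hadoop configuration; PARQUET_2_0 requests v2 data page encodings
# (assumption: this Spark build forwards the property to parquet-mr).
spark.sparkContext._jsc.hadoopConfiguration().set(
    "parquet.writer.version", "PARQUET_2_0")

data = [("London", 8974432), ("New York City", 8804348), ("Beijing", 21893000)]
df = spark.createDataFrame(data, ["city", "population"])
df.write.mode("overwrite").parquet("parquet_v2_example")

# Afterwards, re-run parquet-tools inspect on a part file and check
# whether format_version has changed from 1.0.
```

Note this only changes the encodings parquet-mr emits; it does not affect the caution elsewhere in the thread about v2 encodings not being finalized.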
>
> HTH
>
> Mich Talebzadeh
>
> On Tue, 16 Apr 2024 at 15:00, Prem Sahoo <prem.re...@gmail.com> wrote:
>
>> Hello Community,
>> Could any of you shed some light on the questions below, please?
>> Sent from my iPhone
>>
>> On Apr 15, 2024, at 9:02 PM, Prem Sahoo <prem.re...@gmail.com> wrote:
>>
>> Is there any specific reason Spark does not support, or the community doesn't want to move to, Parquet V2, which is more optimized and much faster at reads and writes (per another component which I am using)?
>>
>> On Mon, Apr 15, 2024 at 7:55 PM Ryan Blue <b...@tabular.io> wrote:
>>
>>> Spark will read data written with v2 encodings just fine. You just don't need to worry about making Spark produce v2. And you should probably also not produce v2 encodings from other systems.
>>>
>>> On Mon, Apr 15, 2024 at 4:37 PM Prem Sahoo <prem.re...@gmail.com> wrote:
>>>
>>>> Oops, so Spark does not support Parquet V2 at the moment? We have a use case where we need Parquet V2, as one of our components uses Parquet V2.
>>>>
>>>> On Mon, Apr 15, 2024 at 7:09 PM Ryan Blue <b...@tabular.io> wrote:
>>>>
>>>>> Hi Prem,
>>>>>
>>>>> Parquet v1 is the default because v2 has not been finalized and adopted by the community. I highly recommend not using v2 encodings at this time.
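The claim that readers handle v2 encodings fine can be checked with a round trip: write a file with v2 data pages and read it back. A minimal sketch using pyarrow's data_page_version parameter (an assumption here is that pyarrow is installed; if it is not, the helper simply reports None):

```python
import os
import tempfile

try:
    import pyarrow as pa
    import pyarrow.parquet as pq
    HAVE_PYARROW = True
except ImportError:
    HAVE_PYARROW = False


def v2_round_trip_ok():
    """Write a table with v2 data pages, read it back, and compare.

    Returns True on a faithful round trip, None if pyarrow is absent.
    """
    if not HAVE_PYARROW:
        return None
    table = pa.table({"city": ["London", "Beijing"],
                      "population": [8974432, 21893000]})
    path = os.path.join(tempfile.mkdtemp(), "v2_example.parquet")
    # data_page_version="2.0" asks the writer for v2 data page headers.
    pq.write_table(table, path, data_page_version="2.0")
    return pq.read_table(path).equals(table)
```

This exercises the data page format only; it says nothing about whether producing v2 files for other consumers is advisable.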
>>>>>
>>>>> Ryan
>>>>>
>>>>> On Mon, Apr 15, 2024 at 3:05 PM Prem Sahoo <prem.re...@gmail.com> wrote:
>>>>>
>>>>>> I am using Spark 3.2.0, but my Spark package comes with parquet-mr 1.2.1, which writes Parquet version 1, not version 2 :(. So I was looking at how to write in Parquet version 2?
>>>>>>
>>>>>> On Mon, Apr 15, 2024 at 5:05 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>
>>>>>>> Sorry, you have a point there. It was released in version 3.0. What version of Spark are you using?
>>>>>>>
>>>>>>> On Mon, 15 Apr 2024 at 21:33, Prem Sahoo <prem.re...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Thank you so much for the info! But do we have any release notes where it says Spark 2.4.0 onwards supports Parquet version 2? I was under the impression it started being supported from Spark 3.0 onwards.
>>>>>>>>
>>>>>>>> On Mon, Apr 15, 2024 at 4:28 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Well, if I am correct, Parquet version 2 support was introduced in Spark version 2.4.0. Therefore, any version of Spark starting from 2.4.0 supports Parquet version 2.
>>>>>>>>> Assuming that you are using Spark version 2.4.0 or later, you should be able to take advantage of Parquet version 2 features.
>>>>>>>>>
>>>>>>>>> HTH
>>>>>>>>>
>>>>>>>>> Mich Talebzadeh
>>>>>>>>>
>>>>>>>>> On Mon, 15 Apr 2024 at 20:53, Prem Sahoo <prem.re...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Thank you for the information!
>>>>>>>>>> I can use any version of parquet-mr to produce Parquet files.
>>>>>>>>>>
>>>>>>>>>> Regarding the 2nd question:
>>>>>>>>>> Which version of Spark supports Parquet version 2?
>>>>>>>>>> May I get the release notes where Parquet versions are mentioned?
>>>>>>>>>>
>>>>>>>>>> On Mon, Apr 15, 2024 at 2:34 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Parquet-mr is a Java library that provides functionality for working with Parquet files with Hadoop. It is therefore more geared towards working with Parquet files within the Hadoop ecosystem, particularly using MapReduce jobs. There is no definitive way to check exact compatible versions within the library itself.
>>>>>>>>>>> However, you can have a look at this:
>>>>>>>>>>>
>>>>>>>>>>> https://github.com/apache/parquet-mr/blob/master/CHANGES.md
>>>>>>>>>>>
>>>>>>>>>>> HTH
>>>>>>>>>>>
>>>>>>>>>>> Mich Talebzadeh
>>>>>>>>>>>
>>>>>>>>>>> On Mon, 15 Apr 2024 at 18:59, Prem Sahoo <prem.re...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hello Team,
>>>>>>>>>>>> May I know how to check which version of Parquet is supported by parquet-mr 1.2.1?
>>>>>>>>>>>>
>>>>>>>>>>>> Which version of parquet-mr supports Parquet version 2 (V2)?
>>>>>>>>>>>>
>>>>>>>>>>>> Which version of Spark supports Parquet version 2?
>>>>>>>>>>>> May I get the release notes where Parquet versions are mentioned?
>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Tabular
>>>
>>> --
>>> Ryan Blue
>>> Tabular