Re: Which version of spark version supports parquet version 2 ?

Steve Loughran Fri, 19 Apr 2024 07:19:00 -0700

Those are some quite good improvements -but committing to storing all your
data in an unstable format, is, well, "bold". For temporary data as part of
a workflow though, it could be appealing


Now, assuming you are going to be working with s3, you might want to start
with merging PARQUET-2117 into your version, as it is delivering tangible
speedups through parallel GET of different range downloads, at least
according to our most recent test runs.

[image: Screenshot 2024-04-12 at 11.38.02 AM.png]

what would be interesting is to see how the two combine: v2 and java21 AVX
processing of data, plus the 4X improvement in data retrieval (we limit the
#of active requests per stream to realistic numbers you can use in
production, FWIW).

See also; An Empirical Evaluation of Columnar Storage Formats
https://arxiv.org/abs/2304.05028

On Thu, 18 Apr 2024 at 08:31, Bjørn Jørgensen <bjornjorgen...@gmail.com>
wrote:

> " *Release 24.3 of Dremio will continue to write Parquet V1, since an
> average performance degradation of 1.5% was observed in writes and 6.5% was
> observed in queries when TPC-DS data was written using Parquet V2 instead
> of Parquet V1.  The aforementioned query performance tests utilized the C3
> cache to store data.*"
> (...)
> "*Users can enable Parquet V2 on write using the following configuration
> key.*
>
> ALTER SYSTEM SET "store.parquet.writer.version" = 'v2' "
>
> https://www.dremio.com/blog/vectorized-reading-of-parquet-v2-improves-performance-up-to-75/
>
> "*Java Vector API support*
>
>
>
>
>
>
>
>
>
>
>
> *The feature is experimental and is currently not part of the parquet
> distribution. Parquet-MR has supported Java Vector API to speed up reading,
> to enable this feature:Java 17+, 64-bitRequiring the CPU to support
> instruction sets:avx512vbmiavx512_vbmi2To build the jars: mvn clean package
> -P vector-pluginsFor Apache Spark to enable this feature:Build parquet and
> replace the parquet-encoding-{VERSION}.jar on the spark jars folderBuild
> parquet-encoding-vector and copy parquet-encoding-vector-{VERSION}.jar to
> the spark jars folderEdit spark class#VectorizedRleValuesReader,
> function#readNextGroup refer to parquet class#ParquetReadRouter,
> function#readBatchUsing512VectorBuild spark with maven and replace
> spark-sql_2.12-{VERSION}.jar on the spark jars folder*"
>
>
> https://github.com/apache/parquet-mr?tab=readme-ov-file#java-vector-api-support
>
> You are using spark 3.2.0
> spark version 3.2.4 was released April 13, 2023
> https://spark.apache.org/releases/spark-release-3-2-4.html
> You are using a spark version that is EOL.
>
> tor. 18. apr. 2024 kl. 00:25 skrev Prem Sahoo <prem.re...@gmail.com>:
>
>> Hello Ryan,
>> May I know how you can write Parquet V2 encoding from spark 3.2.0 ?  As
>> per my knowledge Dremio is creating and reading Parquet V2.
>> "Apache Parquet-MR Writer version PARQUET_2_0, which is widely adopted
>> by engines that write Parquet data, supports delta encodings. However,
>> these encodings were not previously supported by Dremio's vectorized
>> Parquet reader, resulting in decreased speed. Now, in version 24.3 and
>> Dremio Cloud, when you use the Dremio SQL query engine on Parquet datasets,
>> you’ll receive best-in-class performance."
>>
>> Could you let me know where Parquet Community is not recommending Parquet
>> V2 ?
>>
>>
>>
>> On Wed, Apr 17, 2024 at 2:44 PM Ryan Blue <b...@tabular.io> wrote:
>>
>>> Prem, as I said earlier, v2 is not a finalized spec so you should not
>>> use it. That's why it is not the default. You can get Spark to write v2
>>> files, but it isn't recommended by the Parquet community.
>>>
>>> On Wed, Apr 17, 2024 at 11:05 AM Prem Sahoo <prem.re...@gmail.com>
>>> wrote:
>>>
>>>> Hello Community,
>>>> Could anyone shed more light on this (Spark Supporting Parquet V2)?
>>>>
>>>> On Tue, Apr 16, 2024 at 3:42 PM Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>>> Hi Prem,
>>>>>
>>>>> Regrettably this is not my area of speciality. I trust
>>>>> another colleague will have a more informed idea. Alternatively you may
>>>>> raise an SPIP for it.
>>>>>
>>>>> Spark Project Improvement Proposals (SPIP) | Apache Spark
>>>>> <https://spark.apache.org/improvement-proposals.html>
>>>>>
>>>>> HTH
>>>>>
>>>>> Mich Talebzadeh,
>>>>> Technologist | Solutions Architect | Data Engineer  | Generative AI
>>>>> London
>>>>> United Kingdom
>>>>>
>>>>>
>>>>>    view my Linkedin profile
>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>
>>>>>
>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>
>>>>>
>>>>>
>>>>> *Disclaimer:* The information provided is correct to the best of my
>>>>> knowledge but of course cannot be guaranteed . It is essential to note
>>>>> that, as with any advice, quote "one test result is worth one-thousand
>>>>> expert opinions (Werner
>>>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von Braun
>>>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".
>>>>>
>>>>>
>>>>> On Tue, 16 Apr 2024 at 18:17, Prem Sahoo <prem.re...@gmail.com> wrote:
>>>>>
>>>>>> Hello Mich,
>>>>>> Thanks for example.
>>>>>> I have the same parquet-mr version which creates Parquet version 1.
>>>>>> We need to create V2 as it is more optimized. We have Dremio where if we
>>>>>> use Parquet V2 it is 75% better than Parquet V1 in case of read and 25 %
>>>>>> better in case of write . so we are inclined towards this way.  Please 
>>>>>> let
>>>>>> us know why Spark is not going towards Parquet V2 ?
>>>>>> Sent from my iPhone
>>>>>>
>>>>>> On Apr 16, 2024, at 1:04 PM, Mich Talebzadeh <
>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>
>>>>>> 
>>>>>> Well let us do a test in PySpark.
>>>>>>
>>>>>> Take this code and create a default parquet file. My spark is 3.4
>>>>>>
>>>>>> cat parquet_checxk.py
>>>>>> from pyspark.sql import SparkSession
>>>>>>
>>>>>> spark =
>>>>>> SparkSession.builder.appName("ParquetVersionExample").getOrCreate()
>>>>>>
>>>>>> data = [("London", 8974432), ("New York City", 8804348), ("Beijing",
>>>>>> 21893000)]
>>>>>> df = spark.createDataFrame(data, ["city", "population"])
>>>>>>
>>>>>> df.write.mode("overwrite").parquet("parquet_example")  # it create
>>>>>> file in hdfs directory
>>>>>>
>>>>>> Use a tool called parquet-tools (downloadable using pip from
>>>>>> https://pypi.org/project/parquet-tools/)
>>>>>>
>>>>>> Get the parquet files from hdfs to the current directory say
>>>>>>
>>>>>> hdfs dfs -get /user/hduser/parquet_example .
>>>>>> cd ./parquet_example
>>>>>> do an ls and pickup file 3 like below to inspect
>>>>>>  parquet-tools inspect
>>>>>> part-00003-c33854c8-a8b6-4315-bf51-20198ce0ba62-c000.snappy.parquet
>>>>>>
>>>>>> Now this is the output
>>>>>>
>>>>>> ############ file meta data ############
>>>>>> created_by: parquet-mr version 1.12.3 (build
>>>>>> f8dced182c4c1fbdec6ccb3185537b5a01e6ed6b)
>>>>>> num_columns: 2
>>>>>> num_rows: 1
>>>>>> num_row_groups: 1
>>>>>> format_version: 1.0
>>>>>> serialized_size: 563
>>>>>>
>>>>>>
>>>>>> ############ Columns ############
>>>>>> name
>>>>>> age
>>>>>>
>>>>>> ############ Column(name) ############
>>>>>> name: name
>>>>>> path: name
>>>>>> max_definition_level: 1
>>>>>> max_repetition_level: 0
>>>>>> physical_type: BYTE_ARRAY
>>>>>> logical_type: String
>>>>>> converted_type (legacy): UTF8
>>>>>> compression: SNAPPY (space_saved: -5%)
>>>>>>
>>>>>> ############ Column(age) ############
>>>>>> name: age
>>>>>> path: age
>>>>>> max_definition_level: 1
>>>>>> max_repetition_level: 0
>>>>>> physical_type: INT64
>>>>>> logical_type: None
>>>>>> converted_type (legacy): NONE
>>>>>> compression: SNAPPY (space_saved: -5%)
>>>>>>
>>>>>> File Information:
>>>>>>
>>>>>>    - format_version: 1.0: This line explicitly states that the
>>>>>>    format version of the Parquet file is 1.0, which corresponds to 
>>>>>> Parquet
>>>>>>    version 1.
>>>>>>    - created_by: parquet-mr version 1.12.3: While this doesn't
>>>>>>    directly specify the format version, itt is accepted that older 
>>>>>> versions of
>>>>>>    parquet-mr like 1.12.3 typically write Parquet version 1 files.
>>>>>>
>>>>>> Since in this case Spark 3.4 is capable of reading both versions (1
>>>>>> and 2), you don't  necessarily need to modify your Spark code to access
>>>>>> this file. However, if you want to create Parquet files in version 2 
>>>>>> using
>>>>>> Spark, you might need to consider additional changes like excluding
>>>>>> parquet-mr or upgrading Parquet libraries and do a custom build.of Spark.
>>>>>> However, taking klaws of diminishing returns, I would not advise that
>>>>>> either.. You can ofcourse usse gzip for compression that may be more
>>>>>> suitable for your needs.
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>> Mich Talebzadeh,
>>>>>> Technologist | Solutions Architect | Data Engineer  | Generative AI
>>>>>> London
>>>>>> United Kingdom
>>>>>>
>>>>>>
>>>>>>    view my Linkedin profile
>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>
>>>>>>
>>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>>
>>>>>>
>>>>>>
>>>>>> *Disclaimer:* The information provided is correct to the best of my
>>>>>> knowledge but of course cannot be guaranteed . It is essential to note
>>>>>> that, as with any advice, quote "one test result is worth one-thousand
>>>>>> expert opinions (Werner
>>>>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von Braun
>>>>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".
>>>>>>
>>>>>>
>>>>>> On Tue, 16 Apr 2024 at 15:00, Prem Sahoo <prem.re...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hello Community,
>>>>>>> Could any of you shed some light on below questions please ?
>>>>>>> Sent from my iPhone
>>>>>>>
>>>>>>> On Apr 15, 2024, at 9:02 PM, Prem Sahoo <prem.re...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> 
>>>>>>> Any specific reason spark does not support or community doesn't want
>>>>>>> to go to Parquet V2 , which is more optimized and read and write is too
>>>>>>> much faster (form other component which I am using)
>>>>>>>
>>>>>>> On Mon, Apr 15, 2024 at 7:55 PM Ryan Blue <b...@tabular.io> wrote:
>>>>>>>
>>>>>>>> Spark will read data written with v2 encodings just fine. You just
>>>>>>>> don't need to worry about making Spark produce v2. And you should 
>>>>>>>> probably
>>>>>>>> also not produce v2 encodings from other systems.
>>>>>>>>
>>>>>>>> On Mon, Apr 15, 2024 at 4:37 PM Prem Sahoo <prem.re...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> oops but so spark does not support parquet V2  atm ?, as We have a
>>>>>>>>> use case where we need parquet V2 as  one of our components uses 
>>>>>>>>> Parquet V2
>>>>>>>>> .
>>>>>>>>>
>>>>>>>>> On Mon, Apr 15, 2024 at 7:09 PM Ryan Blue <b...@tabular.io> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Prem,
>>>>>>>>>>
>>>>>>>>>> Parquet v1 is the default because v2 has not been finalized and
>>>>>>>>>> adopted by the community. I highly recommend not using v2 encodings 
>>>>>>>>>> at this
>>>>>>>>>> time.
>>>>>>>>>>
>>>>>>>>>> Ryan
>>>>>>>>>>
>>>>>>>>>> On Mon, Apr 15, 2024 at 3:05 PM Prem Sahoo <prem.re...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> I am using spark 3.2.0 . but my spark package comes with
>>>>>>>>>>> parquet-mr 1.2.1 which writes in parquet version 1 not version 
>>>>>>>>>>> version 2:(.
>>>>>>>>>>> so I was looking how to write in Parquet version2 ?
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Apr 15, 2024 at 5:05 PM Mich Talebzadeh <
>>>>>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Sorry you have a point there. It was released in version 3.00.
>>>>>>>>>>>> What version of spark are you using?
>>>>>>>>>>>>
>>>>>>>>>>>> Technologist | Solutions Architect | Data Engineer  |
>>>>>>>>>>>> Generative AI
>>>>>>>>>>>> London
>>>>>>>>>>>> United Kingdom
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>    view my Linkedin profile
>>>>>>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> *Disclaimer:* The information provided is correct to the best
>>>>>>>>>>>> of my knowledge but of course cannot be guaranteed . It is 
>>>>>>>>>>>> essential to
>>>>>>>>>>>> note that, as with any advice, quote "one test result is worth
>>>>>>>>>>>> one-thousand expert opinions (Werner
>>>>>>>>>>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von Braun
>>>>>>>>>>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, 15 Apr 2024 at 21:33, Prem Sahoo <prem.re...@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Thank you so much for the info! But do we have any release
>>>>>>>>>>>>> notes where it says spark2.4.0 onwards supports parquet version 
>>>>>>>>>>>>> 2. I was
>>>>>>>>>>>>> under the impression Spark3.0 onwards it started supporting .
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Apr 15, 2024 at 4:28 PM Mich Talebzadeh <
>>>>>>>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Well if I am correct, Parquet version 2 support was
>>>>>>>>>>>>>> introduced in Spark version 2.4.0. Therefore, any version of 
>>>>>>>>>>>>>> Spark starting
>>>>>>>>>>>>>> from 2.4.0 supports Parquet version 2. Assuming that you are 
>>>>>>>>>>>>>> using Spark
>>>>>>>>>>>>>> version  2.4.0 or later, you should be able to take advantage of 
>>>>>>>>>>>>>> Parquet
>>>>>>>>>>>>>> version 2 features.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> HTH
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Mich Talebzadeh,
>>>>>>>>>>>>>> Technologist | Solutions Architect | Data Engineer  |
>>>>>>>>>>>>>> Generative AI
>>>>>>>>>>>>>> London
>>>>>>>>>>>>>> United Kingdom
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>    view my Linkedin profile
>>>>>>>>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> *Disclaimer:* The information provided is correct to the
>>>>>>>>>>>>>> best of my knowledge but of course cannot be guaranteed . It is 
>>>>>>>>>>>>>> essential
>>>>>>>>>>>>>> to note that, as with any advice, quote "one test result is
>>>>>>>>>>>>>> worth one-thousand expert opinions (Werner
>>>>>>>>>>>>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von Braun
>>>>>>>>>>>>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, 15 Apr 2024 at 20:53, Prem Sahoo <
>>>>>>>>>>>>>> prem.re...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thank you for the information!
>>>>>>>>>>>>>>> I can use any version of parquet-mr to produce parquet file.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> regarding 2nd question .
>>>>>>>>>>>>>>> Which version of spark is supporting parquet version 2?
>>>>>>>>>>>>>>> May I get the release notes where parquet versions are
>>>>>>>>>>>>>>> mentioned ?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Apr 15, 2024 at 2:34 PM Mich Talebzadeh <
>>>>>>>>>>>>>>> mich.talebza...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Parquet-mr is a Java library that provides functionality
>>>>>>>>>>>>>>>> for working with Parquet files with hadoop. It is therefore  
>>>>>>>>>>>>>>>> more geared
>>>>>>>>>>>>>>>> towards working with Parquet files within the Hadoop ecosystem,
>>>>>>>>>>>>>>>> particularly using MapReduce jobs. There is no definitive way 
>>>>>>>>>>>>>>>> to check
>>>>>>>>>>>>>>>> exact compatible versions within the library itself. However, 
>>>>>>>>>>>>>>>> you can have
>>>>>>>>>>>>>>>> a look at this
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> https://github.com/apache/parquet-mr/blob/master/CHANGES.md
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> HTH
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Mich Talebzadeh,
>>>>>>>>>>>>>>>> Technologist | Solutions Architect | Data Engineer  |
>>>>>>>>>>>>>>>> Generative AI
>>>>>>>>>>>>>>>> London
>>>>>>>>>>>>>>>> United Kingdom
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>    view my Linkedin profile
>>>>>>>>>>>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> *Disclaimer:* The information provided is correct to the
>>>>>>>>>>>>>>>> best of my knowledge but of course cannot be guaranteed . It 
>>>>>>>>>>>>>>>> is essential
>>>>>>>>>>>>>>>> to note that, as with any advice, quote "one test result is
>>>>>>>>>>>>>>>> worth one-thousand expert opinions (Werner
>>>>>>>>>>>>>>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>Von Braun
>>>>>>>>>>>>>>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Mon, 15 Apr 2024 at 18:59, Prem Sahoo <
>>>>>>>>>>>>>>>> prem.re...@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hello Team,
>>>>>>>>>>>>>>>>> May I know how to check which version of parquet is
>>>>>>>>>>>>>>>>> supported by parquet-mr 1.2.1 ?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Which version of parquet-mr is supporting parquet version
>>>>>>>>>>>>>>>>> 2 (V2) ?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Which version of spark is supporting parquet version 2?
>>>>>>>>>>>>>>>>> May I get the release notes where parquet versions are
>>>>>>>>>>>>>>>>> mentioned ?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Ryan Blue
>>>>>>>>>> Tabular
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> Ryan Blue
>>>>>>>> Tabular
>>>>>>>>
>>>>>>>
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>>
>>
>
> --
> Bjørn Jørgensen
> Vestre Aspehaug 4, 6010 Ålesund
> Norge
>
> +47 480 94 297
>

Re: Which version of spark version supports parquet version 2 ?

Reply via email to