If you still want to write your production data in Parquet V2, which,
again, is not a finalized format and is therefore NOT recommended, you can
override parquet.writer.version (
https://github.com/apache/parquet-mr/blob/f51ed41ded4d91c18fc4eaa827664bc3a02b18f3/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java#L142)
in the latest parquet-mr and set it to output the V2 format.

Again, both the Spark dev list and Parquet dev list have warned against
this, so I'd recommend you proceed with caution.
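For what it's worth, a minimal sketch of what that override might look like
when submitting a Spark job (again, not recommended for production;
PARQUET_2_0 is the writer-version value defined in parquet-mr's
ParquetProperties, and your_job.py is just a placeholder):

```shell
# Sketch only, NOT recommended for production data:
# forward the parquet-mr writer property through Spark's Hadoop
# configuration so the Parquet writer emits the (unfinalized) V2 format.
spark-submit \
  --conf spark.hadoop.parquet.writer.version=PARQUET_2_0 \
  your_job.py
```

The same property can also be set per write via the DataFrameWriter, e.g.
.option("parquet.writer.version", "PARQUET_2_0"), with the same caveats.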



<[email protected]>


On Sun, Apr 21, 2024 at 6:50 PM Prem Sahoo <[email protected]> wrote:

> Hello Team,
> Do you have any clue in which version of the parquet-mr jar the Parquet
> V2 encoding code is available?
>
> On Sun, Apr 21, 2024 at 6:21 PM Prem Sahoo <[email protected]> wrote:
>
>> Thanks, Vinoo, for the valuable information.
>>
>> On Sat, Apr 20, 2024 at 5:07 PM Vinoo Ganesh <[email protected]>
>> wrote:
>>
>>> Hi Prem - Maybe I can help clarify to the best of my knowledge. Parquet
>>> V2 as a standard isn't finalized just yet, meaning there is no formal,
>>> *finalized* "contract" that specifies what it means to write data in the
>>> V2 format. The discussions about what the final V2 standard may be are
>>> still in progress and evolving.
>>>
>>> That being said, because V2 code does exist (though unfinalized), there
>>> are
>>> clients / tools that are writing data in the un-finalized V2 format, as
>>> seems to be the case with Dremio.
>>>
>>> Now, as the comment you quoted said, you can have Spark write V2 files,
>>> but it's worth being mindful that V2 is a moving target and can (and
>>> likely will) change. You can override parquet.writer.version to specify
>>> your desired version, but it can be dangerous to produce data in a
>>> moving-target format. For example, say you write a large amount of data
>>> in Parquet V2, and the community then decides to make a breaking change
>>> (which is completely fine and allowed, since V2 isn't finalized). You
>>> are now left having to deal with a potentially large and complicated
>>> file-format migration. That's why writing files in Parquet V2 is not
>>> recommended just yet.
>>>
>>>
>>>
>>> <[email protected]>
>>>
>>>
>>> On Wed, Apr 17, 2024 at 3:47 PM Prem Sahoo <[email protected]> wrote:
>>>
>>> > Hello Team,
>>> > I am working on different products such as Spark and Dremio.
>>> >
>>> > Dremio is able to write and read Parquet V2, and thanks to this
>>> > upgrade it is working faster than with Parquet V1 files.
>>> >
>>> > In the case of Spark, it still defaults to Parquet V1, and when I
>>> > checked with the Spark community they told me the Parquet community
>>> > isn't recommending Parquet V2.
>>> >
>>> > "Prem, as I said earlier, v2 is not a finalized spec so you should
>>> > not use it. That's why it is not the default. You can get Spark to
>>> > write v2 files, but it isn't recommended by the Parquet community."
>>> >
>>> > Please advise.
>>> >
>>>
>>
