If you still want to write your production data in Parquet V2 — which, again, is not a finalized format and is therefore NOT recommended — you can override parquet.writer.version ( https://github.com/apache/parquet-mr/blob/f51ed41ded4d91c18fc4eaa827664bc3a02b18f3/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java#L142 ) in the latest parquet-mr and set it to output the V2 format.
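As a concrete illustration (not a recommendation), a minimal PySpark sketch of that override might look like the following. The app name and output path are made up for the example; the `spark.hadoop.` prefix forwards the key into the Hadoop Configuration that parquet-mr's ParquetOutputFormat reads, and `parquet.writer.version` accepts "v1"/"v2" (aliases for PARQUET_1_0/PARQUET_2_0):

```python
from pyspark.sql import SparkSession

# Build a session that passes parquet.writer.version=v2 down to parquet-mr.
# The spark.hadoop.* prefix copies the property into the Hadoop Configuration
# that ParquetOutputFormat consults when writing files.
spark = (
    SparkSession.builder
    .appName("parquet-v2-demo")  # hypothetical app name
    .config("spark.hadoop.parquet.writer.version", "v2")
    .getOrCreate()
)

df = spark.range(1000)  # toy data for illustration

# Files written here will use the (unfinalized) V2 page format,
# with all the caveats discussed in this thread.
df.write.mode("overwrite").parquet("/tmp/parquet_v2_demo")
```

The same thing can be done per job with `spark-submit --conf spark.hadoop.parquet.writer.version=v2`, without touching application code.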
Again, both the Spark dev list and the Parquet dev list have warned against this, so I'd recommend you proceed with caution.

On Sun, Apr 21, 2024 at 6:50 PM Prem Sahoo <[email protected]> wrote:

> Hello Team,
> Do you have any clue about which version of the parquet-mr jar the Parquet V2 encoding code is available in?
>
> On Sun, Apr 21, 2024 at 6:21 PM Prem Sahoo <[email protected]> wrote:
>
>> Thanks Vinoo for the valuable information.
>>
>> On Sat, Apr 20, 2024 at 5:07 PM Vinoo Ganesh <[email protected]> wrote:
>>
>>> Hi Prem - Maybe I can help clarify to the best of my knowledge. Parquet V2 as a standard isn't finalized just yet, meaning there is no formal, *finalized* "contract" that specifies what it means to write data in the V2 format. The discussions about what the final V2 standard may look like are still in progress and evolving.
>>>
>>> That being said, because V2 code does exist (though unfinalized), there are clients / tools that are writing data in the unfinalized V2 format, as seems to be the case with Dremio.
>>>
>>> Now, as the comment you quoted said, you can have Spark write V2 files, but it's worth being mindful of the fact that V2 is a moving target and can (and likely will) change. You can override parquet.writer.version to specify your desired version, but it can be dangerous to produce data in a moving-target format. For example, let's say you write a bunch of data in Parquet V2, and then the community decides to make a breaking change (which is completely fine / allowed since V2 isn't finalized). You are now left having to deal with a potentially large and complicated file-format update. That's why it's not recommended to write files in Parquet V2 just yet.
>>>
>>> On Wed, Apr 17, 2024 at 3:47 PM Prem Sahoo <[email protected]> wrote:
>>>
>>> > Hello Team,
>>> > I am working with different products such as Spark and Dremio.
>>> >
>>> > Dremio is able to write and read Parquet V2, and thanks to this upgrade it is working faster than with Parquet V1 files.
>>> >
>>> > In the case of Spark, it still defaults to Parquet V1, and when I checked with the Spark community they told me the Parquet community isn't recommending Parquet V2:
>>> >
>>> > "Prem, as I said earlier, v2 is not a finalized spec so you should not use it. That's why it is not the default. You can get Spark to write v2 files, but it isn't recommended by the Parquet community."
>>> >
>>> > Please advise.
