I am not sure what "Parquet community V2 is not final yet" means. We are
now at parquet-format 2.10.0, and the current parquet-mr supports most (if
not all) of its features. I agree that the current mechanism in parquet-mr
of setting the writer version to PARQUET_1_0 or PARQUET_2_0 is unclear and
misleading, and we should work on this from the format point of view as
well. BUT, there is no such thing as "finalizing" parquet-format V2.

AFAIK Spark does support setting the writer version, since it uses
parquet-mr. Try setting the Hadoop configuration "parquet.writer.version"
to "v2". Of course, Spark also supports reading these files by default.

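On the original question of telling what a file declares: the footer
"version" field mentioned in the quoted reply below can be read without any
Parquet library. What follows is only a rough Python sketch, not something
taken from parquet-mr. It assumes an unencrypted file (one that ends with
the "PAR1" magic) whose footer is Thrift compact encoded with "version"
written as its first field; read_declared_version is a name made up for
illustration. As noted below, not every writer populates this field
correctly, so treat the result as a hint, not as proof of which features a
file uses.

```python
import struct

MAGIC = b"PAR1"  # trailer of an unencrypted Parquet file


def read_declared_version(path):
    """Return the integer stored in the footer's `version` field.

    Assumption: the footer is Thrift compact encoded and `version`
    (field id 1, type i32) is the first field written.
    """
    with open(path, "rb") as f:
        f.seek(0, 2)  # jump to the end to learn the file size
        size = f.tell()
        if size < 12:  # leading magic + footer length + trailing magic
            raise ValueError("too small to be a Parquet file")
        f.seek(size - 8)
        tail = f.read(8)
        if tail[4:] != MAGIC:
            raise ValueError("no PAR1 trailer (encrypted footer or not Parquet)")
        (footer_len,) = struct.unpack("<I", tail[:4])  # little-endian length
        if footer_len > size - 12:
            raise ValueError("footer length exceeds file size")
        f.seek(size - 8 - footer_len)
        head = f.read(min(footer_len, 6))  # 1 header byte + up to 5 varint bytes

    # Compact protocol field header: (field id delta << 4) | type, i32 is type 5.
    if not head or head[0] != 0x15:
        raise ValueError("footer does not start with field 1 of type i32")

    # Decode the unsigned varint that follows the field header.
    value, shift = 0, 0
    for byte in head[1:]:
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    else:
        raise ValueError("truncated varint")

    # Undo the zigzag encoding used for signed integers.
    return (value >> 1) ^ -(value & 1)
```

Calling read_declared_version("part-00000.parquet") (a hypothetical file
name) returns whatever integer the writer stored there.
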
Prem Sahoo <[email protected]> wrote (on Wed, Apr 24, 2024, 14:05):

> Hello Gang/Team,
> Thanks for your reply.
> As per your reply, there is nothing to differentiate whether a Parquet
> file was written with V2 or V1, which is very confusing.
> We should have some flag or tag which differentiates Parquet written in V1
> or V2. Then, if the reading engine doesn't support V2, we can be sure
> that we shouldn't feed it V2 Parquet.
>
> A few tools/products now use Parquet V2 for both reading and writing, but
> *Apache Spark is not supporting write through V2 encoding as per Parquet
> community V2 is not final yet*.
>
> Do we have any date when the parquet-mr jar will have Parquet V2 writing
> functionality, so that Spark can adhere to it?
>
> On Wed, Apr 24, 2024 at 1:28 AM Gang Wu <[email protected]> wrote:
>
> > As I have said in another thread, Parquet V2 is a concept which contains
> > a lot of features. FWIW, what is defined in the specs [1] is finalized,
> > and some of those features have been implemented in various
> > implementations. Any file that contains one or more of those features
> > can be considered v2, but the community has never defined a formal
> > approach to distinguish between v1 and v2. Parquet does have a field in
> > the footer thrift definition to mark the file version [2]. However, not
> > all implementations populate it correctly, and some engines will even
> > throw if the version is not 1. To avoid confusion, I strongly suggest
> > not using any v2 feature in your case unless you are 100% sure that all
> > your tools support the v2 feature set you have enabled.
> >
> > [1] https://github.com/apache/parquet-format/blob/master/CHANGES.md
> > [2] https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L1111
> >
> > Best,
> > Gang
> >
> > On Wed, Apr 24, 2024 at 10:29 AM Prem Sahoo <[email protected]> wrote:
> >
> > > Can anyone please shed some light on this?
> > > Sent from my iPhone
> > >
> > > > On Apr 23, 2024, at 4:30 PM, Prem Sahoo <[email protected]> wrote:
> > > >
> > > > Hello Team,
> > > > How can I find out whether a Parquet file is V1 or V2?
> > > >
> > > > Do we have any tag/identifier which says whether a Parquet file was
> > > > created with V2 or V1?
> > > >
> > > > Are there any specific properties that need to be set so that a
> > > > Parquet file is written as Parquet V2?
> > > > Sent from my iPhone
> > >
> >
>
