Hi Prem,

Sorry for not answering your original question. I wanted to make it clear
that "v1" and "v2" are not well-defined behaviors. In parquet-mr, if you set
WriterVersion.PARQUET_2_0, the writer may use the delta encodings
(DELTA_BINARY_PACKED, DELTA_LENGTH_BYTE_ARRAY, DELTA_BYTE_ARRAY) for the
related types, but dictionary encoding takes precedence, so such a file may
not contain any of these encodings.

A more reliable indicator is the page header. With
WriterVersion.PARQUET_2_0, PageHeaderV2 is used for data pages. This is
also recorded in the footer, in the EncodingStats of each column chunk. You
can use the parquet-cli tool to check this:
parquet-cli footer {parquet file} | grep usesV2Pages
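
To illustrate what that flag reflects, here is a minimal Python sketch, not
the parquet-mr implementation. It assumes the per-column encoding stats have
already been read from the footer into plain tuples; PageEncodingStat,
uses_v2_pages, uses_delta_encoding and the sample values are hypothetical
names made up for this example. The PageType numbers follow the enum in
parquet.thrift.

```python
from enum import IntEnum
from typing import Iterable, NamedTuple


class PageType(IntEnum):
    # Numeric values follow the PageType enum in parquet.thrift.
    DATA_PAGE = 0
    INDEX_PAGE = 1
    DICTIONARY_PAGE = 2
    DATA_PAGE_V2 = 3


class PageEncodingStat(NamedTuple):
    # Hypothetical stand-in for one encoding-stats entry of a column chunk.
    page_type: PageType
    encoding: str
    count: int


DELTA_ENCODINGS = {
    "DELTA_BINARY_PACKED",
    "DELTA_LENGTH_BYTE_ARRAY",
    "DELTA_BYTE_ARRAY",
}


def uses_v2_pages(stats: Iterable[PageEncodingStat]) -> bool:
    """True if at least one data page of the chunk is a DATA_PAGE_V2."""
    return any(
        s.page_type == PageType.DATA_PAGE_V2 and s.count > 0 for s in stats
    )


def uses_delta_encoding(stats: Iterable[PageEncodingStat]) -> bool:
    """True if any page of the chunk uses one of the delta encodings."""
    return any(s.encoding in DELTA_ENCODINGS and s.count > 0 for s in stats)


# Illustrative data only: a chunk written with V2 pages where dictionary
# encoding took precedence, so no delta encoding shows up.
v2_dictionary_chunk = [
    PageEncodingStat(PageType.DICTIONARY_PAGE, "PLAIN", 1),
    PageEncodingStat(PageType.DATA_PAGE_V2, "RLE_DICTIONARY", 4),
]
# Illustrative data only: a chunk written with V1 data pages.
v1_chunk = [PageEncodingStat(PageType.DATA_PAGE, "PLAIN", 3)]

print(uses_v2_pages(v2_dictionary_chunk), uses_delta_encoding(v2_dictionary_chunk))
print(uses_v2_pages(v1_chunk), uses_delta_encoding(v1_chunk))
```

The first sample shows why the page type is the better signal: the chunk
uses V2 data pages even though none of the delta encodings appear in it.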

Cheers,
Gabor

On Thu, Apr 25, 2024 at 22:16, Prem Sahoo <prem.re...@gmail.com> wrote:

> Yes, I have read through all the information you provided,
> but how do we differentiate between Parquet files written through V2 and
> V1? No one in the community seems to have a clear answer, which is a bit
> astonishing.
>
> If anyone knows, it would be highly appreciated.
>
>
>
> On Thu, Apr 25, 2024 at 10:32 AM Gábor Szádovszky <ga...@apache.org>
> wrote:
>
> > I am not sure what "Parquet community V2 is not final yet" means. We are
> > now at parquet-format 2.10.0. The current parquet-mr supports most (if
> > not all) of its features. I agree that the current mechanism in parquet-mr
> > of setting the writer version to PARQUET_1_0 or PARQUET_2_0 is unclear and
> > misleading. We should work on this from the format point of view as well.
> > BUT, there is no such thing as "finalizing" parquet-format V2.
> >
> > AFAIK Spark does support setting the writer version since it uses
> > parquet-mr. Try the hadoop configuration "parquet.writer.version" set to
> > "v2". Of course, it also supports reading these files by default.
> >
> > On Wed, Apr 24, 2024 at 14:05, Prem Sahoo <prem.re...@gmail.com> wrote:
> >
> > > Hello Gang/Team,
> > > Thanks for your reply.
> > > As per your reply, there is nothing to differentiate whether a Parquet
> > > file was written through V2 or V1, which is very confusing.
> > > We should have some flag or tag that differentiates Parquet written in
> > > V1 from V2. Then, if the reading engine doesn't support V2, we can be
> > > sure not to feed it V2 Parquet.
> > >
> > > A few tools/products now use Parquet V2 for both reading and writing,
> > > but *Apache Spark does not support writing with V2 encoding because, as
> > > per the Parquet community, V2 is not final yet*.
> > >
> > > Is there a date by which the parquet-mr jar will have Parquet V2 writing
> > > functionality, so that Spark can adopt it?
> > >
> > > On Wed, Apr 24, 2024 at 1:28 AM Gang Wu <ust...@gmail.com> wrote:
> > >
> > > > As I have said in another thread, Parquet V2 is a concept that
> > > > covers a lot of features. FWIW, what is defined in the specs [1] is
> > > > finalized, and some of those features have been implemented in various
> > > > implementations. Any file that contains one or more of those features
> > > > can be considered v2, but the community has never defined a formal
> > > > approach to distinguish between v1 and v2. Parquet does have a field
> > > > in the footer thrift definition to mark the file version [2]. However,
> > > > not all implementations populate it correctly, and some engines will
> > > > even throw if the version is not 1. To avoid confusion, I strongly
> > > > suggest not using any v2 feature in your case unless you are 100% sure
> > > > that all your tools support the v2 feature set you have enabled.
> > > >
> > > > [1] https://github.com/apache/parquet-format/blob/master/CHANGES.md
> > > > [2] https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L1111
> > > >
> > > > Best,
> > > > Gang
> > > >
> > > > On Wed, Apr 24, 2024 at 10:29 AM Prem Sahoo <prem.re...@gmail.com> wrote:
> > > >
> > > > > Can anyone please shed some light on this?
> > > > > Sent from my iPhone
> > > > >
> > > > > > On Apr 23, 2024, at 4:30 PM, Prem Sahoo <prem.re...@gmail.com> wrote:
> > > > > >
> > > > > > Hello Team,
> > > > > > How can we find out whether a Parquet file is V1 or V2?
> > > > > >
> > > > > > Is there any tag/identifier that says whether a Parquet file was
> > > > > > created through V2 or V1?
> > > > > >
> > > > > > Are there specific properties that must be set for a file to be
> > > > > > written in Parquet V2?
> > > > > > Sent from my iPhone
> > > > >
> > > >
> > >
> >
>
