Hi Vinoo,

AFAIK parquet-mr is currently the richest implementation in terms of
parquet-format features, making it the "reference" implementation. We do
not need a parquet-mr 2.0 to support more parquet-format features.

I agree we need to address the "feature levels" somehow, to make the
implementations' lives easier and also to increase cross-compatibility
between them. But I don't think this should be started from parquet-mr
but from parquet-format. There are a couple of features that are
must-haves for reading Parquet files created by any implementation (e.g.
encodings, compression etc.). But there are others that are not required
to read the file (e.g. support for statistics). Also, an implementation
might deliberately choose not to implement a feature because it is not
required by the related system (e.g. encryption). So, there are many
questions to handle.

Vinoo Ganesh <[email protected]> ezt írta (időpont: 2024. ápr. 25.,
Cs, 16:21):

> Hi Gabor - that's a good point. When I say unstable, I don't mean the code
> itself, I mean universal "compatibility" of the files produced (ex. Prem's
> case of compatibility of Parquet files produced/consumed by Spark +
> Dremio).
> <[email protected]>
>
> In the last Parquet meeting, I brought up discussing / planning for a
> parquet-mr 2.0 release which I think should at least establish a parquet-mr
> release as the "formal implementation" of the standard (even if it's mostly
> a vanity release).
>
> On Thu, Apr 25, 2024 at 9:36 AM Gábor Szádovszky <[email protected]> wrote:
>
> > Hey,
> >
> > I don't think we should call Parquet v2.x features unstable. Since they
> > were released officially, we maintain backward compatibility. So, from
> > the Parquet format's point of view, these features are stable.
> > It is another question whether a Parquet implementation supports all of
> > these features or only a subset of them. I think parquet-mr and
> > parquet-cpp (Arrow) are keeping up well with these features. Other
> > implementations (e.g. Impala) might be lagging behind.
> > I agree it is very hard for the implementations to implement everything
> > or choose what is really required. There was an initiative a couple of
> > years ago that I started but failed to finish. See
> > https://github.com/apache/parquet-format/pull/164 for details.
> >
> > I think the main question is which systems you create your Parquet
> > files for. If you can list these systems (e.g. Spark, Hive etc.), you
> > can validate whether the files work with them properly. In many cases
> > parquet-mr or parquet-cpp is the actual implementation behind them. If
> > you want to create Parquet files for any system to read, you should not
> > use newer features. (The encodings we are talking about as v2 encodings
> > are 10+ years old in the Parquet spec.) But keep in mind that in many
> > cases it is not that simple. For example, compression codecs might be
> > supported in a system or not, independently of the actual Parquet
> > implementation. For parquet-mr, some codecs require the related native
> > libraries to be installed.
> >
> > Cheers,
> > Gabor
> >
> > Prem Sahoo <[email protected]> ezt írta (időpont: 2024. ápr. 24.,
> Sze,
> > 20:10):
> >
> > > Hello Vinoo,
> > > Thanks for your assistance. The pyarrow folks are using Parquet V2
> > > even though it is not recommended. I don't want to make any mess, so
> > > I am just checking with all the different groups.
> > >
> > > On Wed, Apr 24, 2024 at 12:31 PM Vinoo Ganesh <[email protected]>
> > > wrote:
> > >
> > > > I'm not sure what you're looking for. A few different folks
> > > > (Ryan/Steve on the Spark list, Wes on the Arrow list, and Gang/me on
> > > > the Parquet list) have said that they wouldn't recommend using the
> > > > Parquet V2 encodings, but you're free to do whatever you want in
> > > > your own data stack, as are the clients who are using Parquet V2.
> > > > Again, I (and others) personally wouldn't recommend storing
> > > > production data in an unstable format, and that's the reason we are
> > > > warning against it.
> > > >
> > > > On Wed, Apr 24, 2024 at 11:47 AM Prem Sahoo <[email protected]>
> > > > wrote:
> > > >
> > > >> Hello Vinoo,
> > > >> Can you please share a link where it says Parquet V2 is not
> > > >> official or not stable for use by third parties?
> > > >>
> > > >>
> > > >> On Wed, Apr 24, 2024 at 11:28 AM Vinoo Ganesh
> > > >> <[email protected]> wrote:
> > > >>
> > > >>> Hi Prem, Wes' comment on the thread you posted on the arrow dev
> > > >>> list should clear up your confusion:
> > > >>> https://lists.apache.org/thread/72qwr66wf3xyrl5cozgojz88ct23qzxx.
> > > >>> There is a difference between the "standard" itself
> > > >>> (parquet-format) and the implementation (parquet-mr, etc...).
> > > >>>
> > > >>> Parquet-format (https://github.com/apache/parquet-format)
> > > >>> contains mostly just the docs and thrift definition, now that a
> > > >>> PR to clean up the remaining deprecated code was just merged.
> > > >>> Releases of just this format, which again is mostly just docs,
> > > >>> are what Gang was referring to in [2].
> > > >>>
> > > >>> In yesterday's meeting, we started conversations about how a
> > > >>> Parquet 2.0 release may look. As these conversations progress,
> > > >>> the dev list will be kept updated.
> > > >>>
> > > >>>
> > > >>> On Wed, Apr 24, 2024 at 11:10 AM Prem Sahoo <[email protected]>
> > > >>> wrote:
> > > >>>
> > > >>>> Hello Vinoo/Team,
> > > >>>> As per the pyarrow team, they don't see any concern; please
> > > >>>> check below. Please let us know *where it says Parquet V2 is
> > > >>>> not official*.
> > > >>>>
> > > >>>> "> *As per the Apache Parquet community, Parquet V2 is not
> > > >>>> > final yet, so it is not official. They are advising not to
> > > >>>> > use Parquet V2 for writing (though the code is available).*
> > > >>>>
> > > >>>> This would be news to me. Parquet releases are listed (by the
> > > >>>> parquet community) at [1]
> > > >>>>
> > > >>>> The vote to release parquet 2.10 is here: [2]
> > > >>>>
> > > >>>> *Neither of these links mentions anything about this being an
> > > >>>> experimental, unofficial, or non-finalized release.*
> > > >>>>
> > > >>>> I understand your concern. I believe your quotes are coming from
> > > >>>> your discussion on the parquet mailing list here [3]. This
> > > >>>> communication is unfortunate and confusing to me as well.
> > > >>>>
> > > >>>> [1] https://parquet.apache.org/blog/
> > > >>>> [2] https://lists.apache.org/thread/fdf1zz0f3xzz5zpvo6c811xjswhm1zy6
> > > >>>> [3] https://lists.apache.org/thread/4nzroc68czwxnp0ndqz15kp1vhcd7vg3"
> > > >>>>
> > > >>>>
> > > >>>> On Mon, Apr 22, 2024 at 4:56 PM Prem Sahoo <[email protected]>
> > > >>>> wrote:
> > > >>>>
> > > >>>>> Hello Vinoo/Team,
> > > >>>>> I was going through pyarrow and they have started using V2 as
> > > >>>>> the default. Shouldn't they avoid it, since it is not official?
> > > >>>>>
> > > >>>>>
> > > >>>>>
> > > >>>>> https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html#pyarrow.parquet.write_table
> > > >>>>>
> > > >>>>> version{“1.0”, “2.4”, “2.6”}, default “2.6”
> > > >>>>>
> > > >>>>> Determine which Parquet logical types are available for use,
> > > >>>>> whether the reduced set from the Parquet 1.x.x format or the
> > > >>>>> expanded logical types added in later format versions. Files
> > > >>>>> written with version=’2.4’ or ‘2.6’ may not be readable in all
> > > >>>>> Parquet implementations, so version=’1.0’ is likely the choice
> > > >>>>> that maximizes file compatibility. UINT32 and some logical
> > > >>>>> types are only available with version ‘2.4’. Nanosecond
> > > >>>>> timestamps are only available with version ‘2.6’. Other
> > > >>>>> features such as compression algorithms or the new serialized
> > > >>>>> data page format must be enabled separately (see ‘compression’
> > > >>>>> and ‘data_page_version’).
> > > >>>>>
> > > >>>>
> > >
> >
>
