Hey,

I don't think we should call Parquet v2.x features unstable. Since they
were released officially, we maintain backward compatibility. So, from
the Parquet format's point of view, these features are stable.
Whether a Parquet implementation supports all of these features or only
a subset of them is another question. I think parquet-mr and
parquet-cpp (Arrow) are keeping up well with these features. Other
implementations (e.g. Impala) might be lagging behind.
I agree it is very hard for the implementations to implement everything or
to choose what is really required. A couple of years ago I started an
initiative to address this but failed to finish it. See
https://github.com/apache/parquet-format/pull/164 for details.

I think the main question is which systems you are creating your Parquet
files for. If you can list these systems (e.g. Spark, Hive, etc.), you can
validate that the files work properly with them. In many cases
parquet-mr or parquet-cpp is the actual implementation behind them. If you
want to create Parquet files that any system can read, you should not use
newer features. (The encodings we are talking about as v2 encodings have
been in the Parquet spec for 10+ years.) But keep in mind that in many
cases it is not that simple. For example, a compression codec might or
might not be supported in a system independently of the actual Parquet
implementation. parquet-mr, for instance, expects the related native
libraries to be installed for some codecs.

Cheers,
Gabor

Prem Sahoo <[email protected]> wrote (on Wed, Apr 24, 2024,
20:10):

> Hello Vinoo,
> Thanks for your assistance. Pyarrow folks are using Parquet V2 though it
> is not recommended. I don't want to make any mess, so I am just checking
> with all the different groups.
>
> On Wed, Apr 24, 2024 at 12:31 PM Vinoo Ganesh <[email protected]>
> wrote:
>
> > I'm not sure what you're looking for. A few different folks (Ryan/Steve
> > on the Spark list, Wes on the Arrow list, and Gang/me on the Parquet
> > list) have said that they wouldn't recommend using the Parquet V2
> > encodings, but you're free to do whatever you want in your own data
> > stack, as are the clients who are using Parquet V2. Again, I (and
> > others) personally wouldn't recommend storing production data in an
> > unstable format, and that's the reason we are warning against it.
> >
> > On Wed, Apr 24, 2024 at 11:47 AM Prem Sahoo <[email protected]>
> > wrote:
> >
> >> Hello Vinoo,
> >> Can you please share a link where it says Parquet V2 is not official or
> >> not stable for use by third parties?
> >>
> >>
> >> On Wed, Apr 24, 2024 at 11:28 AM Vinoo Ganesh <[email protected]>
> >> wrote:
> >>
> >>> Hi Prem, Wes' comment on the thread you posted on the arrow dev list
> >>> should clear up your confusion:
> >>> https://lists.apache.org/thread/72qwr66wf3xyrl5cozgojz88ct23qzxx
> >>> There is a difference between the "standard" itself (parquet-format)
> >>> and the implementation (parquet-mr, etc.).
> >>>
> >>> Parquet-format (https://github.com/apache/parquet-format) contains
> >>> mostly just the docs and thrift definition now that a PR to clean up
> >>> the remaining deprecated code was just merged. Releases of just this
> >>> format, which again is mostly just docs, are what Gang was referring
> >>> to in [2].
> >>>
> >>> We just started conversations about how a Parquet 2.0 release may look
> >>> in the meeting yesterday. As these conversations progress, the dev list
> >>> will be kept updated.
> >>>
> >>>
> >>> On Wed, Apr 24, 2024 at 11:10 AM Prem Sahoo <[email protected]>
> >>> wrote:
> >>>
> >>>> Hello Vinoo/Team,
> >>>> According to the pyarrow team, they don't see any concern; please
> >>>> check below.
> >>>> Please let us know *where it says Parquet V2 is not official*.
> >>>>
> >>>> "> *As per Apache Parquet Community Parquet V2 is not final yet so it
> >>>> > is not official. They are advising not to use Parquet V2 for writing
> >>>> > (though code is available).*
> >>>>
> >>>> This would be news to me.  Parquet releases are listed (by the parquet
> >>>> community) at [1]
> >>>>
> >>>> The vote to release parquet 2.10 is here: [2]
> >>>>
> >>>>
> >>>> *Neither of these links mentions anything about this being an
> >>>> experimental, unofficial, or non-finalized release.*
> >>>>
> >>>> I understand your concern.  I believe your quotes are coming from your
> >>>> discussion on the parquet mailing list here [3].  This communication
> >>>> is unfortunate and confusing to me as well.
> >>>>
> >>>> [1] https://parquet.apache.org/blog/
> >>>> [2] https://lists.apache.org/thread/fdf1zz0f3xzz5zpvo6c811xjswhm1zy6
> >>>> [3] https://lists.apache.org/thread/4nzroc68czwxnp0ndqz15kp1vhcd7vg3"
> >>>>
> >>>>
> >>>> On Mon, Apr 22, 2024 at 4:56 PM Prem Sahoo <[email protected]>
> >>>> wrote:
> >>>>
> >>>>> Hello Vinoo/Team,
> >>>>> I was going through pyarrow and they have started using V2 as the
> >>>>> default. Shouldn't they avoid it, since it is not official?
> >>>>>
> >>>>>
> >>>>> https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html#pyarrow.parquet.write_table
> >>>>>
> >>>>> version{“1.0”, “2.4”, “2.6”}, default “2.6”
> >>>>>
> >>>>> Determine which Parquet logical types are available for use, whether
> >>>>> the reduced set from the Parquet 1.x.x format or the expanded logical
> >>>>> types added in later format versions. Files written with version=’2.4’
> >>>>> or ‘2.6’ may not be readable in all Parquet implementations, so
> >>>>> version=’1.0’ is likely the choice that maximizes file compatibility.
> >>>>> UINT32 and some logical types are only available with version ‘2.4’.
> >>>>> Nanosecond timestamps are only available with version ‘2.6’. Other
> >>>>> features such as compression algorithms or the new serialized data
> >>>>> page format must be enabled separately (see ‘compression’ and
> >>>>> ‘data_page_version’).
> >>>>>
> >>>>
>
