Hi Gabor - that's a good point. When I say unstable, I don't mean the code
itself; I mean the universal "compatibility" of the files produced (e.g.
Prem's case of Parquet files produced and consumed by Spark + Dremio).
<vinoo.gan...@gmail.com>

In the last Parquet meeting, I brought up planning for a parquet-mr 2.0
release, which I think should at least establish a parquet-mr release as
the "formal implementation" of the standard (even if it's mostly a vanity
release).

On Thu, Apr 25, 2024 at 9:36 AM Gábor Szádovszky <ga...@apache.org> wrote:

> Hey,
>
> I don't think we should call Parquet v2.x features unstable. Since they
> were released officially, we maintain backward compatibility. So, from
> the Parquet format's point of view, these features are stable.
> Whether a given Parquet implementation supports all of these features or
> only a subset of them is a separate question. I think parquet-mr and
> parquet-cpp (Arrow) are keeping up well with these features. Other
> implementations (e.g. Impala) might be lagging behind.
> I agree it is very hard for the implementations to implement everything or
> to choose what is really required. There was an initiative a couple of
> years ago that I started but failed to finish. See
> https://github.com/apache/parquet-format/pull/164 for details.
>
> I think the main question is which systems you create your Parquet files
> for. If you can list these systems (e.g. Spark, Hive etc.), you can
> validate whether the files work with them properly. In many cases
> parquet-mr or parquet-cpp is the actual implementation behind them. If you
> want to create Parquet files for any system to read, you should not use
> newer features. (The encodings we are talking about as v2 encodings have
> been in the Parquet spec for 10+ years.) But keep in mind that in many
> cases it is not that simple. For example, a compression codec might or
> might not be supported in a system independently of the actual Parquet
> implementation. For parquet-mr, some codecs require the related native
> libraries to be installed.
>
> Cheers,
> Gabor
>
> On Wed, Apr 24, 2024 at 8:10 PM Prem Sahoo <prem.re...@gmail.com> wrote:
>
> > Hello Vinoo,
> > Thanks for your assistance. The pyarrow folks are using Parquet V2 even
> > though it is not recommended. I don't want to make a mess, so I am just
> > checking with all the different groups.
> >
> > On Wed, Apr 24, 2024 at 12:31 PM Vinoo Ganesh <vinoo.gan...@gmail.com>
> > wrote:
> >
> > > I'm not sure what you're looking for. A few different folks (Ryan/Steve
> > > on the Spark list, Wes on the Arrow list, and Gang/me on the Parquet
> > > list) have said that they wouldn't recommend using the Parquet V2
> > > encodings, but you're free to do whatever you want in your own data
> > > stack, as are the clients who are using Parquet V2. Again, I (and
> > > others) personally wouldn't recommend storing production data in an
> > > unstable format, and that's the reason we are warning against it.
> > >
> > > On Wed, Apr 24, 2024 at 11:47 AM Prem Sahoo <prem.re...@gmail.com>
> > > wrote:
> > >
> > >> Hello Vinoo,
> > >> Can you please share a link where it says Parquet V2 is not official
> > >> or not stable for use by third parties?
> > >>
> > >>
> > >> On Wed, Apr 24, 2024 at 11:28 AM Vinoo Ganesh <vinoo.gan...@gmail.com>
> > >> wrote:
> > >>
> > >>> Hi Prem, Wes' comment on the thread you posted on the arrow dev list
> > >>> should clear up your confusion:
> > >>> https://lists.apache.org/thread/72qwr66wf3xyrl5cozgojz88ct23qzxx.
> > >>> There is a difference between the "standard" itself (parquet-format)
> > >>> and the implementation (parquet-mr, etc.).
> > >>>
> > >>> Parquet-format (https://github.com/apache/parquet-format) contains
> > >>> mostly just the docs and thrift definition now that a PR to clean up
> > >>> the remaining deprecated code was just merged. Releases of just this
> > >>> format, which again is mostly just docs, are what Gang was referring
> > >>> to in [2].
> > >>>
> > >>> We just started conversations about how a Parquet 2.0 release may
> > >>> look in the meeting yesterday. As these conversations progress, the
> > >>> dev list will be kept updated.
> > >>>
> > >>>
> > >>> On Wed, Apr 24, 2024 at 11:10 AM Prem Sahoo <prem.re...@gmail.com>
> > >>> wrote:
> > >>>
> > >>>> Hello Vinoo/Team,
> > >>>> As per the pyarrow team, they don't see any concern; please check
> > >>>> below. Please let us know *where it says Parquet V2 is not official*.
> > >>>>
> > >>>> "> *As per the Apache Parquet community, Parquet V2 is not final yet,
> > >>>> > so it is not official. They are advising not to use Parquet V2 for
> > >>>> > writing (though code is available).*
> > >>>>
> > >>>> This would be news to me. Parquet releases are listed (by the
> > >>>> parquet community) at [1]
> > >>>>
> > >>>> The vote to release parquet 2.10 is here: [2]
> > >>>>
> > >>>>
> > >>>> *Neither of these links mention anything about this being an
> > >>>> experimental, unofficial, or non-finalized release.*
> > >>>>
> > >>>> I understand your concern. I believe your quotes are coming from
> > >>>> your discussion on the parquet mailing list here [3]. This
> > >>>> communication is unfortunate and confusing to me as well.
> > >>>>
> > >>>> [1] https://parquet.apache.org/blog/
> > >>>> [2] https://lists.apache.org/thread/fdf1zz0f3xzz5zpvo6c811xjswhm1zy6
> > >>>> [3] https://lists.apache.org/thread/4nzroc68czwxnp0ndqz15kp1vhcd7vg3"
> > >>>>
> > >>>>
> > >>>> On Mon, Apr 22, 2024 at 4:56 PM Prem Sahoo <prem.re...@gmail.com>
> > >>>> wrote:
> > >>>>
> > >>>>> Hello Vinoo/Team,
> > >>>>> I was going through pyarrow and they have started using V2 as the
> > >>>>> default. Shouldn't they avoid it, as it is not official?
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html#pyarrow.parquet.write_table
> > >>>>>
> > >>>>> version{“1.0”, “2.4”, “2.6”}, default “2.6”
> > >>>>>
> > >>>>> Determine which Parquet logical types are available for use, whether
> > >>>>> the reduced set from the Parquet 1.x.x format or the expanded logical
> > >>>>> types added in later format versions. Files written with version=‘2.4’
> > >>>>> or ‘2.6’ may not be readable in all Parquet implementations, so
> > >>>>> version=‘1.0’ is likely the choice that maximizes file compatibility.
> > >>>>> UINT32 and some logical types are only available with version ‘2.4’.
> > >>>>> Nanosecond timestamps are only available with version ‘2.6’. Other
> > >>>>> features such as compression algorithms or the new serialized data
> > >>>>> page format must be enabled separately (see ‘compression’ and
> > >>>>> ‘data_page_version’).
> > >>>>>
> > >>>>
> >
>
