As an outsider, I suspect the only reason for these “common beliefs” is that
Spark simply doesn’t support some of the breaking features (e.g. the
nanosecond timestamp type). Closing those few gaps might resolve the
issue for good.

Best regards,
Adam Lippai

On Wed, Apr 24, 2024 at 10:32 Weston Pace <weston.p...@gmail.com> wrote:

> > *As per Apache Parquet Community Parquet V2 is not final yet so it is not
> > official. They are advising not to use Parquet V2 for writing (though
> > code is available).*
>
> This would be news to me.  Parquet releases are listed (by the parquet
> community) at [1]
>
> The vote to release parquet 2.10 is here: [2]
>
> Neither of these links mention anything about this being an experimental,
> unofficial, or non-finalized release.
>
> I understand your concern.  I believe your quotes are coming from your
> discussion on the parquet mailing list here [3].  This communication is
> unfortunate and confusing to me as well.
>
> [1] https://parquet.apache.org/blog/
> [2] https://lists.apache.org/thread/fdf1zz0f3xzz5zpvo6c811xjswhm1zy6
> [3] https://lists.apache.org/thread/4nzroc68czwxnp0ndqz15kp1vhcd7vg3
>
>
> On Wed, Apr 24, 2024 at 5:10 AM Prem Sahoo <prem.re...@gmail.com> wrote:
>
> > Hello Jacob,
> > Thanks for the information, and my apologies for the weird format of my
> > email.
> >
> > This is the email from the Parquet community. May I know why pyarrow is
> > using Parquet V2, which is not official yet?
> >
> > My question arises because, according to the Parquet community, V2 is not
> > final yet and so is not official:
> > "Hi Prem - Maybe I can help clarify to the best of my knowledge. Parquet V2
> > as a standard isn't finalized just yet. Meaning there is no formal,
> > *finalized* "contract" that specifies what it means to write data in the V2
> > version. The discussions/conversations about what the final V2 standard may
> > be are still in progress and are evolving.
> >
> > That being said, because V2 code does exist (though unfinalized), there are
> > clients / tools that are writing data in the un-finalized V2 format, as
> > seems to be the case with Dremio.
> >
> > Now, as that comment you quoted said, you can have Spark write V2 files,
> > but it's worth being mindful about the fact that V2 is a moving target and
> > can (and likely will) change. You can overwrite parquet.writer.version to
> > specify your desired version, but it can be dangerous to produce data in a
> > moving-target format. For example, let's say you write a bunch of data in
> > Parquet V2, and then the community decides to make a breaking change (which
> > is completely fine / allowed since V2 isn't finalized). You are now left
> > having to deal with a potentially large and complicated file format update.
> > That's why it's not recommended to write files in parquet v2 just yet."
> >
> >
> > *As per Apache Parquet Community Parquet V2 is not final yet so it is not
> > official. They are advising not to use Parquet V2 for writing (though
> > code is available).*
> >
> >
> > *As per above Spark hasn't started using Parquet V2 for writing.*
> >
> > May I know how an unstable/unofficial version is being used in pyarrow?
> >
> >
> > On Wed, Apr 24, 2024 at 12:43 AM Jacob Wujciak <assignu...@apache.org>
> > wrote:
> >
> > > Hello,
> > >
> > > First off, please try to clean up the formatting of emails so they stay
> > > legible when forwarding/quoting previous messages multiple times,
> > > especially when most of the quotes do not contain any useful information.
> > > It makes the message much easier to parse and thus quicker to answer.
> > >
> > > The short answer is that we switched to 2.4 and more recently to 2.6 as
> > > the default to enable the usage of features these versions provide. As you
> > > have correctly quoted from the docs, you can still write 1.0 if you want to
> > > ensure compatibility with systems that cannot process the 'newer' versions
> > > yet (2.6 was released in 2018!).
> > >
> > > You can find the long form discussions about these changes here:
> > > https://issues.apache.org/jira/browse/ARROW-12203
> > > https://lists.apache.org/thread/027g366yr3m03hwtpst6sr58b3trwhsm
> > >
> > > Best
> > > Jacob
> > >
> > > On 2024/04/24 02:32:01 Prem Sahoo wrote:
> > > > Hello Team,
> > > > Could you please share your thoughts about below questions?
> > > > Sent from my iPhone
> > > >
> > > > Begin forwarded message:
> > > >
> > > > > From: Prem Sahoo <prem.re...@gmail.com>
> > > > > Date: April 23, 2024 at 11:03:48 AM EDT
> > > > > To: dev-ow...@arrow.apache.org
> > > > > Subject: Re: PyArrow Using Parquet V2
> > > > >
> > > > > dev@arrow.apache.org
> > > > > Sent from my iPhone
> > > > >
> > > > > >>> On Apr 23, 2024, at 6:25 AM, Prem Sahoo <prem.re...@gmail.com> wrote:
> > > > >>>
> > > > >> Hello Team,
> > > > >> Could anyone please help me on below query?
> > > > >> Sent from my iPhone
> > > > >>
> > > > > >>>>> On Apr 22, 2024, at 9:51 PM, Prem Sahoo <prem.re...@gmail.com> wrote:
> > > > >>>>>
> > > > >>>> 
> > > > >>>>
> > > > >>>>>
> > > > >>>>> 
> > > > >>>>> Hello Team,
> > > > > >>>>> I have a question regarding Parquet V2 writing through pyarrow.
> > > > > >>>>> As per below, pyarrow has started writing Parquet in V2 encoding.
> > > > >>>>>
> > > > > >>>>> https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html#pyarrow.parquet.write_table
> > > > >>>>>
> > > > >>>>> version{“1.0”, “2.4”, “2.6”}, default “2.6”
> > > > > >>>>> Determine which Parquet logical types are available for use,
> > > > > >>>>> whether the reduced set from the Parquet 1.x.x format or the
> > > > > >>>>> expanded logical types added in later format versions. Files
> > > > > >>>>> written with version=‘2.4’ or ‘2.6’ may not be readable in all
> > > > > >>>>> Parquet implementations, so version=‘1.0’ is likely the choice
> > > > > >>>>> that maximizes file compatibility. UINT32 and some logical types
> > > > > >>>>> are only available with version ‘2.4’. Nanosecond timestamps are
> > > > > >>>>> only available with version ‘2.6’. Other features such as
> > > > > >>>>> compression algorithms or the new serialized data page format
> > > > > >>>>> must be enabled separately (see ‘compression’ and
> > > > > >>>>> ‘data_page_version’).
> > > > >>>>>
> > > > >>>>>
> > > > > >>>>> As per Apache Parquet Community Parquet V2 is not final yet so
> > > > > >>>>> it is not official. They are advising not to use Parquet V2 for
> > > > > >>>>> writing (though code is available).
> > > > > >>>>>
> > > > > >>>>> As per above Spark hasn't started using Parquet V2 for writing.
> > > > > >>>>> May I know how an unstable/unofficial version is being used in
> > > > > >>>>> pyarrow?
> > > > >>>>>
> > > >
> > >
> >
>
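
For readers following the thread, the version choice described in the quoted
pyarrow docs can be sketched roughly as below. This is only a minimal sketch
and not pyarrow's own logic: `pick_parquet_version`, `write_compatible`, and
the feature labels `uint32` and `timestamp_ns` are hypothetical names made up
for illustration. Only `pyarrow.parquet.write_table` and its `version` and
`data_page_version` parameters come from the documentation quoted above. The
sketch assumes "1.0" is preferred unless a listed feature needs a newer format
version.

```python
# Minimal sketch: pick the lowest Parquet format version that still supports
# the features a table needs, following the write_table docs quoted above.
# Helper names and feature labels are hypothetical, not part of pyarrow.

# Format versions accepted by write_table, oldest (most compatible) first.
VERSION_ORDER = ["1.0", "2.4", "2.6"]

# Feature label -> lowest format version that supports it, per the quoted docs.
MIN_VERSION_FOR_FEATURE = {
    "uint32": "2.4",        # UINT32 needs at least 2.4
    "timestamp_ns": "2.6",  # nanosecond timestamps need 2.6
}


def pick_parquet_version(required_features):
    """Return "1.0" unless a required feature needs a newer format version."""
    chosen = "1.0"
    for feature in required_features:
        # Unknown labels are assumed to work with the 1.0 type set.
        needed = MIN_VERSION_FOR_FEATURE.get(feature, "1.0")
        if VERSION_ORDER.index(needed) > VERSION_ORDER.index(chosen):
            chosen = needed
    return chosen


def write_compatible(table, path, required_features=()):
    """Write `table` using the most widely readable version that still works."""
    import pyarrow.parquet as pq  # imported lazily; requires pyarrow

    pq.write_table(
        table,
        path,
        version=pick_parquet_version(required_features),
        data_page_version="1.0",  # the data page format is a separate setting
    )
```

For example, `write_compatible(table, "out.parquet", ["timestamp_ns"])` would
request version "2.6", while leaving the feature list empty keeps the more
widely readable "1.0".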
