I think there is confusion between Parquet "V2" (the V2 data pages and other details) and the 2.x.y releases of the format library artifact. Unfortunately, they aren't the same thing. I don't think the V2 metadata structures (the data pages in particular, and the new column encodings) are widely adopted or readable.
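To make the distinction concrete, here is a minimal sketch, assuming the `version` and `data_page_version` keyword arguments of `pyarrow.parquet.write_table` behave as described in the docs quoted further down this thread. The helper names `parquet_write_options` and `write_table_with_options` are hypothetical and only for illustration; the point is that the format version (which logical types are available) and the data page version (the "V2" pages) are set independently.

```python
def parquet_write_options(max_compat: bool = False) -> dict:
    """Build keyword arguments for pyarrow.parquet.write_table.

    "version" selects the set of logical types from the format releases
    ("1.0", "2.4", "2.6"). "data_page_version" selects the serialized
    data page layout and is a separate switch; "1.0" keeps the V1 pages.
    """
    return {
        # "1.0" is the choice the quoted docs suggest for maximum compatibility.
        "version": "1.0" if max_compat else "2.6",
        # Keep V1 data pages regardless of the format version chosen above.
        "data_page_version": "1.0",
    }


def write_table_with_options(table, path, max_compat: bool = False) -> None:
    """Write a pyarrow Table using the options above (requires pyarrow)."""
    import pyarrow.parquet as pq  # imported lazily so the helper above has no dependency

    pq.write_table(table, path, **parquet_write_options(max_compat))
```

For example, `write_table_with_options(table, "out.parquet", max_compat=True)` would ask for the 1.x logical types, while the default call asks for the 2.6 logical types; in both cases the file still uses V1 data pages.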
On Wed, Apr 24, 2024 at 9:32 AM Weston Pace <weston.p...@gmail.com> wrote:

> > *As per Apache Parquet Community Parquet V2 is not final yet so it is not official . They are advising not to use Parquet V2 for writing (though code is available ) .*
>
> This would be news to me. Parquet releases are listed (by the parquet community) at [1]
>
> The vote to release parquet 2.10 is here: [2]
>
> Neither of these links mention anything about this being an experimental, unofficial, or non-finalized release.
>
> I understand your concern. I believe your quotes are coming from your discussion on the parquet mailing list here [3]. This communication is unfortunate and confusing to me as well.
>
> [1] https://parquet.apache.org/blog/
> [2] https://lists.apache.org/thread/fdf1zz0f3xzz5zpvo6c811xjswhm1zy6
> [3] https://lists.apache.org/thread/4nzroc68czwxnp0ndqz15kp1vhcd7vg3
>
> On Wed, Apr 24, 2024 at 5:10 AM Prem Sahoo <prem.re...@gmail.com> wrote:
>
> > Hello Jacob,
> > Thanks for the information, and my apologies for the weird format of my email.
> >
> > This is the email from the Parquet community. May I know why pyarrow is using Parquet V2 which is not official yet ?
> >
> > My question is from Parquet community V2 is not final yet so it is not official yet.
> > "Hi Prem - Maybe I can help clarify to the best of my knowledge. Parquet V2 as a standard isn't finalized just yet. Meaning there is no formal, *finalized* "contract" that specifies what it means to write data in the V2 version. The discussions/conversations about what the final V2 standard may be are still in progress and are evolving.
> >
> > That being said, because V2 code does exist (though unfinalized), there are clients / tools that are writing data in the un-finalized V2 format, as seems to be the case with Dremio.
> >
> > Now, as that comment you quoted said, you can have Spark write V2 files, but it's worth being mindful about the fact that V2 is a moving target and can (and likely will) change. You can overwrite parquet.writer.version to specify your desired version, but it can be dangerous to produce data in a moving-target format. For example, let's say you write a bunch of data in Parquet V2, and then the community decides to make a breaking change (which is completely fine / allowed since V2 isn't finalized). You are now left having to deal with a potentially large and complicated file format update. That's why it's not recommended to write files in parquet v2 just yet."
> >
> > *As per Apache Parquet Community Parquet V2 is not final yet so it is not official . They are advising not to use Parquet V2 for writing (though code is available ) .*
> >
> > *As per above Spark hasn't started using Parquet V2 for writing *.
> >
> > May I know how an unstable /unofficial version is being used in pyarrow ?
> >
> > On Wed, Apr 24, 2024 at 12:43 AM Jacob Wujciak <assignu...@apache.org> wrote:
> >
> > > Hello,
> > >
> > > First off, please try to clean up formating of emails to be legible when forwarding/quoting previous messages multiple times, especially when most of the quotes do not contain any useful information. It makes it much easier to parse the message and thus quicker to answer.
> > >
> > > The short answer is that we switched to 2.4 and more recently to 2.6 as the default to enable the usage of features these versions provide. As you have correctly quoted from the docs you can still write 1.0 if you want to ensure compatibility with systems that can not process the 'newer' versions yet (2.6 was released in 2018!).
> > >
> > > You can find the long form discussions about these changes here:
> > > https://issues.apache.org/jira/browse/ARROW-12203
> > > https://lists.apache.org/thread/027g366yr3m03hwtpst6sr58b3trwhsm
> > >
> > > Best
> > > Jacob
> > >
> > > On 2024/04/24 02:32:01 Prem Sahoo wrote:
> > > > Hello Team,
> > > > Could you please share your thoughts about below questions?
> > > > Sent from my iPhone
> > > >
> > > > Begin forwarded message:
> > > >
> > > > > From: Prem Sahoo <prem.re...@gmail.com>
> > > > > Date: April 23, 2024 at 11:03:48 AM EDT
> > > > > To: dev-ow...@arrow.apache.org
> > > > > Subject: Re: PyArrow Using Parquet V2
> > > > >
> > > > > dev@arrow.apache.org
> > > > > Sent from my iPhone
> > > > >
> > > > >>> On Apr 23, 2024, at 6:25 AM, Prem Sahoo <prem.re...@gmail.com> wrote:
> > > > >>>
> > > > >> Hello Team,
> > > > >> Could anyone please help me on below query?
> > > > >> Sent from my iPhone
> > > > >>
> > > > >>>> On Apr 22, 2024, at 10:01 PM, Prem Sahoo <prem.re...@gmail.com> wrote:
> > > > >>>>
> > > > >>>
> > > > >>> Sent from my iPhone
> > > > >>>
> > > > >>>>> On Apr 22, 2024, at 9:51 PM, Prem Sahoo <prem.re...@gmail.com> wrote:
> > > > >>>>>
> > > > >>>>> Hello Team,
> > > > >>>>> I have a question regarding Parquet V2 writing thro pyarrow .
> > > > >>>>> As per below Pyarrow started writing Parquet in V2 encoding.
> > > > >>>>>
> > > > >>>>> https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html#pyarrow.parquet.write_table
> > > > >>>>>
> > > > >>>>> version{“1.0”, “2.4”, “2.6”}, default “2.6”
> > > > >>>>> Determine which Parquet logical types are available for use, whether the reduced set from the Parquet 1.x.x format or the expanded logical types added in later format versions.
> > > > >>>>> Files written with version=’2.4’ or ‘2.6’ may not be readable in all Parquet implementations, so version=’1.0’ is likely the choice that maximizes file compatibility. UINT32 and some logical types are only available with version ‘2.4’. Nanosecond timestamps are only available with version ‘2.6’. Other features such as compression algorithms or the new serialized data page format must be enabled separately (see ‘compression’ and ‘data_page_version’).
> > > > >>>>>
> > > > >>>>> As per Apache Parquet Community Parquet V2 is not final yet so it is not official . They are advising not to use Parquet V2 for writing (though code is available ) .
> > > > >>>>>
> > > > >>>>> As per above Spark hasn't started using Parquet V2 for writing .
> > > > >>>>> May I know how an unstable /unofficial version is being used in pyarrow ?