It supports writing v2, but defaults to v1. You can switch the writer version through the Hadoop configuration:

    hadoopConfiguration.set("parquet.writer.version", "v2")
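
If you are on PySpark, a rough sketch of the same setting could look like the block below. It is untested against a real cluster and rests on two assumptions: that Spark copies any "spark.hadoop.<name>" property into the Hadoop configuration, and that the Parquet writer picks up parquet.writer.version from there. The helper names parquet_writer_conf and build_session are made up for illustration.

```python
from typing import Dict

# Hadoop property named in the message above.
WRITER_VERSION_KEY = "parquet.writer.version"


def parquet_writer_conf(writer_version: str = "v2") -> Dict[str, str]:
    """Build Spark conf entries that forward the writer version to Hadoop.

    Assumption: Spark copies any "spark.hadoop.<name>" property into the
    Hadoop configuration as "<name>".
    """
    if writer_version not in ("v1", "v2"):
        raise ValueError(f"unexpected writer version: {writer_version!r}")
    return {f"spark.hadoop.{WRITER_VERSION_KEY}": writer_version}


def build_session(app_name: str, writer_version: str = "v2"):
    """Create a SparkSession with the writer version applied (needs pyspark)."""
    # Imported lazily so parquet_writer_conf can be used without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in parquet_writer_conf(writer_version).items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

Something like build_session("parquet-v2-check").range(10).write.parquet("/tmp/v2_check") (the path is only an example) would then write a small file you can inspect with another reader. Note that getOrCreate() returns an already running session if there is one, in which case the setting may not take effect.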

Best regards,
Adam Lippai

On Wed, Apr 24, 2024 at 11:40 Prem Sahoo <prem.re...@gmail.com> wrote:

> They do support Reading of Parquet V2 , but writing is not supported by
> Spark for V2.
>
> On Wed, Apr 24, 2024 at 11:10 AM Adam Lippai <a...@rigo.sk> wrote:
>
> > Hi Wes,
> >
> > As far as I remember hive, spark, impala, duckdb or even proprietary
> > systems like hyper, Vertica all support reading data page v2 now. The most
> > recent column encodings (BYTE_STREAM_SPLIT) might be missing, but overall
> > the support seems much better than a year or two ago.
> >
> > Best regards,
> > Adam Lippai
> >
> > On Wed, Apr 24, 2024 at 10:51 Wes McKinney <wesmck...@gmail.com> wrote:
> >
> > > I think there is confusion about the Parquet "V2" (including the V2 data
> > > pages, and other details) and the 2.x.y releases of the format library
> > > artifact. They aren't the same unfortunately. I don't think the V2 metadata
> > > structures (the data pages in particular, and new column encoding) is
> > > widely adopted / readable.
> > >
> > > On Wed, Apr 24, 2024 at 9:32 AM Weston Pace <weston.p...@gmail.com> wrote:
> > >
> > > > > *As per Apache Parquet Community Parquet V2 is not final yet so it is not
> > > > > official . They are advising not to use Parquet V2 for writing (though code
> > > > > is available ) .*
> > > >
> > > > This would be news to me. Parquet releases are listed (by the parquet
> > > > community) at [1]
> > > >
> > > > The vote to release parquet 2.10 is here: [2]
> > > >
> > > > Neither of these links mention anything about this being an experimental,
> > > > unofficial, or non-finalized release.
> > > >
> > > > I understand your concern. I believe your quotes are coming from your
> > > > discussion on the parquet mailing list here [3]. This communication is
> > > > unfortunate and confusing to me as well.
> > > >
> > > > [1] https://parquet.apache.org/blog/
> > > > [2] https://lists.apache.org/thread/fdf1zz0f3xzz5zpvo6c811xjswhm1zy6
> > > > [3] https://lists.apache.org/thread/4nzroc68czwxnp0ndqz15kp1vhcd7vg3
> > > >
> > > > On Wed, Apr 24, 2024 at 5:10 AM Prem Sahoo <prem.re...@gmail.com> wrote:
> > > >
> > > > > Hello Jacob,
> > > > > Thanks for the information, and my apologies for the weird format of my
> > > > > email.
> > > > >
> > > > > This is the email from the Parquet community. May I know why pyarrow is
> > > > > using Parquet V2 which is not official yet ?
> > > > >
> > > > > My question is from Parquet community V2 is not final yet so it is not
> > > > > official yet.
> > > > > "Hi Prem - Maybe I can help clarify to the best of my knowledge. Parquet V2
> > > > > as a standard isn't finalized just yet. Meaning there is no formal,
> > > > > *finalized* "contract" that specifies what it means to write data in the V2
> > > > > version. The discussions/conversations about what the final V2 standard may
> > > > > be are still in progress and are evolving.
> > > > >
> > > > > That being said, because V2 code does exist (though unfinalized), there are
> > > > > clients / tools that are writing data in the un-finalized V2 format, as
> > > > > seems to be the case with Dremio.
> > > > >
> > > > > Now, as that comment you quoted said, you can have Spark write V2 files,
> > > > > but it's worth being mindful about the fact that V2 is a moving target and
> > > > > can (and likely will) change. You can overwrite parquet.writer.version to
> > > > > specify your desired version, but it can be dangerous to produce data in a
> > > > > moving-target format.
> > > > > For example, let's say you write a bunch of data in Parquet V2, and then
> > > > > the community decides to make a breaking change (which is completely
> > > > > fine / allowed since V2 isn't finalized). You are now left having to deal
> > > > > with a potentially large and complicated file format update. That's why
> > > > > it's not recommended to write files in parquet v2 just yet."
> > > > >
> > > > > *As per Apache Parquet Community Parquet V2 is not final yet so it is not
> > > > > official . They are advising not to use Parquet V2 for writing (though code
> > > > > is available ) .*
> > > > >
> > > > > *As per above Spark hasn't started using Parquet V2 for writing *.
> > > > >
> > > > > May I know how an unstable /unofficial version is being used in pyarrow ?
> > > > >
> > > > > On Wed, Apr 24, 2024 at 12:43 AM Jacob Wujciak <assignu...@apache.org> wrote:
> > > > >
> > > > > > Hello,
> > > > > >
> > > > > > First off, please try to clean up formating of emails to be legible when
> > > > > > forwarding/quoting previous messages multiple times, especially when most
> > > > > > of the quotes do not contain any useful information. It makes it much
> > > > > > easier to parse the message and thus quicker to answer.
> > > > > >
> > > > > > The short answer is that we switched to 2.4 and more recently to 2.6 as
> > > > > > the default to enable the usage of features these versions provide. As you
> > > > > > have correctly quoted from the docs you can still write 1.0 if you want to
> > > > > > ensure compatibility with systems that can not process the 'newer' versions
> > > > > > yet (2.6 was released in 2018!).
> > > > > >
> > > > > > You can find the long form discussions about these changes here:
> > > > > > https://issues.apache.org/jira/browse/ARROW-12203
> > > > > > https://lists.apache.org/thread/027g366yr3m03hwtpst6sr58b3trwhsm
> > > > > >
> > > > > > Best
> > > > > > Jacob
> > > > > >
> > > > > > On 2024/04/24 02:32:01 Prem Sahoo wrote:
> > > > > > > Hello Team,
> > > > > > > Could you please share your thoughts about below questions?
> > > > > > > Sent from my iPhone
> > > > > > >
> > > > > > > Begin forwarded message:
> > > > > > >
> > > > > > > > From: Prem Sahoo <prem.re...@gmail.com>
> > > > > > > > Date: April 23, 2024 at 11:03:48 AM EDT
> > > > > > > > To: dev-ow...@arrow.apache.org
> > > > > > > > Subject: Re: PyArrow Using Parquet V2
> > > > > > > >
> > > > > > > > dev@arrow.apache.org
> > > > > > > > Sent from my iPhone
> > > > > > > >
> > > > > > > >>> On Apr 23, 2024, at 6:25 AM, Prem Sahoo <prem.re...@gmail.com> wrote:
> > > > > > > >>>
> > > > > > > >> Hello Team,
> > > > > > > >> Could anyone please help me on below query?
> > > > > > > >> Sent from my iPhone
> > > > > > > >>
> > > > > > > >>>> On Apr 22, 2024, at 10:01 PM, Prem Sahoo <prem.re...@gmail.com> wrote:
> > > > > > > >>>>
> > > > > > > >>> Sent from my iPhone
> > > > > > > >>>
> > > > > > > >>>>> On Apr 22, 2024, at 9:51 PM, Prem Sahoo <prem.re...@gmail.com> wrote:
> > > > > > > >>>>>
> > > > > > > >>>>> Hello Team,
> > > > > > > >>>>> I have a question regarding Parquet V2 writing thro pyarrow .
> > > > > > > >>>>> As per below Pyarrow started writing Parquet in V2 encoding.
> > > > > > > >>>>>
> > > > > > > >>>>> https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html#pyarrow.parquet.write_table
> > > > > > > >>>>>
> > > > > > > >>>>> version{“1.0”, “2.4”, “2.6”}, default “2.6”
> > > > > > > >>>>> Determine which Parquet logical types are available for use, whether the reduced set from the Parquet 1.x.x format or the expanded logical types added in later format versions. Files written with version=’2.4’ or ‘2.6’ may not be readable in all Parquet implementations, so version=’1.0’ is likely the choice that maximizes file compatibility. UINT32 and some logical types are only available with version ‘2.4’. Nanosecond timestamps are only available with version ‘2.6’. Other features such as compression algorithms or the new serialized data page format must be enabled separately (see ‘compression’ and ‘data_page_version’).
> > > > > > > >>>>>
> > > > > > > >>>>> As per Apache Parquet Community Parquet V2 is not final yet so it is not official . They are advising not to use Parquet V2 for writing (though code is available ) .
> > > > > > > >>>>>
> > > > > > > >>>>> As per above Spark hasn't started using Parquet V2 for writing .
> > > > > > > >>>>> May I know how an unstable /unofficial version is being used in pyarrow ?