I tried this option, but Spark is still not creating V2 Parquet files: the file metadata still shows "format_version: 1.0". I think it needs something else too.
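
For reference, here is a minimal sketch of how the reported value can be checked on the pyarrow side. It assumes pyarrow is installed and that the "format_version: 1.0" above is the value pyarrow prints from the file metadata; the helper name format_versions_by_setting is made up for illustration. It writes one tiny table with each combination of version and data_page_version and reads format_version back, so the effect of each option can be compared separately (the write_table documentation quoted further down says the data page format is enabled separately from version).

```python
import os
import tempfile


def format_versions_by_setting():
    """Write a tiny table with several writer settings and report format_version.

    Returns a dict mapping (version, data_page_version) to the format_version
    string that pyarrow reads back from the written file's metadata.
    """
    # Imported here so the sketch can be loaded where pyarrow is not installed.
    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"id": [1, 2, 3]})

    # (logical format version, data page version) pairs to compare.
    settings = [
        ("1.0", "1.0"),
        ("1.0", "2.0"),
        ("2.6", "1.0"),
        ("2.6", "2.0"),
    ]

    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        for version, page_version in settings:
            path = os.path.join(tmp, f"v{version}_p{page_version}.parquet")
            pq.write_table(
                table,
                path,
                version=version,
                data_page_version=page_version,
            )
            # format_version is taken from the file-level metadata.
            results[(version, page_version)] = pq.read_metadata(path).format_version
    return results


if __name__ == "__main__":
    for setting, reported in format_versions_by_setting().items():
        print(setting, "->", reported)
```

If the file written with version="1.0" and data_page_version="2.0" still reports 1.0, then format_version on its own cannot tell whether V2 data pages were written, and the same may be true of the files Spark produces.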

On Wed, Apr 24, 2024 at 12:33 PM Adam Lippai <a...@rigo.sk> wrote:

> It supports writing v2, but defaults to v1.
> hadoopConfiguration.set("parquet.writer.version", "v2")
>
> Best regards,
> Adam Lippai
>
> On Wed, Apr 24, 2024 at 11:40 Prem Sahoo <prem.re...@gmail.com> wrote:
>
> > They do support Reading of Parquet V2 , but writing is not supported by Spark for V2.
> >
> > On Wed, Apr 24, 2024 at 11:10 AM Adam Lippai <a...@rigo.sk> wrote:
> >
> > > Hi Wes,
> > >
> > > As far as I remember hive, spark, impala, duckdb or even proprietary systems like hyper, Vertica all support reading data page v2 now. The most recent column encodings (BYTE_STREAM_SPLIT) might be missing, but overall the support seems much better than a year or two ago.
> > >
> > > Best regards,
> > > Adam Lippai
> > >
> > > On Wed, Apr 24, 2024 at 10:51 Wes McKinney <wesmck...@gmail.com> wrote:
> > >
> > > > I think there is confusion about the Parquet "V2" (including the V2 data pages, and other details) and the 2.x.y releases of the format library artifact. They aren't the same unfortunately. I don't think the V2 metadata structures (the data pages in particular, and new column encoding) is widely adopted / readable.
> > > >
> > > > On Wed, Apr 24, 2024 at 9:32 AM Weston Pace <weston.p...@gmail.com> wrote:
> > > >
> > > > > > *As per Apache Parquet Community Parquet V2 is not final yet so it is not official . They are advising not to use Parquet V2 for writing (though code is available ) .*
> > > > >
> > > > > This would be news to me. Parquet releases are listed (by the parquet community) at [1]
> > > > >
> > > > > The vote to release parquet 2.10 is here: [2]
> > > > >
> > > > > Neither of these links mention anything about this being an experimental, unofficial, or non-finalized release.
> > > > >
> > > > > I understand your concern. I believe your quotes are coming from your discussion on the parquet mailing list here [3]. This communication is unfortunate and confusing to me as well.
> > > > >
> > > > > [1] https://parquet.apache.org/blog/
> > > > > [2] https://lists.apache.org/thread/fdf1zz0f3xzz5zpvo6c811xjswhm1zy6
> > > > > [3] https://lists.apache.org/thread/4nzroc68czwxnp0ndqz15kp1vhcd7vg3
> > > > >
> > > > > On Wed, Apr 24, 2024 at 5:10 AM Prem Sahoo <prem.re...@gmail.com> wrote:
> > > > >
> > > > > > Hello Jacob,
> > > > > > Thanks for the information, and my apologies for the weird format of my email.
> > > > > >
> > > > > > This is the email from the Parquet community. May I know why pyarrow is using Parquet V2 which is not official yet ?
> > > > > >
> > > > > > My question is from Parquet community V2 is not final yet so it is not official yet.
> > > > > > "Hi Prem - Maybe I can help clarify to the best of my knowledge. Parquet V2 as a standard isn't finalized just yet. Meaning there is no formal, *finalized* "contract" that specifies what it means to write data in the V2 version. The discussions/conversations about what the final V2 standard may be are still in progress and are evolving.
> > > > > >
> > > > > > That being said, because V2 code does exist (though unfinalized), there are clients / tools that are writing data in the un-finalized V2 format, as seems to be the case with Dremio.
> > > > > >
> > > > > > Now, as that comment you quoted said, you can have Spark write V2 files, but it's worth being mindful about the fact that V2 is a moving target and can (and likely will) change. You can overwrite parquet.writer.version to specify your desired version, but it can be dangerous to produce data in a moving-target format. For example, let's say you write a bunch of data in Parquet V2, and then the community decides to make a breaking change (which is completely fine / allowed since V2 isn't finalized). You are now left having to deal with a potentially large and complicated file format update. That's why it's not recommended to write files in parquet v2 just yet."
> > > > > >
> > > > > > *As per Apache Parquet Community Parquet V2 is not final yet so it is not official . They are advising not to use Parquet V2 for writing (though code is available ) .*
> > > > > >
> > > > > > *As per above Spark hasn't started using Parquet V2 for writing *.
> > > > > >
> > > > > > May I know how an unstable /unofficial version is being used in pyarrow ?
> > > > > >
> > > > > > On Wed, Apr 24, 2024 at 12:43 AM Jacob Wujciak <assignu...@apache.org> wrote:
> > > > > >
> > > > > > > Hello,
> > > > > > >
> > > > > > > First off, please try to clean up formating of emails to be legible when forwarding/quoting previous messages multiple times, especially when most of the quotes do not contain any useful information. It makes it much easier to parse the message and thus quicker to answer.
> > > > > > >
> > > > > > > The short answer is that we switched to 2.4 and more recently to 2.6 as the default to enable the usage of features these versions provide. As you have correctly quoted from the docs you can still write 1.0 if you want to ensure compatibility with systems that can not process the 'newer' versions yet (2.6 was released in 2018!).
> > > > > > >
> > > > > > > You can find the long form discussions about these changes here:
> > > > > > > https://issues.apache.org/jira/browse/ARROW-12203
> > > > > > > https://lists.apache.org/thread/027g366yr3m03hwtpst6sr58b3trwhsm
> > > > > > >
> > > > > > > Best
> > > > > > > Jacob
> > > > > > >
> > > > > > > On 2024/04/24 02:32:01 Prem Sahoo wrote:
> > > > > > > > Hello Team,
> > > > > > > > Could you please share your thoughts about below questions?
> > > > > > > > Sent from my iPhone
> > > > > > > >
> > > > > > > > Begin forwarded message:
> > > > > > > >
> > > > > > > > > From: Prem Sahoo <prem.re...@gmail.com>
> > > > > > > > > Date: April 23, 2024 at 11:03:48 AM EDT
> > > > > > > > > To: dev-ow...@arrow.apache.org
> > > > > > > > > Subject: Re: PyArrow Using Parquet V2
> > > > > > > > >
> > > > > > > > > dev@arrow.apache.org
> > > > > > > > > Sent from my iPhone
> > > > > > > > >
> > > > > > > > >>> On Apr 23, 2024, at 6:25 AM, Prem Sahoo <prem.re...@gmail.com> wrote:
> > > > > > > > >>>
> > > > > > > > >> Hello Team,
> > > > > > > > >> Could anyone please help me on below query?
> > > > > > > > >> Sent from my iPhone
> > > > > > > > >>
> > > > > > > > >>>> On Apr 22, 2024, at 10:01 PM, Prem Sahoo <prem.re...@gmail.com> wrote:
> > > > > > > > >>>>
> > > > > > > > >>> Sent from my iPhone
> > > > > > > > >>>
> > > > > > > > >>>>> On Apr 22, 2024, at 9:51 PM, Prem Sahoo <prem.re...@gmail.com> wrote:
> > > > > > > > >>>>>
> > > > > > > > >>>>> Hello Team,
> > > > > > > > >>>>> I have a question regarding Parquet V2 writing thro pyarrow .
> > > > > > > > >>>>> As per below Pyarrow started writing Parquet in V2 encoding.
> > > > > > > > >>>>>
> > > > > > > > >>>>> https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html#pyarrow.parquet.write_table
> > > > > > > > >>>>>
> > > > > > > > >>>>> version{"1.0", "2.4", "2.6"}, default "2.6"
> > > > > > > > >>>>> Determine which Parquet logical types are available for use, whether the reduced set from the Parquet 1.x.x format or the expanded logical types added in later format versions. Files written with version='2.4' or '2.6' may not be readable in all Parquet implementations, so version='1.0' is likely the choice that maximizes file compatibility. UINT32 and some logical types are only available with version '2.4'. Nanosecond timestamps are only available with version '2.6'. Other features such as compression algorithms or the new serialized data page format must be enabled separately (see 'compression' and 'data_page_version').
> > > > > > > > >>>>>
> > > > > > > > >>>>> As per Apache Parquet Community Parquet V2 is not final yet so it is not official . They are advising not to use Parquet V2 for writing (though code is available ) .
> > > > > > > > >>>>>
> > > > > > > > >>>>> As per above Spark hasn't started using Parquet V2 for writing .
> > > > > > > > >>>>> May I know how an unstable /unofficial version is being used in pyarrow ?