Re: Fwd: PyArrow Using Parquet V2

Prem Sahoo Wed, 24 Apr 2024 12:51:15 -0700

I tried this option, but Spark is still not creating V2 Parquet: I can
still see "format_version: 1.0". I think it needs something else as well.


On Wed, Apr 24, 2024 at 12:33 PM Adam Lippai <a...@rigo.sk> wrote:

> It supports writing v2, but defaults to v1.
> hadoopConfiguration.set("parquet.writer.version", "v2")
>
> Best regards,
> Adam Lippai
>
>
> On Wed, Apr 24, 2024 at 11:40 Prem Sahoo <prem.re...@gmail.com> wrote:
>
> > They do support reading Parquet V2, but writing V2 is not supported by
> > Spark.
> >
> > On Wed, Apr 24, 2024 at 11:10 AM Adam Lippai <a...@rigo.sk> wrote:
> >
> > > Hi Wes,
> > >
> > > As far as I remember, Hive, Spark, Impala, DuckDB, and even proprietary
> > > systems like Hyper and Vertica all support reading data page v2 now. The
> > > most recent column encodings (BYTE_STREAM_SPLIT) might be missing, but
> > > overall the support seems much better than a year or two ago.
> > >
> > > Best regards,
> > > Adam Lippai
> > >
> > > On Wed, Apr 24, 2024 at 10:51 Wes McKinney <wesmck...@gmail.com> wrote:
> > >
> > > > I think there is confusion about the Parquet "V2" (including the V2
> > > > data pages, and other details) and the 2.x.y releases of the format
> > > > library artifact. They aren't the same, unfortunately. I don't think
> > > > the V2 metadata structures (the data pages in particular, and the new
> > > > column encodings) are widely adopted / readable.
> > > >
> > > > On Wed, Apr 24, 2024 at 9:32 AM Weston Pace <weston.p...@gmail.com> wrote:
> > > >
> > > > > > *As per Apache Parquet Community Parquet V2 is not final yet so it
> > > > > > is not official. They are advising not to use Parquet V2 for
> > > > > > writing (though code is available).*
> > > > >
> > > > > This would be news to me. Parquet releases are listed (by the
> > > > > parquet community) at [1]
> > > > >
> > > > > The vote to release parquet 2.10 is here: [2]
> > > > >
> > > > > Neither of these links mentions anything about this being an
> > > > > experimental, unofficial, or non-finalized release.
> > > > >
> > > > > I understand your concern. I believe your quotes are coming from
> > > > > your discussion on the parquet mailing list here [3]. This
> > > > > communication is unfortunate and confusing to me as well.
> > > > >
> > > > > [1] https://parquet.apache.org/blog/
> > > > > [2] https://lists.apache.org/thread/fdf1zz0f3xzz5zpvo6c811xjswhm1zy6
> > > > > [3] https://lists.apache.org/thread/4nzroc68czwxnp0ndqz15kp1vhcd7vg3
> > > > >
> > > > >
> > > > > On Wed, Apr 24, 2024 at 5:10 AM Prem Sahoo <prem.re...@gmail.com> wrote:
> > > > >
> > > > > > Hello Jacob,
> > > > > > Thanks for the information, and my apologies for the weird format
> > > > > > of my email.
> > > > > >
> > > > > > This is the email from the Parquet community. May I know why
> > > > > > pyarrow is using Parquet V2, which is not official yet?
> > > > > >
> > > > > > My question is: according to the Parquet community, V2 is not final
> > > > > > yet, so it is not official yet.
> > > > > > "Hi Prem - Maybe I can help clarify to the best of my knowledge.
> > > > > > Parquet V2 as a standard isn't finalized just yet. Meaning there is
> > > > > > no formal, *finalized* "contract" that specifies what it means to
> > > > > > write data in the V2 version. The discussions/conversations about
> > > > > > what the final V2 standard may be are still in progress and are
> > > > > > evolving.
> > > > > >
> > > > > > That being said, because V2 code does exist (though unfinalized),
> > > > > > there are clients / tools that are writing data in the un-finalized
> > > > > > V2 format, as seems to be the case with Dremio.
> > > > > >
> > > > > > Now, as that comment you quoted said, you can have Spark write V2
> > > > > > files, but it's worth being mindful about the fact that V2 is a
> > > > > > moving target and can (and likely will) change. You can overwrite
> > > > > > parquet.writer.version to specify your desired version, but it can
> > > > > > be dangerous to produce data in a moving-target format. For example,
> > > > > > let's say you write a bunch of data in Parquet V2, and then the
> > > > > > community decides to make a breaking change (which is completely
> > > > > > fine / allowed since V2 isn't finalized). You are now left having
> > > > > > to deal with a potentially large and complicated file format
> > > > > > update. That's why it's not recommended to write files in parquet
> > > > > > v2 just yet."
> > > > > >
> > > > > >
> > > > > > *As per Apache Parquet Community Parquet V2 is not final yet so it
> > > > > > is not official. They are advising not to use Parquet V2 for
> > > > > > writing (though code is available).*
> > > > > >
> > > > > >
> > > > > > *As per above Spark hasn't started using Parquet V2 for writing.*
> > > > > >
> > > > > > May I know how an unstable/unofficial version is being used in
> > > > > > pyarrow?
> > > > > >
> > > > > >
> > > > > > On Wed, Apr 24, 2024 at 12:43 AM Jacob Wujciak <assignu...@apache.org> wrote:
> > > > > >
> > > > > > > Hello,
> > > > > > >
> > > > > > > First off, please try to clean up the formatting of emails to be
> > > > > > > legible when forwarding/quoting previous messages multiple
> > > > > > > times, especially when most of the quotes do not contain any
> > > > > > > useful information. It makes it much easier to parse the
> > > > > > > message and thus quicker to answer.
> > > > > > >
> > > > > > > The short answer is that we switched to 2.4 and more recently
> > > > > > > to 2.6 as the default to enable the usage of features these
> > > > > > > versions provide. As you have correctly quoted from the docs,
> > > > > > > you can still write 1.0 if you want to ensure compatibility
> > > > > > > with systems that cannot process the 'newer' versions yet
> > > > > > > (2.6 was released in 2018!).
> > > > > > >
> > > > > > > You can find the long form discussions about these changes
> > > > > > > here:
> > > > > > > https://issues.apache.org/jira/browse/ARROW-12203
> > > > > > > https://lists.apache.org/thread/027g366yr3m03hwtpst6sr58b3trwhsm
> > > > > > >
> > > > > > > Best
> > > > > > > Jacob
> > > > > > >
> > > > > > > On 2024/04/24 02:32:01 Prem Sahoo wrote:
> > > > > > > > Hello Team,
> > > > > > > > Could you please share your thoughts about below questions?
> > > > > > > > Sent from my iPhone
> > > > > > > >
> > > > > > > > Begin forwarded message:
> > > > > > > >
> > > > > > > > > From: Prem Sahoo <prem.re...@gmail.com>
> > > > > > > > > Date: April 23, 2024 at 11:03:48 AM EDT
> > > > > > > > > To: dev-ow...@arrow.apache.org
> > > > > > > > > Subject: Re: PyArrow Using Parquet V2
> > > > > > > > >
> > > > > > > > > dev@arrow.apache.org
> > > > > > > > > Sent from my iPhone
> > > > > > > > >
> > > > > > > > >>> On Apr 23, 2024, at 6:25 AM, Prem Sahoo <
> > > prem.re...@gmail.com>
> > > > > > > wrote:
> > > > > > > > >>>
> > > > > > > > >> Hello Team,
> > > > > > > > >> Could anyone please help me on below query?
> > > > > > > > >> Sent from my iPhone
> > > > > > > > >>
> > > > > > > > >>>> On Apr 22, 2024, at 10:01 PM, Prem Sahoo <
> > > > prem.re...@gmail.com>
> > > > > > > wrote:
> > > > > > > > >>>>
> > > > > > > > >>> 
> > > > > > > > >>> Sent from my iPhone
> > > > > > > > >>>
> > > > > > > > >>>>> On Apr 22, 2024, at 9:51 PM, Prem Sahoo <
> > > > prem.re...@gmail.com>
> > > > > > > wrote:
> > > > > > > > >>>>>
> > > > > > > > >>>> 
> > > > > > > > >>>>
> > > > > > > > >>>>>
> > > > > > > > >>>>> 
> > > > > > > > > >>>>> Hello Team,
> > > > > > > > > >>>>> I have a question regarding Parquet V2 writing through
> > > > > > > > > >>>>> pyarrow.
> > > > > > > > > >>>>> As per below, Pyarrow started writing Parquet in V2
> > > > > > > > > >>>>> encoding.
> > > > > > > > > >>>>>
> > > > > > > > > >>>>> https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html#pyarrow.parquet.write_table
> > > > > > > > >>>>>
> > > > > > > > > >>>>> version{"1.0", "2.4", "2.6"}, default "2.6"
> > > > > > > > > >>>>> Determine which Parquet logical types are available for
> > > > > > > > > >>>>> use, whether the reduced set from the Parquet 1.x.x
> > > > > > > > > >>>>> format or the expanded logical types added in later
> > > > > > > > > >>>>> format versions. Files written with version='2.4' or
> > > > > > > > > >>>>> '2.6' may not be readable in all Parquet
> > > > > > > > > >>>>> implementations, so version='1.0' is likely the choice
> > > > > > > > > >>>>> that maximizes file compatibility. UINT32 and some
> > > > > > > > > >>>>> logical types are only available with version '2.4'.
> > > > > > > > > >>>>> Nanosecond timestamps are only available with version
> > > > > > > > > >>>>> '2.6'. Other features such as compression algorithms or
> > > > > > > > > >>>>> the new serialized data page format must be enabled
> > > > > > > > > >>>>> separately (see 'compression' and 'data_page_version').
> > > > > > > > >>>>>
> > > > > > > > >>>>>
> > > > > > > > > >>>>> As per Apache Parquet Community Parquet V2 is not final
> > > > > > > > > >>>>> yet so it is not official. They are advising not to use
> > > > > > > > > >>>>> Parquet V2 for writing (though code is available).
> > > > > > > > > >>>>>
> > > > > > > > > >>>>> As per above Spark hasn't started using Parquet V2 for
> > > > > > > > > >>>>> writing.
> > > > > > > > > >>>>> May I know how an unstable/unofficial version is being
> > > > > > > > > >>>>> used in pyarrow?
> > > > > > > > >>>>>
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
