Re: [DISCUSS] Improvements to File Footer metadata (v3 discussion follow-up)

Alkis Evlogimenos Wed, 05 Jun 2024 12:12:57 -0700

The thrift extension can allow us to append a flatbuffer footer to the
thrift FileMetaData as described here:
https://github.com/apache/parquet-format/pull/254/files#diff-b335630551682c19a781afebcf4d07bf978fb1f8ac04c6bf87428ed5106870f5R300-R318
.


Then all readers will skip the flatbuffer footer. The reader(s) that know
about the flatbuffer footer will use that in practice. The point of the
extension mechanism is to allow different organizations to do such
experiments without confusing other readers.

On Wed, Jun 5, 2024 at 7:07 PM Jan Finis <[email protected]> wrote:

> How would a thrift extension help if we'll be moving to FlatBuffers for
> metadata? How do the two things work together? How are we planning to
> extend FlatBuffers?
>
>
>
> Am Mi., 5. Juni 2024 um 18:48 Uhr schrieb Micah Kornfield <
> [email protected]>:
>
> > >
> > > 1. ratify https://github.com/apache/parquet-format/pull/254 as the
> > > extension mechanism for parquet. With this we can experiment on new
> > footers
> > > without having to specify anything else.
> >
> >
> > I think we have probably reached a lazy consensus that is reasonable. I
> > think I misspoke earlier but we should at least have parquet-java and a
> > second implementation showing that we can write out the arbitrary bytes
> > without too much issue (and also read the a file that is written in this
> > format).  Alkis would you be able to do this?
> >
> >
> > 3. collaborate on a couple of prototypes, test them in production and
> come
> > > up with a report advocating for their inclusion to parquet proper. With
> > (1)
> > > in place these experiments/prototypes can be done in parallel and
> tested
> > by
> > > different organizations without coupling
> >
> >
> > I think it makes sense to timebox.  Did you have any thoughts on the
> > duration of experimentation?
> >
> > 4. decide which candidate is made official announce the migration path
> > > (deprecate old footers and give timeline for stopping the emission of
> > dual
> > > footers)
> >
> > I hope the strawman proposal on feature releases [1] can be refined and
> > applied to this case.
> >
> >
> > Thanks,
> > Micah
> >
> > [1] https://github.com/apache/parquet-format/pull/258
> >
> >
> >
> > On Wed, Jun 5, 2024 at 2:47 AM Antoine Pitrou <[email protected]>
> wrote:
> >
> > >
> > > Google docs tend to get lost very quickly. My experience with the
> > > Python PEP process leads me to a preference for a .md file in the repo,
> > > that can be collectively owned and rely on regular GH-based review
> > > tools.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> > >
> > > On Tue, 4 Jun 2024 18:52:42 -0700
> > > Julien Le Dem <[email protected]> wrote:
> > > > I agree that flatbuffer is a good option if we are happy with the
> perf
> > > and
> > > > it let's access column metadata in O(1) without reading other
> columns.
> > > > If we're going to make an incompatible metadata change, let's make it
> > > once
> > > > with a transition path to easily move from PAR1 to PAR3 letting them
> > > > coexist in a backward compatible phase for a while.
> > > >
> > > > I think that before voting on this, we should summarize in a doc the
> > > whole
> > > > PAR3 footer metadata discussion:
> > > > 1) Goals: (O(1) random access, extensibility, ...)
> > > > 2) preferred option
> > > > 3) migration path.
> > > > 4) mention other options we considered and why we didn't pick them
> > (this
> > > > doesn't have to be extensive)
> > > >
> > > > That will make it easier for people who are impacted but haven't
> > actively
> > > > contributed to this discussion so far to review and chime in.
> > > > This is a big change with large potential impact here.
> > > >
> > > > Do people prefer google doc or a PR with a .md for this? I personally
> > > like
> > > > google docs (we can copy it in the repo after approval)
> > > >
> > > > Julien
> > > >
> > > >
> > > > On Tue, Jun 4, 2024 at 1:53 AM Alkis Evlogimenos
> > > > <[email protected]> wrote:
> > > >
> > > > > >
> > > > > > Finally, one point I wanted to highlight here (I also mentioned
> it
> > > in the
> > > > > > PR): If we want random access, we have to abolish the concept
> that
> > > the
> > > > > data
> > > > > > in the columns array is in a different order than in the schema.
> > > Your PR
> > > > > > [1] even added a new field schema_index for matching between
> > > > > > ColumnMetaData and schema position, but this kills random access.
> > If
> > > I
> > > > > want
> > > > > > to read the third column in the schema, then do a O(1) random
> > access
> > > into
> > > > > > the third column chunk only to notice that it's schema index is
> > > totally
> > > > > > different and therefore I need a full exhaustive search to find
> the
> > > > > column
> > > > > > that actually belongs to the third column in the schema, then all
> > our
> > > > > > random access efforts are in vain.
> > > > >
> > > > >
> > > > > `schema_index` is useful to implement
> > > > > https://issues.apache.org/jira/browse/PARQUET-183 which is more
> and
> > > more
> > > > > prevalent as schemata become wider.
> > > > >
> > > > > On Mon, Jun 3, 2024 at 5:54 PM Micah Kornfield <
> > [email protected]>
> > > > > wrote:
> > > > >
> > > > > > Thanks everyone for chiming in.  Some responses inline:
> > > > > >
> > > > > > >
> > > > > > > The thrift
> > > > > > > decoder just has to be invoked recursively whenever such a lazy
> > > field
> > > > > is
> > > > > > > required. This is nice, but since it doesn't give us random
> > access
> > > into
> > > > > > > lists, it's also only partially helpful.
> > > > > >
> > > > > > This point is moot if we move to flatbuffers but I think part of
> > the
> > > > > > proposals are either using list<binary> or providing arrow-like
> > > offsets
> > > > > > into the serialized binary to support random access of elements.
> > > > > >
> > > > > >
> > > > > > > I don't fully understand this point, can you elaborate on it.
> It
> > > feels
> > > > > > like
> > > > > > > a non-issue or a super edge case to me. Is this just a DuckDB
> > > issue? If
> > > > > > so,
> > > > > > > I am very sure they're happy to change this, as they're quite
> > > active
> > > > > and
> > > > > > > also strive for simplicity and I would argue that exposing
> thrift
> > > > > > directly
> > > > > > > isn't that.
> > > > > >
> > > > > > IIUC, I don't think Thrift is public from an end-user
> perspective.
> > > It is
> > > > > > however public in the fact that internally DuckDB exposes the
> > Thrift
> > > > > > structs directly to consuming code.
> > > > > >
> > > > > > * I don't think there is value in providing a 1-to-1 mapping from
> > > the
> > > > > > >   old footer encoding to the new encoding. On the contrary,
> this
> > > is the
> > > > > > >   opportunity to clean up and correct some of the oddities that
> > > have
> > > > > > >   accumulated in the past.
> > > > > >
> > > > > > I think I should clarify this, as I see a few distinct cases
> here:
> > > > > >
> > > > > > 1.  Removing duplication/redundancy that accumulated over the
> years
> > > for
> > > > > > backwards compatibility.
> > > > > > 2.  Removing fields that were never used in practice.
> > > > > > 3.  Changing the layout of fields (e.g. moving from array of
> > structs
> > > to
> > > > > > struct of arrays) for performance considerations.
> > > > > > 4.  Writing potentially less metadata (e.g. summarization of
> > metadata
> > > > > > today).
> > > > > >
> > > > > > IMO, I think we should be doing 1,2, and 3.  I don't think we
> > should
> > > be
> > > > > > doing 4 (e.g. as a concrete example, see the discussion on
> > > > > > PageEncodingStats [1]).
> > > > > >
> > > > > > If we want random access, we have to abolish the concept that the
> > > data
> > > > > > > in the columns array is in a different order than in the
> schema.
> > > Your
> > > > > PR
> > > > > > > [1] even added a new field schema_index for matching between
> > > > > > ColumnMetaData
> > > > > > > and schema position, but this kills random access.
> > > > > >
> > > > > >
> > > > > > I think this is a larger discussion that should be split off, as
> I
> > > don't
> > > > > > think it should block the core work here.  This was adapted from
> > > another
> > > > > > proposal, that I think had different ideas on how possible rework
> > > column
> > > > > > selection (it seems this would be on a per RowGroup basis).
> > > > > >
> > > > > > [1]
> > > https://github.com/apache/parquet-format/pull/250/files#r1620984136
> > > > > >
> > > > > >
> > > > > > On Mon, Jun 3, 2024 at 8:20 AM Antoine Pitrou <
> [email protected]>
> > > > > wrote:
> > > > > >
> > > > > > >
> > > > > > > Everything Jan said below aligns closely with my opinion.
> > > > > > >
> > > > > > > * +1 for going directly to Flatbuffers for the new footer
> format
> > > *if*
> > > > > > >   there is a general agreement that moving to Flatbuffers at
> some
> > > point
> > > > > > >   is desirable (including from a software ecosystem point of
> > view).
> > > > > > >
> > > > > > > * I don't think there is value in providing a 1-to-1 mapping
> from
> > > the
> > > > > > >   old footer encoding to the new encoding. On the contrary,
> this
> > > is the
> > > > > > >   opportunity to clean up and correct some of the oddities that
> > > have
> > > > > > >   accumulated in the past.
> > > > > > >
> > > > > > > Regards
> > > > > > >
> > > > > > > Antoine.
> > > > > > >
> > > > > > >
> > > > > > > On Mon, 3 Jun 2024 15:58:40 +0200
> > > > > > > Jan Finis <[email protected]> wrote:
> > > > > > > > Interesting discussion so far, thanks for driving this
> Micah! A
> > > few
> > > > > > > points
> > > > > > > > from my side:
> > > > > > > >
> > > > > > > > When considering flatbuffers vs. lazy "binary" nested thrift,
> > > vs. own
> > > > > > > > MetaDataPage format, let's also keep architectural simplicity
> > > in
> > > > > mind.
> > > > > > > >
> > > > > > > > For example, introducing flatbuffers might sound like a big
> > > change at
> > > > > > > > first, but at least it is then *one format* for everything.
> In
> > > > > > contrast,
> > > > > > > > thrift + custom MetaDataPage is two formats. My gut feeling
> > > estimate
> > > > > > > > would be that it is probably easier to just introduce a
> > > flatbuffers
> > > > > > > reader
> > > > > > > > instead of special casing some thrift to instead need a
> custom
> > > > > > > MetaDataPage
> > > > > > > > reader.
> > > > > > > >
> > > > > > > > The lazy thrift "hack" is something in between the two. It is
> > > > > probably
> > > > > > > the
> > > > > > > > easiest to adopt, as no new reading logic needs to be
> written.
> > > The
> > > > > > thrift
> > > > > > > > decoder just has to be invoked recursively whenever such a
> lazy
> > > field
> > > > > > is
> > > > > > > > required. This is nice, but since it doesn't give us random
> > > access
> > > > > into
> > > > > > > > lists, it's also only partially helpful.
> > > > > > > >
> > > > > > > > Given all this, from the implementation / architectural
> > > cleanliness
> > > > > > > side, I
> > > > > > > > guess I would prefer just using flatbuffers, unless we find
> big
> > > > > > > > disadvantages with this. This also brings us closer to Arrow,
> > > > > although
> > > > > > > > that's not too important here.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > 1.  I think for an initial revision of metadata we should
> > make
> > > it
> > > > > > > possible
> > > > > > > > > to have a 1:1 mapping between PAR1 footers and whatever is
> > > included
> > > > > > in
> > > > > > > the
> > > > > > > > > new footer.  The rationale for this is to let
> implementations
> > > that
> > > > > > > haven't
> > > > > > > > > abstracted out thrift structures an easy path to
> > incorporating
> > > the
> > > > > > new
> > > > > > > > > footer (i.e. just do translation at the boundaries).
> > > > > > > > >
> > > > > > > >
> > > > > > > > I don't fully understand this point, can you elaborate on it.
> > > It
> > > > > feels
> > > > > > > like
> > > > > > > > a non-issue or a super edge case to me. Is this just a DuckDB
> > > issue?
> > > > > If
> > > > > > > so,
> > > > > > > > I am very sure they're happy to change this, as they're quite
> > > active
> > > > > > and
> > > > > > > > also strive for simplicity and I would argue that exposing
> > > thrift
> > > > > > > directly
> > > > > > > > isn't that. Our database also allows metadata access in SQL,
> > but
> > > we
> > > > > > > > transcode the thrift into JSON. Given that JSON is pretty
> > > standard in
> > > > > > > > databases while thrift isn't, I'm sure DuckDB devs will see
> it
> > > the
> > > > > > same.
> > > > > > > >
> > > > > > > >
> > > > > > > > Finally, one point I wanted to highlight here (I also
> mentioned
> > > it in
> > > > > > the
> > > > > > > > PR): If we want random access, we have to abolish the concept
> > > that
> > > > > the
> > > > > > > data
> > > > > > > > in the columns array is in a different order than in the
> > schema.
> > > Your
> > > > > > PR
> > > > > > > > [1] even added a new field schema_index for matching between
> > > > > > > ColumnMetaData
> > > > > > > > and schema position, but this kills random access. If I want
> to
> > > read
> > > > > > the
> > > > > > > > third column in the schema, then do a O(1) random access into
> > > the
> > > > > third
> > > > > > > > column chunk only to notice that it's schema index is totally
> > > > > different
> > > > > > > and
> > > > > > > > therefore I need a full exhaustive search to find the column
> > > that
> > > > > > > actually
> > > > > > > > belongs to the third column in the schema, then all our
> random
> > > access
> > > > > > > > efforts are in vain.
> > > > > > > >
> > > > > > > > Therefore, the only possible way to make random access useful
> > is
> > > to
> > > > > > > mandate
> > > > > > > > that ColumnMetaData in the columns list has to be in exactly
> > the
> > > same
> > > > > > > order
> > > > > > > > in which the columns appear in the schema.
> > > > > > > >
> > > > > > > > Cheers,
> > > > > > > > Jan
> > > > > > > >
> > > > > > > > [1] https://github.com/apache/parquet-format/pull/250
> > > > > > > >
> > > > > > > >
> > > > > > > > Am Sa., 1. Juni 2024 um 10:38 Uhr schrieb Micah Kornfield <
> > > > > > > > [email protected]>:
> > > > > > > >
> > > > > > > > > As an update here/some responses.  Alkis [3] is making
> > > considerable
> > > > > > > > > progress on a Flatbuffer alternative that shows good
> > > performance
> > > > > > > benchmarks
> > > > > > > > > on some real sample footers (and hopefully soon some
> > synthetic
> > > data
> > > > > > > from
> > > > > > > > > Rok).
> > > > > > > > >
> > > > > > > > > The approaches that currently have public PRs [1][2] IIUC
> > > mostly
> > > > > save
> > > > > > > time
> > > > > > > > > by lazily decompressing thrift metadata (some of the
> details
> > > differ
> > > > > > > but it
> > > > > > > > > is effectively the same mechanism).  This helps for cases
> > > when
> > > > > only a
> > > > > > > few
> > > > > > > > > row groups/columns are needed but in the limit has the same
> > > > > > theoretical
> > > > > > > > > performance penalties for full table reads.
> > > > > > > > >
> > > > > > > > > I would like to get people's take on two points:
> > > > > > > > > 1.  I think for an initial revision of metadata we should
> > make
> > > it
> > > > > > > possible
> > > > > > > > > to have a 1:1 mapping between PAR1 footers and whatever is
> > > included
> > > > > > in
> > > > > > > the
> > > > > > > > > new footer.  The rationale for this is to let
> implementations
> > > that
> > > > > > > haven't
> > > > > > > > > abstracted out thrift structures an easy path to
> > incorporating
> > > the
> > > > > > new
> > > > > > > > > footer (i.e. just do translation at the boundaries).
> > > > > > > > > 2.  Do people see value in trying to do a Thrift only
> > > iteration
> > > > > which
> > > > > > > > > addresses the use-case of scanning only a select number of
> > row
> > > > > > > > > groups/columns?  Or if Flatbuffers offer an overall better
> > > > > > performance
> > > > > > > > > should we jump to using it?
> > > > > > > > >
> > > > > > > > > After processing the comments I think we might want to
> > discuss
> > > the
> > > > > > > > > > extension point
> > > > > https://github.com/apache/parquet-format/pull/254
> > > > > > > > > >  separately.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > I think this is already getting reviewed (I also think we
> > > touched
> > > > > on
> > > > > > > it in
> > > > > > > > > the extensibility thread).  Since this is really just
> > defining
> > > how
> > > > > we
> > > > > > > can
> > > > > > > > > encapsulate data and doesn't involve any upfront work, I
> > think
> > > once
> > > > > > > > > everyone has had a chance to comment on it we can hopefully
> > > hold a
> > > > > > > vote on
> > > > > > > > > it (hopefully in the next week or 2).  I think the only
> other
> > > > > viable
> > > > > > > > > alternative is what is proposed in [2] which doesn't
> involve
> > > any
> > > > > > > mucking
> > > > > > > > > with Thrift bytes but poses a slightly larger compatibility
> > > risk.
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Micah
> > > > > > > > >
> > > > > > > > > [1] https://github.com/apache/parquet-format/pull/242
> > > > > > > > > [2] https://github.com/apache/parquet-format/pull/250
> > > > > > > > > [3]
> > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > >
> >
> https://github.com/apache/parquet-format/pull/250#pullrequestreview-2091174869
> > >
> > > > > > > > >
> > > > > > > > > On Thu, May 30, 2024 at 7:21 AM Alkis Evlogimenos <
> > > > > > > > > [email protected]> wrote:
> > > > > > > > >
> > > > > > > > > > Thank you for summarizing Micah and thanks to everyone
> > > commenting
> > > > > > on
> > > > > > > the
> > > > > > > > > > proposal and PRs.
> > > > > > > > > >
> > > > > > > > > > After processing the comments I think we might want to
> > > discuss
> > > > > the
> > > > > > > > > > extension point
> > > > > https://github.com/apache/parquet-format/pull/254
> > > > > > > > > > separately.
> > > > > > > > > >
> > > > > > > > > > The extension point will allow vendors to experiment on
> > > different
> > > > > > > > > metadata
> > > > > > > > > > (be it FileMetaData, or ColumnMetaData etc) and when a
> > > design is
> > > > > > > ready
> > > > > > > > > and
> > > > > > > > > > validated in large scale, it can be discussed for
> inclusion
> > > to
> > > > > the
> > > > > > > > > official
> > > > > > > > > > specification.
> > > > > > > > > >
> > > > > > > > > > On Thu, May 30, 2024 at 9:37 AM Micah Kornfield <
> > > > > > > [email protected]>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > >> As an update Alkis wrote up a nice summary of his
> thoughts
> > > > > [1][2].
> > > > > > > > > >>
> > > > > > > > > >> I updated my PR <
> > > > > > https://github.com/apache/parquet-format/pull/250>
> > > > > > > > > >> [3] to be more complete.  At a high-level (for those
> that
> > > have
> > > > > > > already
> > > > > > > > > >> reviewed):
> > > > > > > > > >> 1. I converted more fields to use page-encoding (or
> added
> > > a
> > > > > binary
> > > > > > > field
> > > > > > > > > >> for thrift serialized encoding when they are expected to
> > > be
> > > > > > small).
> > > > > > > > > >> This might be overdone (happy for this feedback to
> > debate).
> > > > > > > > > >> 2.  I removed the concept of an external data page for
> the
> > > sake
> > > > > of
> > > > > > > > > trying
> > > > > > > > > >> to remove design options (we should still benchmark
> this).
> > > It
> > > > > also
> > > > > > > I
> > > > > > > > > think
> > > > > > > > > >> eases implementation burden (more on this below).
> > > > > > > > > >> 3.  Removed the new encoding.
> > > > > > > > > >> 4.  I think this is still missing some of the exact
> > changes
> > > from
> > > > > > > other
> > > > > > > > > >> PRs, some of those might be in error (please highlight
> > > them) and
> > > > > > > some
> > > > > > > > > are
> > > > > > > > > >> because I hope the individual PRs (i.e. the statistics
> > > change
> > > > > that
> > > > > > > Alkis
> > > > > > > > > >> proposed can get merged before any proposal)
> > > > > > > > > >>
> > > > > > > > > >> Regarding embedding PAR3 embedding, Alkis's doc [1]
> > > highlights
> > > > > > > another
> > > > > > > > > >> option for doing this that might be more robust but
> > > slightly
> > > > > more
> > > > > > > > > >> complicated.
> > > > > > > > > >>
> > > > > > > > > >> I think in terms of items already discussed, whether to
> > try
> > > to
> > > > > > reuse
> > > > > > > > > >> existing structures or use new structures (Alkis is
> > > proposing
> > > > > > going
> > > > > > > > > >> straight to flatbuffers in this regard IIUC after some
> > > more
> > > > > > tactical
> > > > > > > > > >> changes).  I think another point raised is the problem
> > with
> > > new
> > > > > > > > > structures
> > > > > > > > > >> is they require implementations (e.g. DuckDB) that do
> not
> > > > > > > encapsulate
> > > > > > > > > >> Thrift well to make potentially much larger structural
> > > changes.
> > > > > > > The
> > > > > > > > > way I
> > > > > > > > > >> tried to approach it in my PR is it should be O(days)
> work
> > > to
> > > > > take
> > > > > > > a
> > > > > > > > > PAR3
> > > > > > > > > >> footer and convert it back to PAR1, which will hopefully
> > > allow
> > > > > > other
> > > > > > > > > >> Parquet parsers in the ecosystems to at least get
> > > incorporated
> > > > > > > sooner
> > > > > > > > > even
> > > > > > > > > >> if no performance benefits are seen.
> > > > > > > > > >>
> > > > > > > > > >> Quoting from a separate thread that Alkis Started:
> > > > > > > > > >>
> > > > > > > > > >> 3 is important if we strongly believe that we can get
> the
> > > best
> > > > > > > design
> > > > > > > > > >>> through testing prototypes on real data and measuring
> the
> > > > > effects
> > > > > > > vs
> > > > > > > > > >>> designing changes in PRs. Along the same lines, I am
> > > requesting
> > > > > > > that
> > > > > > > > > you
> > > > > > > > > >>> ask through your contacts/customers (I will do the
> same)
> > > for
> > > > > > > scrubbed
> > > > > > > > > >>> footers of particular interest (wide, deep, etc) so
> that
> > > we can
> > > > > > > build a
> > > > > > > > > >>> set
> > > > > > > > > >>> of real footers on which we can run benchmarks and
> drive
> > > design
> > > > > > > > > >>> decisions.
> > > > > > > > > >>
> > > > > > > > > >>
> > > > > > > > > >> I agree with this sentiment. I think some others who
> have
> > > > > > > volunteered to
> > > > > > > > > >> work on this have such data and I will see what I can do
> > on
> > > my
> > > > > > > end.  I
> > > > > > > > > >> think we should hold off more drastic
> changes/improvements
> > > until
> > > > > > we
> > > > > > > can
> > > > > > > > > get
> > > > > > > > > >> better metrics.  But I also don't think we should let
> the
> > > "best"
> > > > > > be
> > > > > > > the
> > > > > > > > > >> enemy of the "good".  I hope we can ship a PAR3 footer
> > > sooner
> > > > > that
> > > > > > > gets
> > > > > > > > > us
> > > > > > > > > >> a large improvement over the status quo and have it
> > > adopted
> > > > > fairly
> > > > > > > > > widely
> > > > > > > > > >> sooner rather than waiting for an optimal design.  I
> also
> > > agree
> > > > > > > leaving
> > > > > > > > > >> room for experimentation is a good idea (I think this
> can
> > > > > probably
> > > > > > > be
> > > > > > > > > done
> > > > > > > > > >> by combining the methods for embedding that have already
> > > been
> > > > > > > discussed
> > > > > > > > > to
> > > > > > > > > >> allow potentially 2 embedded footers).
> > > > > > > > > >>
> > > > > > > > > >> I think another question that Alkis's proposals raised
> is
> > > how
> > > > > > > policies
> > > > > > > > > on
> > > > > > > > > >> deprecation of fields (especially ones that are
> currently
> > > > > required
> > > > > > > in
> > > > > > > > > >> PAR1).  I think this is probably a better topic for
> > > another
> > > > > > thread,
> > > > > > > I'll
> > > > > > > > > >> try to write a PR formalizing a proposal on feature
> > > evolution.
> > > > > > > > > >>
> > > > > > > > > >>
> > > > > > > > > >>
> > > > > > > > > >> [1]
> > > > > > > > > >>
> > > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > >
> >
> https://docs.google.com/document/d/1PQpY418LkIDHMFYCY8ne_G-CFpThK15LLpzWYbc7rFU/edit
> > >
> > > > > > >
> > > > > > > > > >> [2]
> > > > > > >
> https://lists.apache.org/thread/zdpswrd4yxrj845rmoopqozhk0vrm6vo
> > > > > > > > > >> [3] https://github.com/apache/parquet-format/pull/250
> > > > > > > > > >>
> > > > > > > > > >> On Tue, May 28, 2024 at 10:56 AM Micah Kornfield <
> > > > > > > [email protected]
> > > > > > > > > >
> > > > > > > > > >> wrote:
> > > > > > > > > >>
> > > > > > > > > >>> Hi Antoine,
> > > > > > > > > >>> Thanks for the great points.  Responses inline.
> > > > > > > > > >>>
> > > > > > > > > >>>
> > > > > > > > > >>>> I like your attempt to put the "new" file metadata
> after
> > > the
> > > > > > > legacy
> > > > > > > > > >>>> one in
> > https://github.com/apache/parquet-format/pull/250,
> > >
> > > > > and I
> > > > > > > hope
> > > > > > > > > it
> > > > > > > > > >>>> can actually be made to work (it requires current
> > > Parquet
> > > > > > readers
> > > > > > > to
> > > > > > > > > >>>> allow/ignore arbitrary padding at the end of the v1
> > > Thrift
> > > > > > > metadata).
> > > > > > > > > >>>
> > > > > > > > > >>>
> > > > > > > > > >>> Thanks (I hope so too).  I think the idea is originally
> > > from
> > > > > > > Alkis.  If
> > > > > > > > > >>> it doesn't work then there is always an option of
> doing a
> > > > > little
> > > > > > > more
> > > > > > > > > >>> involved process of making the footer look like an
> > > unknown
> > > > > binary
> > > > > > > > > field (an
> > > > > > > > > >>> approach I know you have objections to).
> > > > > > > > > >>>
> > > > > > > > > >>> I'm biased, but I find it much cleaner to define new
> > > Thrift
> > > > > > > > > >>>>   structures (FileMetadataV3, etc.), rather than
> > > painstakinly
> > > > > > > document
> > > > > > > > > >>>>   which fields are to be omitted in V3. That would
> > > achieve
> > > > > three
> > > > > > > > > goals:
> > > > > > > > > >>>>   1) make the spec easier to read (even though it
> would
> > > be
> > > > > > > physically
> > > > > > > > > >>>>   longer); 2) make it easier to produce a conformant
> > > > > > > implementation
> > > > > > > > > >>>>   (special rules increase the risks of
> misunderstandings
> > > and
> > > > > > > > > >>>>   disagreements); 3) allow a later cleanup of the spec
> > > once we
> > > > > > > agree
> > > > > > > > > to
> > > > > > > > > >>>>   get rid of V1 structs.
> > > > > > > > > >>>
> > > > > > > > > >>> There are trade-offs here.  I agree with the benefits
> you
> > > > > listed
> > > > > > > here.
> > > > > > > > > >>> The benefits of reusing existing structs are:
> > > > > > > > > >>> 1. Lowers the amount of boiler plate code mapping from
> > one
> > > to
> > > > > the
> > > > > > > other
> > > > > > > > > >>> (i.e. simpler initial implementation), since I expect
> it
> > > will
> > > > > be
> > > > > > > a
> > > > > > > > > while
> > > > > > > > > >>> before we have standalone PAR3 files.
> > > > > > > > > >>> 2. Allows for lower maintenance burden if there is
> useful
> > > new
> > > > > > > metadata
> > > > > > > > > >>> that we would like to see added to both structures
> > > original and
> > > > > > > "V3"
> > > > > > > > > >>> structures.
> > > > > > > > > >>>
> > > > > > > > > >>> - The new encoding in that PR seems like it should be
> > > moved to
> > > > > a
> > > > > > > > > >>>>   separate PR and be discussed in the encodings
> thread?
> > > > > > > > > >>>
> > > > > > > > > >>>
> > > > > > > > > >>> I'll cross post on that thread.  The main reason I
> > > included it
> > > > > in
> > > > > > > my
> > > > > > > > > >>> proposal is I think it provides random access for
> members
> > > out
> > > > > of
> > > > > > > the
> > > > > > > > > box
> > > > > > > > > >>> (as compared to the existing encodings).  I think this
> > > mostly
> > > > > > goes
> > > > > > > to
> > > > > > > > > your
> > > > > > > > > >>> third-point so I'll discuss below.
> > > > > > > > > >>>
> > > > > > > > > >>> - I'm a bit skeptical about moving Thrift lists into
> data
> > > > > pages,
> > > > > > > rather
> > > > > > > > > >>>>   than, say, just embed the corresponding Thrift
> > > serialization
> > > > > > as
> > > > > > > > > >>>>   binary fields for lazy deserialization.
> > > > > > > > > >>>
> > > > > > > > > >>> I think this falls into 2 different concerns:
> > > > > > > > > >>> 1.  The format of how we serialize metadata.
> > > > > > > > > >>> 2.  Where the serialized metadata lives.
> > > > > > > > > >>>
> > > > > > > > > >>> For concern #1, I think we should be considering
> treating
> > > these
> > > > > > > lists
> > > > > > > > > as
> > > > > > > > > >>> actual parquet data pages.  This allows users to tune
> > this
> > > to
> > > > > > > their
> > > > > > > > > needs
> > > > > > > > > >>> for size vs decoding speed, and make use of any
> > > improvements to
> > > > > > > > > encoding
> > > > > > > > > >>> that happen in the future without a spec change. I
> think
> > > this
> > > > > is
> > > > > > > likely
> > > > > > > > > >>> fairly valuable given the number of systems that cache
> > > this
> > > > > data.
> > > > > > > The
> > > > > > > > > >>> reason I introduced the new encoding was to provide an
> > > option
> > > > > > > that
> > > > > > > > > could be
> > > > > > > > > >>> as efficient as possible from a compute perspective.
> > > > > > > > > >>>
> > > > > > > > > >>> For concern #2, there is no reason encoding a page as a
> > > thrift
> > > > > > > Binary
> > > > > > > > > >>> field would not work. The main reason I raised putting
> > > them
> > > > > > > outside of
> > > > > > > > > >>> thrift is for greater control on deserialization (the
> > > main
> > > > > > > benefit
> > > > > > > > > being
> > > > > > > > > >>> avoiding copies) for implementations that have a Thrift
> > > parser
> > > > > > > that
> > > > > > > > > doesn't
> > > > > > > > > >>> allow these optimizations.  In terms of a path forward
> > > here, I
> > > > > > > think
> > > > > > > > > >>> understanding the performance and memory
> characteristics
> > > of
> > > > > each
> > > > > > > > > approach.
> > > > > > > > > >>> I agree, if there isn't substantial savings from having
> > > them be
> > > > > > > > > outside the
> > > > > > > > > >>> page, then it just adds complexity.
> > > > > > > > > >>>
> > > > > > > > > >>> Thanks,
> > > > > > > > > >>> Micah
> > > > > > > > > >>>
> > > > > > > > > >>>
> > > > > > > > > >>>
> > > > > > > > > >>>
> > > > > > > > > >>>
> > > > > > > > > >>> On Tue, May 28, 2024 at 7:06 AM Antoine Pitrou <
> > > > > > [email protected]
> > > > > > > >
> > > > > > > > > >>> wrote:
> > > > > > > > > >>>
> > > > > > > > > >>>>
> > > > > > > > > >>>> Hello Micah,
> > > > > > > > > >>>>
> > > > > > > > > >>>> First, kudos for doing this!
> > > > > > > > > >>>>
> > > > > > > > > >>>> I like your attempt to put the "new" file metadata
> after
> > > the
> > > > > > > legacy
> > > > > > > > > >>>> one in
> > https://github.com/apache/parquet-format/pull/250,
> > >
> > > > > and I
> > > > > > > hope
> > > > > > > > > it
> > > > > > > > > >>>> can actually be made to work (it requires current
> > > Parquet
> > > > > > readers
> > > > > > > to
> > > > > > > > > >>>> allow/ignore arbitrary padding at the end of the v1
> > > Thrift
> > > > > > > metadata).
> > > > > > > > > >>>>
> > > > > > > > > >>>> Some assorted comments on other changes that PR is
> > doing:
> > > > > > > > > >>>>
> > > > > > > > > >>>> - I'm biased, but I find it much cleaner to define new
> > > Thrift
> > > > > > > > > >>>>   structures (FileMetadataV3, etc.), rather than
> > > painstakinly
> > > > > > > document
> > > > > > > > > >>>>   which fields are to be omitted in V3. That would
> > > achieve
> > > > > three
> > > > > > > > > goals:
> > > > > > > > > >>>>   1) make the spec easier to read (even though it
> would
> > > be
> > > > > > > physically
> > > > > > > > > >>>>   longer); 2) make it easier to produce a conformant
> > > > > > > implementation
> > > > > > > > > >>>>   (special rules increase the risks of
> misunderstandings
> > > and
> > > > > > > > > >>>>   disagreements); 3) allow a later cleanup of the spec
> > > once we
> > > > > > > agree
> > > > > > > > > to
> > > > > > > > > >>>>   get rid of V1 structs.
> > > > > > > > > >>>>
> > > > > > > > > >>>> - The new encoding in that PR seems like it should be
> > > moved
> > > > > to a
> > > > > > > > > >>>>   separate PR and be discussed in the encodings
> thread?
> > > > > > > > > >>>>
> > > > > > > > > >>>> - I'm a bit skeptical about moving Thrift lists into
> > > data
> > > > > pages,
> > > > > > > > > rather
> > > > > > > > > >>>>   than, say, just embed the corresponding Thrift
> > > serialization
> > > > > > as
> > > > > > > > > >>>>   binary fields for lazy deserialization.
> > > > > > > > > >>>>
> > > > > > > > > >>>> Regards
> > > > > > > > > >>>>
> > > > > > > > > >>>> Antoine.
> > > > > > > > > >>>>
> > > > > > > > > >>>>
> > > > > > > > > >>>>
> > > > > > > > > >>>> On Mon, 27 May 2024 23:06:37 -0700
> > > > > > > > > >>>> Micah Kornfield <
> > > > > > >
> > >
> >
> emkornfield-re5jqeeqqe8avxtiumwx3w-xmd5yjdbdmrexy1tmh2...@public.gmane.org
> > >
> > >
> > > > > > > > > >>>> wrote:
> > > > > > > > > >>>> > As a follow-up to the "V3" Discussions [1][2] I
> wanted
> > > to
> > > > > > start
> > > > > > > a
> > > > > > > > > >>>> thread on
> > > > > > > > > >>>> > improvements to the footer metadata.
> > > > > > > > > >>>> >
> > > > > > > > > >>>> > Based on conversation so far, there have been a few
> > > > > proposals
> > > > > > > > > >>>> [3][4][5] to
> > > > > > > > > >>>> > help better support files with wide schemas and many
> > > > > > > row-groups.  I
> > > > > > > > > >>>> think
> > > > > > > > > >>>> > there are a lot of interesting ideas in each. It
> would
> > > be
> > > > > good
> > > > > > > to
> > > > > > > > > get
> > > > > > > > > >>>> > further feedback on these to make sure we aren't
> > > missing
> > > > > > > anything
> > > > > > > > > and
> > > > > > > > > >>>> > define a minimal first iteration for doing
> > > experimental
> > > > > > > benchmarking
> > > > > > > > > >>>> to
> > > > > > > > > >>>> > prove out an approach.
> > > > > > > > > >>>> >
> > > > > > > > > >>>> > I think the next steps would ideally be:
> > > > > > > > > >>>> > 1.  Come to a consensus on the overall approach.
> > > > > > > > > >>>> > 2.  Prototypes to Benchmark/test to validate the
> > > approaches
> > > > > > > defined
> > > > > > > > > >>>> (if we
> > > > > > > > > >>>> > can't come to consensus in item #1, this might help
> > > choose a
> > > > > > > > > >>>> direction).
> > > > > > > > > >>>> > 3.  Divide up any final approach into as
> fine-grained
> > > > > features
> > > > > > > as
> > > > > > > > > >>>> possible.
> > > > > > > > > >>>> > 4.  Implement across parquet-java, parquet-cpp,
> > > parquet-rs
> > > > > > (and
> > > > > > > any
> > > > > > > > > >>>> other
> > > > > > > > > >>>> > implementations that we can get volunteers for).
> > > > > > Additionally,
> > > > > > > if
> > > > > > > > > >>>> new APIs
> > > > > > > > > >>>> > are needed to make use of the new structure, it
> would
> > > be
> > > > > good
> > > > > > > to try
> > > > > > > > > >>>> to
> > > > > > > > > >>>> > prototype against consumers of Parquet.
> > > > > > > > > >>>> >
> > > > > > > > > >>>> > Knowing that we have enough people interested in
> doing
> > > #3 is
> > > > > > > > > critical
> > > > > > > > > >>>> to
> > > > > > > > > >>>> > success, so if you have time to devote, it would be
> > > helpful
> > > > > to
> > > > > > > chime
> > > > > > > > > >>>> in
> > > > > > > > > >>>> > here (I know some people already noted they could
> help
> > > in
> > > > > the
> > > > > > > > > original
> > > > > > > > > >>>> > thread).
> > > > > > > > > >>>> >
> > > > > > > > > >>>> > I think it is likely we will need either an in
> person
> > > sync
> > > > > or
> > > > > > > > > another
> > > > > > > > > >>>> more
> > > > > > > > > >>>> > focused design document could help. I am happy to
> try
> > > to
> > > > > > > facilitate
> > > > > > > > > >>>> this
> > > > > > > > > >>>> > (once we have a better sense of who wants to be
> > > involved and
> > > > > > > what
> > > > > > > > > time
> > > > > > > > > >>>> > zones they are in I can schedule a sync if
> necessary).
> > > > > > > > > >>>> >
> > > > > > > > > >>>> > Thanks,
> > > > > > > > > >>>> > Micah
> > > > > > > > > >>>> >
> > > > > > > > > >>>> > [1]
> > > > > > > > >
> > > https://lists.apache.org/thread/5jyhzkwyrjk9z52g0b49g31ygnz73gxo
> > > > > > > > > >>>> > [2]
> > > > > > > > > >>>> >
> > > > > > > > > >>>>
> > > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > >
> >
> https://docs.google.com/document/d/19hQLYcU5_r5nJB7GtnjfODLlSDiNS24GXAtKg9b0_ls/edit
> > >
> > > > > > >
> > > > > > > > > >>>> > [3]
> https://github.com/apache/parquet-format/pull/242
> > > > > > > > > >>>> > [4]
> https://github.com/apache/parquet-format/pull/248
> > > > > > > > > >>>> > [5]
> https://github.com/apache/parquet-format/pull/250
> > > > > > > > > >>>> >
> > > > > > > > > >>>>
> > > > > > > > > >>>>
> > > > > > > > > >>>>
> > > > > > > > > >>>>
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> > >
> > >
> > >
> >
>

Re: [DISCUSS] Improvements to File Footer metadata (v3 discussion follow-up)

Reply via email to