Thanks everyone for chiming in. Some responses inline:

> The thrift decoder just has to be invoked recursively whenever such a
> lazy field is required. This is nice, but since it doesn't give us
> random access into lists, it's also only partially helpful.
This point is moot if we move to Flatbuffers, but I think parts of the
proposals either use list<binary> or provide Arrow-like offsets into the
serialized binary to support random access to elements.

> I don't fully understand this point, can you elaborate on it. It feels
> like a non-issue or a super edge case to me. Is this just a DuckDB
> issue? If so, I am very sure they're happy to change this, as they're
> quite active and also strive for simplicity and I would argue that
> exposing thrift directly isn't that.

IIUC, Thrift is not public from an end-user perspective. It is, however,
public in the sense that DuckDB internally exposes the Thrift structs
directly to consuming code.

> * I don't think there is value in providing a 1-to-1 mapping from the
> old footer encoding to the new encoding. On the contrary, this is the
> opportunity to clean up and correct some of the oddities that have
> accumulated in the past.

I should clarify this, as I see a few distinct cases here:
1. Removing duplication/redundancy that accumulated over the years for
backwards compatibility.
2. Removing fields that were never used in practice.
3. Changing the layout of fields (e.g. moving from array of structs to
struct of arrays) for performance considerations.
4. Writing potentially less metadata (e.g. summarizing metadata that is
written in full today).
IMO, we should do 1, 2, and 3. I don't think we should do 4 (for a
concrete example, see the discussion on PageEncodingStats [1]).

> If we want random access, we have to abolish the concept that the data
> in the columns array is in a different order than in the schema. Your
> PR [1] even added a new field schema_index for matching between
> ColumnMetaData and schema position, but this kills random access.

I think this is a larger discussion that should be split off, as I don't
think it should block the core work here.
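To make the "Arrow-like offsets" idea above concrete, here is a minimal
sketch of O(1) access into a list of serialized elements. This is my own
illustration, not code from any of the PRs; all names and the byte
layout are hypothetical:

```python
# Hypothetical layout: [count: u32][offsets: (count+1) x u32][element bytes].
# The offsets array lets a reader slice out element i without decoding
# the elements before it, which plain Thrift lists cannot do.
import struct

def pack_list(elements: list[bytes]) -> bytes:
    """Concatenate elements, prefixed by a count and an offsets array."""
    offsets = [0]
    for e in elements:
        offsets.append(offsets[-1] + len(e))
    header = struct.pack(f"<{len(offsets)}I", *offsets)
    return struct.pack("<I", len(elements)) + header + b"".join(elements)

def get_element(buf: bytes, i: int) -> bytes:
    """O(1) random access to element i; the other elements stay opaque."""
    (n,) = struct.unpack_from("<I", buf, 0)
    start, end = struct.unpack_from("<2I", buf, 4 + 4 * i)
    body = 4 + 4 * (n + 1)  # first byte of the element area
    return buf[body + start : body + end]
```

The same shape works whether the elements are Thrift-serialized structs
(the list<binary> variant) or any other encoding.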
This was adapted from another proposal that I think had different ideas
on how to rework column selection (it seems this would be done on a
per-RowGroup basis).

[1] https://github.com/apache/parquet-format/pull/250/files#r1620984136

On Mon, Jun 3, 2024 at 8:20 AM Antoine Pitrou <[email protected]> wrote:

>
> Everything Jan said below aligns closely with my opinion.
>
> * +1 for going directly to Flatbuffers for the new footer format *if*
> there is a general agreement that moving to Flatbuffers at some point
> is desirable (including from a software ecosystem point of view).
>
> * I don't think there is value in providing a 1-to-1 mapping from the
> old footer encoding to the new encoding. On the contrary, this is the
> opportunity to clean up and correct some of the oddities that have
> accumulated in the past.
>
> Regards
>
> Antoine.
>
>
> On Mon, 3 Jun 2024 15:58:40 +0200
> Jan Finis <[email protected]> wrote:
> > Interesting discussion so far, thanks for driving this Micah! A few
> > points from my side:
> >
> > When considering flatbuffers vs. lazy "binary" nested thrift, vs. own
> > MetaDataPage format, let's also keep architectural simplicity in
> > mind.
> >
> > For example, introducing flatbuffers might sound like a big change at
> > first, but at least it is then *one format* for everything. In
> > contrast, thrift + custom MetaDataPage is two formats. My gut feeling
> > estimate would be that it is probably easier to just introduce a
> > flatbuffers reader instead of special casing some thrift to instead
> > need a custom MetaDataPage reader.
> >
> > The lazy thrift "hack" is something in between the two. It is
> > probably the easiest to adopt, as no new reading logic needs to be
> > written. The thrift decoder just has to be invoked recursively
> > whenever such a lazy field is required. This is nice, but since it
> > doesn't give us random access into lists, it's also only partially
> > helpful.
> >
> > Given all this, from the implementation / architectural cleanliness
> > side, I guess I would prefer just using flatbuffers, unless we find
> > big disadvantages with this. This also brings us closer to Arrow,
> > although that's not too important here.
> >
> > > 1. I think for an initial revision of metadata we should make it
> > > possible to have a 1:1 mapping between PAR1 footers and whatever is
> > > included in the new footer. The rationale for this is to let
> > > implementations that haven't abstracted out thrift structures an
> > > easy path to incorporating the new footer (i.e. just do translation
> > > at the boundaries).
> >
> > I don't fully understand this point, can you elaborate on it. It
> > feels like a non-issue or a super edge case to me. Is this just a
> > DuckDB issue? If so, I am very sure they're happy to change this, as
> > they're quite active and also strive for simplicity, and I would
> > argue that exposing thrift directly isn't that. Our database also
> > allows metadata access in SQL, but we transcode the thrift into JSON.
> > Given that JSON is pretty standard in databases while thrift isn't,
> > I'm sure DuckDB devs will see it the same.
> >
> > Finally, one point I wanted to highlight here (I also mentioned it in
> > the PR): If we want random access, we have to abolish the concept
> > that the data in the columns array is in a different order than in
> > the schema. Your PR [1] even added a new field schema_index for
> > matching between ColumnMetaData and schema position, but this kills
> > random access. If I want to read the third column in the schema, then
> > do an O(1) random access into the third column chunk only to notice
> > that its schema index is totally different and therefore I need a
> > full exhaustive search to find the column that actually belongs to
> > the third column in the schema, then all our random access efforts
> > are in vain.
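A minimal sketch of the ordering concern quoted above: if the columns
list is in schema order, lookup is a direct index; with a schema_index
indirection and arbitrary order, lookup degrades to an exhaustive scan.
The structures here are hypothetical, not Parquet's actual metadata
layout:

```python
# Each dict stands in for a ColumnMetaData entry.

def column_in_schema_order(columns: list[dict], schema_pos: int) -> dict:
    # columns[i] belongs to schema column i: O(1) random access.
    return columns[schema_pos]

def column_via_schema_index(columns: list[dict], schema_pos: int) -> dict:
    # columns are in arbitrary order: O(n) exhaustive search.
    for col in columns:
        if col["schema_index"] == schema_pos:
            return col
    raise KeyError(f"no ColumnMetaData for schema column {schema_pos}")
```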
> >
> > Therefore, the only possible way to make random access useful is to
> > mandate that ColumnMetaData in the columns list has to be in exactly
> > the same order in which the columns appear in the schema.
> >
> > Cheers,
> > Jan
> >
> > [1] https://github.com/apache/parquet-format/pull/250
> >
> >
> > On Sat, Jun 1, 2024 at 10:38 AM Micah Kornfield
> > <[email protected]> wrote:
> >
> > > As an update here/some responses. Alkis [3] is making considerable
> > > progress on a Flatbuffer alternative that shows good performance
> > > benchmarks on some real sample footers (and hopefully soon some
> > > synthetic data from Rok).
> > >
> > > The approaches that currently have public PRs [1][2] IIUC mostly
> > > save time by lazily decompressing thrift metadata (some of the
> > > details differ but it is effectively the same mechanism). This
> > > helps for cases when only a few row groups/columns are needed but
> > > in the limit has the same theoretical performance penalties for
> > > full table reads.
> > >
> > > I would like to get people's take on two points:
> > > 1. I think for an initial revision of metadata we should make it
> > > possible to have a 1:1 mapping between PAR1 footers and whatever is
> > > included in the new footer. The rationale for this is to let
> > > implementations that haven't abstracted out thrift structures an
> > > easy path to incorporating the new footer (i.e. just do translation
> > > at the boundaries).
> > > 2. Do people see value in trying to do a Thrift-only iteration
> > > which addresses the use-case of scanning only a select number of
> > > row groups/columns? Or if Flatbuffers offer an overall better
> > > performance should we jump to using it?
> > >
> > > > After processing the comments I think we might want to discuss
> > > > the extension point
> > > > https://github.com/apache/parquet-format/pull/254 separately.
> > > > > > > > > I think this is already getting reviewed (I also think we touched on > it in > > > the extensibility thread). Since this is really just defining how we > can > > > encapsulate data and doesn't involve any upfront work, I think once > > > everyone has had a chance to comment on it we can hopefully hold a > vote on > > > it (hopefully in the next week or 2). I think the only other viable > > > alternative is what is proposed in [2] which doesn't involve any > mucking > > > with Thrift bytes but poses a slightly larger compatibility risk. > > > > > > Thanks, > > > Micah > > > > > > [1] https://github.com/apache/parquet-format/pull/242 > > > [2] https://github.com/apache/parquet-format/pull/250 > > > [3] > > > > > > > https://github.com/apache/parquet-format/pull/250#pullrequestreview-2091174869 > > > > > > On Thu, May 30, 2024 at 7:21 AM Alkis Evlogimenos < > > > [email protected]> wrote: > > > > > > > Thank you for summarizing Micah and thanks to everyone commenting on > the > > > > proposal and PRs. > > > > > > > > After processing the comments I think we might want to discuss the > > > > extension point https://github.com/apache/parquet-format/pull/254 > > > > separately. > > > > > > > > The extension point will allow vendors to experiment on different > > > metadata > > > > (be it FileMetaData, or ColumnMetaData etc) and when a design is > ready > > > and > > > > validated in large scale, it can be discussed for inclusion to the > > > official > > > > specification. > > > > > > > > On Thu, May 30, 2024 at 9:37 AM Micah Kornfield < > [email protected]> > > > > wrote: > > > > > > > >> As an update Alkis wrote up a nice summary of his thoughts [1][2]. > > > >> > > > >> I updated my PR <https://github.com/apache/parquet-format/pull/250> > > > >> [3] to be more complete. At a high-level (for those that have > already > > > >> reviewed): > > > >> 1. 
I converted more fields to use page-encoding (or added a binary > field > > > >> for thrift serialized encoding when they are expected to be small). > > > >> This might be overdone (happy for this feedback to debate). > > > >> 2. I removed the concept of an external data page for the sake of > > > trying > > > >> to remove design options (we should still benchmark this). It also > I > > > think > > > >> eases implementation burden (more on this below). > > > >> 3. Removed the new encoding. > > > >> 4. I think this is still missing some of the exact changes from > other > > > >> PRs, some of those might be in error (please highlight them) and > some > > > are > > > >> because I hope the individual PRs (i.e. the statistics change that > Alkis > > > >> proposed can get merged before any proposal) > > > >> > > > >> Regarding embedding PAR3 embedding, Alkis's doc [1] highlights > another > > > >> option for doing this that might be more robust but slightly more > > > >> complicated. > > > >> > > > >> I think in terms of items already discussed, whether to try to reuse > > > >> existing structures or use new structures (Alkis is proposing going > > > >> straight to flatbuffers in this regard IIUC after some more tactical > > > >> changes). I think another point raised is the problem with new > > > structures > > > >> is they require implementations (e.g. DuckDB) that do not > encapsulate > > > >> Thrift well to make potentially much larger structural changes. > The > > > way I > > > >> tried to approach it in my PR is it should be O(days) work to take > a > > > PAR3 > > > >> footer and convert it back to PAR1, which will hopefully allow other > > > >> Parquet parsers in the ecosystems to at least get incorporated > sooner > > > even > > > >> if no performance benefits are seen. 
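One of the layout changes mentioned earlier in the thread, moving
metadata from array-of-structs to struct-of-arrays, can be shown with a
toy model. The types here are hypothetical, not Parquet's actual
metadata; the point is that a reader scanning one field (e.g. every row
group's total_byte_size) touches one contiguous array instead of every
struct:

```python
from dataclasses import dataclass

@dataclass
class RowGroupAoS:
    # array-of-structs: one struct per row group
    num_rows: int
    total_byte_size: int

@dataclass
class RowGroupsSoA:
    # struct-of-arrays: one array per field, indexed by row-group number
    num_rows: list
    total_byte_size: list

def to_soa(groups: list) -> RowGroupsSoA:
    """Transpose a list of per-row-group structs into per-field arrays."""
    return RowGroupsSoA(
        num_rows=[g.num_rows for g in groups],
        total_byte_size=[g.total_byte_size for g in groups],
    )
```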
> > > >> > > > >> Quoting from a separate thread that Alkis Started: > > > >> > > > >> 3 is important if we strongly believe that we can get the best > design > > > >>> through testing prototypes on real data and measuring the effects > vs > > > >>> designing changes in PRs. Along the same lines, I am requesting > that > > > you > > > >>> ask through your contacts/customers (I will do the same) for > scrubbed > > > >>> footers of particular interest (wide, deep, etc) so that we can > build a > > > >>> set > > > >>> of real footers on which we can run benchmarks and drive design > > > >>> decisions. > > > >> > > > >> > > > >> I agree with this sentiment. I think some others who have > volunteered to > > > >> work on this have such data and I will see what I can do on my > end. I > > > >> think we should hold off more drastic changes/improvements until we > can > > > get > > > >> better metrics. But I also don't think we should let the "best" be > the > > > >> enemy of the "good". I hope we can ship a PAR3 footer sooner that > gets > > > us > > > >> a large improvement over the status quo and have it adopted fairly > > > widely > > > >> sooner rather than waiting for an optimal design. I also agree > leaving > > > >> room for experimentation is a good idea (I think this can probably > be > > > done > > > >> by combining the methods for embedding that have already been > discussed > > > to > > > >> allow potentially 2 embedded footers). > > > >> > > > >> I think another question that Alkis's proposals raised is how > policies > > > on > > > >> deprecation of fields (especially ones that are currently required > in > > > >> PAR1). I think this is probably a better topic for another thread, > I'll > > > >> try to write a PR formalizing a proposal on feature evolution. 
> > > >> > > > >> > > > >> > > > >> [1] > > > >> > > > > https://docs.google.com/document/d/1PQpY418LkIDHMFYCY8ne_G-CFpThK15LLpzWYbc7rFU/edit > > > > >> [2] > https://lists.apache.org/thread/zdpswrd4yxrj845rmoopqozhk0vrm6vo > > > >> [3] https://github.com/apache/parquet-format/pull/250 > > > >> > > > >> On Tue, May 28, 2024 at 10:56 AM Micah Kornfield < > [email protected] > > > > > > > >> wrote: > > > >> > > > >>> Hi Antoine, > > > >>> Thanks for the great points. Responses inline. > > > >>> > > > >>> > > > >>>> I like your attempt to put the "new" file metadata after the > legacy > > > >>>> one in https://github.com/apache/parquet-format/pull/250, and I > hope > > > it > > > >>>> can actually be made to work (it requires current Parquet readers > to > > > >>>> allow/ignore arbitrary padding at the end of the v1 Thrift > metadata). > > > >>> > > > >>> > > > >>> Thanks (I hope so too). I think the idea is originally from > Alkis. If > > > >>> it doesn't work then there is always an option of doing a little > more > > > >>> involved process of making the footer look like an unknown binary > > > field (an > > > >>> approach I know you have objections to). > > > >>> > > > >>> I'm biased, but I find it much cleaner to define new Thrift > > > >>>> structures (FileMetadataV3, etc.), rather than painstakinly > document > > > >>>> which fields are to be omitted in V3. That would achieve three > > > goals: > > > >>>> 1) make the spec easier to read (even though it would be > physically > > > >>>> longer); 2) make it easier to produce a conformant > implementation > > > >>>> (special rules increase the risks of misunderstandings and > > > >>>> disagreements); 3) allow a later cleanup of the spec once we > agree > > > to > > > >>>> get rid of V1 structs. > > > >>> > > > >>> There are trade-offs here. I agree with the benefits you listed > here. > > > >>> The benefits of reusing existing structs are: > > > >>> 1. 
Lowers the amount of boiler plate code mapping from one to the > other > > > >>> (i.e. simpler initial implementation), since I expect it will be > a > > > while > > > >>> before we have standalone PAR3 files. > > > >>> 2. Allows for lower maintenance burden if there is useful new > metadata > > > >>> that we would like to see added to both structures original and > "V3" > > > >>> structures. > > > >>> > > > >>> - The new encoding in that PR seems like it should be moved to a > > > >>>> separate PR and be discussed in the encodings thread? > > > >>> > > > >>> > > > >>> I'll cross post on that thread. The main reason I included it in > my > > > >>> proposal is I think it provides random access for members out of > the > > > box > > > >>> (as compared to the existing encodings). I think this mostly goes > to > > > your > > > >>> third-point so I'll discuss below. > > > >>> > > > >>> - I'm a bit skeptical about moving Thrift lists into data pages, > rather > > > >>>> than, say, just embed the corresponding Thrift serialization as > > > >>>> binary fields for lazy deserialization. > > > >>> > > > >>> I think this falls into 2 different concerns: > > > >>> 1. The format of how we serialize metadata. > > > >>> 2. Where the serialized metadata lives. > > > >>> > > > >>> For concern #1, I think we should be considering treating these > lists > > > as > > > >>> actual parquet data pages. This allows users to tune this to > their > > > needs > > > >>> for size vs decoding speed, and make use of any improvements to > > > encoding > > > >>> that happen in the future without a spec change. I think this is > likely > > > >>> fairly valuable given the number of systems that cache this data. > The > > > >>> reason I introduced the new encoding was to provide an option > that > > > could be > > > >>> as efficient as possible from a compute perspective. > > > >>> > > > >>> For concern #2, there is no reason encoding a page as a thrift > Binary > > > >>> field would not work. 
The main reason I raised putting them > outside of > > > >>> thrift is for greater control on deserialization (the main > benefit > > > being > > > >>> avoiding copies) for implementations that have a Thrift parser > that > > > doesn't > > > >>> allow these optimizations. In terms of a path forward here, I > think > > > >>> understanding the performance and memory characteristics of each > > > approach. > > > >>> I agree, if there isn't substantial savings from having them be > > > outside the > > > >>> page, then it just adds complexity. > > > >>> > > > >>> Thanks, > > > >>> Micah > > > >>> > > > >>> > > > >>> > > > >>> > > > >>> > > > >>> On Tue, May 28, 2024 at 7:06 AM Antoine Pitrou <[email protected] > > > > > >>> wrote: > > > >>> > > > >>>> > > > >>>> Hello Micah, > > > >>>> > > > >>>> First, kudos for doing this! > > > >>>> > > > >>>> I like your attempt to put the "new" file metadata after the > legacy > > > >>>> one in https://github.com/apache/parquet-format/pull/250, and I > hope > > > it > > > >>>> can actually be made to work (it requires current Parquet readers > to > > > >>>> allow/ignore arbitrary padding at the end of the v1 Thrift > metadata). > > > >>>> > > > >>>> Some assorted comments on other changes that PR is doing: > > > >>>> > > > >>>> - I'm biased, but I find it much cleaner to define new Thrift > > > >>>> structures (FileMetadataV3, etc.), rather than painstakinly > document > > > >>>> which fields are to be omitted in V3. That would achieve three > > > goals: > > > >>>> 1) make the spec easier to read (even though it would be > physically > > > >>>> longer); 2) make it easier to produce a conformant > implementation > > > >>>> (special rules increase the risks of misunderstandings and > > > >>>> disagreements); 3) allow a later cleanup of the spec once we > agree > > > to > > > >>>> get rid of V1 structs. 
> > > >>>> > > > >>>> - The new encoding in that PR seems like it should be moved to a > > > >>>> separate PR and be discussed in the encodings thread? > > > >>>> > > > >>>> - I'm a bit skeptical about moving Thrift lists into data pages, > > > rather > > > >>>> than, say, just embed the corresponding Thrift serialization as > > > >>>> binary fields for lazy deserialization. > > > >>>> > > > >>>> Regards > > > >>>> > > > >>>> Antoine. > > > >>>> > > > >>>> > > > >>>> > > > >>>> On Mon, 27 May 2024 23:06:37 -0700 > > > >>>> Micah Kornfield < > [email protected]> > > > >>>> wrote: > > > >>>> > As a follow-up to the "V3" Discussions [1][2] I wanted to start > a > > > >>>> thread on > > > >>>> > improvements to the footer metadata. > > > >>>> > > > > >>>> > Based on conversation so far, there have been a few proposals > > > >>>> [3][4][5] to > > > >>>> > help better support files with wide schemas and many > row-groups. I > > > >>>> think > > > >>>> > there are a lot of interesting ideas in each. It would be good > to > > > get > > > >>>> > further feedback on these to make sure we aren't missing > anything > > > and > > > >>>> > define a minimal first iteration for doing experimental > benchmarking > > > >>>> to > > > >>>> > prove out an approach. > > > >>>> > > > > >>>> > I think the next steps would ideally be: > > > >>>> > 1. Come to a consensus on the overall approach. > > > >>>> > 2. Prototypes to Benchmark/test to validate the approaches > defined > > > >>>> (if we > > > >>>> > can't come to consensus in item #1, this might help choose a > > > >>>> direction). > > > >>>> > 3. Divide up any final approach into as fine-grained features > as > > > >>>> possible. > > > >>>> > 4. Implement across parquet-java, parquet-cpp, parquet-rs (and > any > > > >>>> other > > > >>>> > implementations that we can get volunteers for). 
Additionally, > if > > > >>>> new APIs > > > >>>> > are needed to make use of the new structure, it would be good > to try > > > >>>> to > > > >>>> > prototype against consumers of Parquet. > > > >>>> > > > > >>>> > Knowing that we have enough people interested in doing #3 is > > > critical > > > >>>> to > > > >>>> > success, so if you have time to devote, it would be helpful to > chime > > > >>>> in > > > >>>> > here (I know some people already noted they could help in the > > > original > > > >>>> > thread). > > > >>>> > > > > >>>> > I think it is likely we will need either an in person sync or > > > another > > > >>>> more > > > >>>> > focused design document could help. I am happy to try to > facilitate > > > >>>> this > > > >>>> > (once we have a better sense of who wants to be involved and > what > > > time > > > >>>> > zones they are in I can schedule a sync if necessary). > > > >>>> > > > > >>>> > Thanks, > > > >>>> > Micah > > > >>>> > > > > >>>> > [1] > > > https://lists.apache.org/thread/5jyhzkwyrjk9z52g0b49g31ygnz73gxo > > > >>>> > [2] > > > >>>> > > > > >>>> > > > > https://docs.google.com/document/d/19hQLYcU5_r5nJB7GtnjfODLlSDiNS24GXAtKg9b0_ls/edit > > > > >>>> > [3] https://github.com/apache/parquet-format/pull/242 > > > >>>> > [4] https://github.com/apache/parquet-format/pull/248 > > > >>>> > [5] https://github.com/apache/parquet-format/pull/250 > > > >>>> > > > > >>>> > > > >>>> > > > >>>> > > > >>>> > > > > > > > > >
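Several messages above discuss appending the new (PAR3) footer after the
legacy Thrift metadata so that existing readers keep working. The sketch
below is my own illustrative byte layout, not the layout from PR 250; it
only shows why the trick can be backward compatible (the 4-byte length
before the trailing PAR1 magic covers both payloads, so a v1 reader that
tolerates trailing bytes after the Thrift struct is unaffected), and it
glosses over real-world concerns such as a Thrift payload that happens
to end in the PAR3 magic:

```python
import struct

PAR1, PAR3 = b"PAR1", b"PAR3"

def write_footer(thrift_bytes: bytes, par3_bytes: bytes) -> bytes:
    # [thrift][par3 payload][par3 length][PAR3][region length][PAR1]
    region = thrift_bytes + par3_bytes + struct.pack("<I", len(par3_bytes)) + PAR3
    return region + struct.pack("<I", len(region)) + PAR1

def read_legacy(footer: bytes) -> bytes:
    # A v1 reader: take the length before PAR1 and hand the whole region
    # to the Thrift decoder, which must ignore trailing non-Thrift bytes.
    assert footer[-4:] == PAR1
    (n,) = struct.unpack("<I", footer[-8:-4])
    return footer[-8 - n : -8]

def read_par3(footer: bytes):
    # A new reader: look for the PAR3 magic at the end of the region and
    # slice out the new footer; fall back to None for plain PAR1 files.
    region = read_legacy(footer)
    if region[-4:] != PAR3:
        return None
    (n,) = struct.unpack("<I", region[-8:-4])
    return region[-8 - n : -8]
```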
