Google docs tend to get lost very quickly. My experience with the
Python PEP process leads me to a preference for a .md file in the repo,
which can be collectively owned and relies on regular GH-based review
tools.

Regards

Antoine.



On Tue, 4 Jun 2024 18:52:42 -0700
Julien Le Dem <[email protected]> wrote:
> I agree that flatbuffer is a good option if we are happy with the perf and
> it lets us access column metadata in O(1) without reading other columns.
> If we're going to make an incompatible metadata change, let's make it once
> with a transition path to easily move from PAR1 to PAR3, letting them
> coexist in a backward-compatible phase for a while.
> 
> I think that before voting on this, we should summarize in a doc the whole
> PAR3 footer metadata discussion:
> 1) Goals (O(1) random access, extensibility, ...)
> 2) Preferred option
> 3) Migration path
> 4) Other options considered and why we didn't pick them (this
> doesn't have to be extensive)
> 
> That will make it easier for people who are impacted but haven't actively
> contributed to this discussion so far to review and chime in.
> This is a big change with large potential impact.
> 
> Do people prefer a Google doc or a PR with a .md file for this? I personally
> like Google docs (we can copy it into the repo after approval)
> 
> Julien
> 
> 
> On Tue, Jun 4, 2024 at 1:53 AM Alkis Evlogimenos
> <[email protected]> wrote:
> 
> > >
> > > Finally, one point I wanted to highlight here (I also mentioned it in the
> > > PR): If we want random access, we have to abolish the concept that the
> > > data in the columns array is in a different order than in the schema.
> > > Your PR [1] even added a new field schema_index for matching between
> > > ColumnMetaData and schema position, but this kills random access. If I
> > > want to read the third column in the schema, then do an O(1) random
> > > access into the third column chunk only to notice that its schema index
> > > is totally different and therefore I need a full exhaustive search to
> > > find the column that actually belongs to the third column in the schema,
> > > then all our random access efforts are in vain.
> >
> >
> > `schema_index` is useful to implement
> > https://issues.apache.org/jira/browse/PARQUET-183 which is more and more
> > prevalent as schemata become wider.
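To make the random-access concern concrete, here is a toy sketch (hypothetical field names, not the actual Thrift or Flatbuffers definitions): if `columns[i]` may describe some other schema position via a `schema_index`, locating the chunk for schema position `i` degrades to a linear scan, while mandating schema order keeps lookup O(1).

```python
# Hypothetical, simplified footer model to illustrate the discussion:
# each column chunk records which schema position it describes.
from dataclasses import dataclass

@dataclass
class ColumnChunk:
    schema_index: int   # schema position this chunk belongs to
    file_offset: int    # where the column data starts in the file

# Chunks stored in an order unrelated to the schema.
columns = [ColumnChunk(2, 4096), ColumnChunk(0, 0), ColumnChunk(1, 1024)]

def find_chunk_scan(cols, schema_pos):
    """O(n): must search, because columns[i] need not match schema position i."""
    return next(c for c in cols if c.schema_index == schema_pos)

# If the spec instead mandates that columns[i] describes schema position i,
# lookup is a single index operation and schema_index becomes redundant.
ordered = sorted(columns, key=lambda c: c.schema_index)

def find_chunk_direct(cols, schema_pos):
    """O(1): the ordering guarantee makes the scan unnecessary."""
    return cols[schema_pos]

assert find_chunk_scan(columns, 2).file_offset == 4096
assert find_chunk_direct(ordered, 2).file_offset == 4096
```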
> >
> > On Mon, Jun 3, 2024 at 5:54 PM Micah Kornfield <[email protected]>
> > wrote:
> >  
> > > Thanks everyone for chiming in.  Some responses inline:
> > >  
> > > >
> > > > The thrift
> > > > decoder just has to be invoked recursively whenever such a lazy field
> > > > is required. This is nice, but since it doesn't give us random access
> > > > into lists, it's also only partially helpful.
> > >
> > > This point is moot if we move to flatbuffers but I think part of the
> > > proposals are either using list<binary> or providing arrow-like offsets
> > > into the serialized binary to support random access of elements.
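The arrow-like-offsets idea can be sketched like this (illustrative names only, not from any of the PRs): a serialized `list<binary>` accompanied by an offsets array lets a reader slice out element `i` without decoding the preceding elements.

```python
# Illustrative only: pack a list of serialized blobs with an offsets array,
# so element i can be sliced in O(1) without scanning earlier elements.
def pack(blobs):
    offsets = [0]
    for blob in blobs:
        offsets.append(offsets[-1] + len(blob))
    return offsets, b"".join(blobs)

def get(offsets, data, i):
    # Arrow-style: element i spans [offsets[i], offsets[i+1])
    return data[offsets[i]:offsets[i + 1]]

metas = [b"col0-meta", b"column1-metadata", b"c2"]
offs, buf = pack(metas)
assert get(offs, buf, 1) == b"column1-metadata"
assert get(offs, buf, 2) == b"c2"
```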
> > >
> > >  
> > > > I don't fully understand this point, can you elaborate on it? It feels
> > > > like a non-issue or a super edge case to me. Is this just a DuckDB
> > > > issue? If so, I am very sure they're happy to change this, as they're
> > > > quite active and also strive for simplicity and I would argue that
> > > > exposing thrift directly isn't that.
> > >
> > > IIUC, I don't think Thrift is public from an end-user perspective.  It
> > > is, however, public in the sense that internally DuckDB exposes the
> > > Thrift structs directly to consuming code.
> > >
> > > > * I don't think there is value in providing a 1-to-1 mapping from the
> > > >   old footer encoding to the new encoding. On the contrary, this is the
> > > >   opportunity to clean up and correct some of the oddities that have
> > > >   accumulated in the past.
> > >
> > > I think I should clarify this, as I see a few distinct cases here:
> > >
> > > 1.  Removing duplication/redundancy that accumulated over the years for
> > > backwards compatibility.
> > > 2.  Removing fields that were never used in practice.
> > > 3.  Changing the layout of fields (e.g. moving from array of structs to
> > > struct of arrays) for performance considerations.
> > > 4.  Writing potentially less metadata (e.g. summarization of metadata
> > > today).
> > >
> > > IMO, I think we should be doing 1,2, and 3.  I don't think we should be
> > > doing 4 (e.g. as a concrete example, see the discussion on
> > > PageEncodingStats [1]).
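Point 3 above (array of structs to struct of arrays) can be sketched with toy data (illustrative field names, not the real metadata fields): the struct-of-arrays layout keeps each field contiguous, so a reader can scan one field for all columns without touching the rest.

```python
# Toy illustration of the AoS vs SoA layout trade-off for per-column metadata.
# Array-of-structs: one record per column; reading one field touches them all.
aos = [
    {"path": "a", "num_values": 10, "total_size": 100},
    {"path": "b", "num_values": 10, "total_size": 250},
]

# Struct-of-arrays: one array per field; scanning a single field
# (e.g. sizes for all columns) touches only that array.
soa = {
    "path": ["a", "b"],
    "num_values": [10, 10],
    "total_size": [100, 250],
}

# Both layouts carry the same information.
assert [c["total_size"] for c in aos] == soa["total_size"]
```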
> > >
> > > > If we want random access, we have to abolish the concept that the data
> > > > in the columns array is in a different order than in the schema. Your
> > > > PR [1] even added a new field schema_index for matching between
> > > > ColumnMetaData and schema position, but this kills random access.
> > >
> > >
> > > I think this is a larger discussion that should be split off, as I don't
> > > think it should block the core work here.  This was adapted from another
> > > proposal that I think had different ideas on how to possibly rework
> > > column selection (it seems this would be on a per-RowGroup basis).
> > >
> > > [1] https://github.com/apache/parquet-format/pull/250/files#r1620984136
> > >
> > >
> > > On Mon, Jun 3, 2024 at 8:20 AM Antoine Pitrou <[email protected]>  
> > wrote:  
> > >  
> > > >
> > > > Everything Jan said below aligns closely with my opinion.
> > > >
> > > > * +1 for going directly to Flatbuffers for the new footer format *if*
> > > >   there is a general agreement that moving to Flatbuffers at some point
> > > >   is desirable (including from a software ecosystem point of view).
> > > >
> > > > * I don't think there is value in providing a 1-to-1 mapping from the
> > > >   old footer encoding to the new encoding. On the contrary, this is the
> > > >   opportunity to clean up and correct some of the oddities that have
> > > >   accumulated in the past.
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> > > >
> > > > On Mon, 3 Jun 2024 15:58:40 +0200
> > > > Jan Finis <[email protected]> wrote:  
> > > > > Interesting discussion so far, thanks for driving this Micah! A few
> > > > > points from my side:
> > > > >
> > > > > When considering flatbuffers vs. lazy "binary" nested thrift, vs. own
> > > > > MetaDataPage format, let's also keep architectural simplicity in
> > > > > mind.
> > > > >
> > > > > For example, introducing flatbuffers might sound like a big change at
> > > > > first, but at least it is then *one format* for everything. In
> > > > > contrast, thrift + custom MetaDataPage is two formats. My gut feeling
> > > > > estimate would be that it is probably easier to just introduce a
> > > > > flatbuffers reader instead of special-casing some thrift to need a
> > > > > custom MetaDataPage reader.
> > > > >
> > > > > The lazy thrift "hack" is something in between the two. It is
> > > > > probably the easiest to adopt, as no new reading logic needs to be
> > > > > written. The thrift decoder just has to be invoked recursively
> > > > > whenever such a lazy field is required. This is nice, but since it
> > > > > doesn't give us random access into lists, it's also only partially
> > > > > helpful.
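The lazy hack described above can be sketched roughly as follows (a hypothetical wrapper, not an actual Thrift API): the nested field is carried as raw bytes and the decoder is invoked only on first access.

```python
# Rough sketch of the lazy "binary" field idea: the nested structure is
# kept as serialized bytes and only decoded when first accessed.
class LazyField:
    def __init__(self, raw, decode):
        self._raw = raw          # serialized bytes, untouched at parse time
        self._decode = decode    # decoder invoked on demand
        self._value = None

    def get(self):
        if self._value is None:
            self._value = self._decode(self._raw)  # decode once, cache
        return self._value

# Stand-in "decoder" for illustration (a real one would parse Thrift).
calls = []
def fake_decode(raw):
    calls.append(raw)
    return raw.decode("ascii").split(",")

field = LazyField(b"rg0,rg1,rg2", fake_decode)
assert calls == []                      # nothing decoded at parse time
assert field.get() == ["rg0", "rg1", "rg2"]
field.get()
assert len(calls) == 1                  # decoded exactly once
```

Note that, as the email says, this defers whole-field decoding but still gives no random access *within* the decoded list.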
> > > > >
> > > > > Given all this, from the implementation / architectural cleanliness
> > > > > side, I guess I would prefer just using flatbuffers, unless we find
> > > > > big disadvantages with this. This also brings us closer to Arrow,
> > > > > although that's not too important here.
> > > > >
> > > > >
> > > > >  
> > > > > > 1.  I think for an initial revision of metadata we should make it
> > > > > > possible to have a 1:1 mapping between PAR1 footers and whatever is
> > > > > > included in the new footer.  The rationale for this is to let
> > > > > > implementations that haven't abstracted out thrift structures an
> > > > > > easy path to incorporating the new footer (i.e. just do translation
> > > > > > at the boundaries).
> > > > > >  
> > > > >
> > > > > I don't fully understand this point, can you elaborate on it? It
> > > > > feels like a non-issue or a super edge case to me. Is this just a
> > > > > DuckDB issue? If so, I am very sure they're happy to change this, as
> > > > > they're quite active and also strive for simplicity and I would argue
> > > > > that exposing thrift directly isn't that. Our database also allows
> > > > > metadata access in SQL, but we transcode the thrift into JSON. Given
> > > > > that JSON is pretty standard in databases while thrift isn't, I'm
> > > > > sure DuckDB devs will see it the same.
> > > > >
> > > > >
> > > > > Finally, one point I wanted to highlight here (I also mentioned it
> > > > > in the PR): If we want random access, we have to abolish the concept
> > > > > that the data in the columns array is in a different order than in
> > > > > the schema. Your PR [1] even added a new field schema_index for
> > > > > matching between ColumnMetaData and schema position, but this kills
> > > > > random access. If I want to read the third column in the schema,
> > > > > then do an O(1) random access into the third column chunk only to
> > > > > notice that its schema index is totally different and therefore I
> > > > > need a full exhaustive search to find the column that actually
> > > > > belongs to the third column in the schema, then all our random
> > > > > access efforts are in vain.
> > > > >
> > > > > Therefore, the only possible way to make random access useful is to
> > > > > mandate that ColumnMetaData in the columns list has to be in exactly
> > > > > the same order in which the columns appear in the schema.
> > > > >
> > > > > Cheers,
> > > > > Jan
> > > > >
> > > > > [1] https://github.com/apache/parquet-format/pull/250
> > > > >
> > > > >
> > > > > On Sat, Jun 1, 2024 at 10:38 AM Micah Kornfield
> > > > > <[email protected]> wrote:
> > > > >  
> > > > > > As an update here/some responses.  Alkis [3] is making considerable
> > > > > > progress on a Flatbuffer alternative that shows good performance
> > > > > > benchmarks on some real sample footers (and hopefully soon some
> > > > > > synthetic data from Rok).
> > > > > >
> > > > > > The approaches that currently have public PRs [1][2] IIUC mostly
> > > > > > save time by lazily decompressing thrift metadata (some of the
> > > > > > details differ but it is effectively the same mechanism).  This
> > > > > > helps for cases when only a few row groups/columns are needed but
> > > > > > in the limit has the same theoretical performance penalties for
> > > > > > full table reads.
> > > > > >
> > > > > > I would like to get people's take on two points:
> > > > > > 1.  I think for an initial revision of metadata we should make it
> > > > > > possible to have a 1:1 mapping between PAR1 footers and whatever is
> > > > > > included in the new footer.  The rationale for this is to let
> > > > > > implementations that haven't abstracted out thrift structures an
> > > > > > easy path to incorporating the new footer (i.e. just do translation
> > > > > > at the boundaries).
> > > > > > 2.  Do people see value in trying to do a Thrift-only iteration
> > > > > > which addresses the use-case of scanning only a select number of
> > > > > > row groups/columns?  Or if Flatbuffers offer an overall better
> > > > > > performance should we jump to using it?
> > > > > >
> > > > > > > After processing the comments I think we might want to discuss
> > > > > > > the extension point
> > > > > > > https://github.com/apache/parquet-format/pull/254 separately.
> > > > > >
> > > > > >
> > > > > > I think this is already getting reviewed (I also think we touched
> > > > > > on it in the extensibility thread).  Since this is really just
> > > > > > defining how we can encapsulate data and doesn't involve any
> > > > > > upfront work, I think once everyone has had a chance to comment on
> > > > > > it we can hopefully hold a vote on it (hopefully in the next week
> > > > > > or two).  I think the only other viable alternative is what is
> > > > > > proposed in [2], which doesn't involve any mucking with Thrift
> > > > > > bytes but poses a slightly larger compatibility risk.
> > > > > >
> > > > > > Thanks,
> > > > > > Micah
> > > > > >
> > > > > > [1] https://github.com/apache/parquet-format/pull/242
> > > > > > [2] https://github.com/apache/parquet-format/pull/250
> > > > > > [3]
> > > > > > https://github.com/apache/parquet-format/pull/250#pullrequestreview-2091174869
> > > > > >
> > > > > > On Thu, May 30, 2024 at 7:21 AM Alkis Evlogimenos <
> > > > > > [email protected]> wrote:
> > > > > >  
> > > > > > > Thank you for summarizing, Micah, and thanks to everyone
> > > > > > > commenting on the proposal and PRs.
> > > > > > >
> > > > > > > After processing the comments I think we might want to discuss
> > > > > > > the extension point
> > > > > > > https://github.com/apache/parquet-format/pull/254 separately.
> > > > > > >
> > > > > > > The extension point will allow vendors to experiment on different
> > > > > > > metadata (be it FileMetaData, ColumnMetaData, etc.) and when a
> > > > > > > design is ready and validated at large scale, it can be discussed
> > > > > > > for inclusion in the official specification.
> > > > > > >
> > > > > > > On Thu, May 30, 2024 at 9:37 AM Micah Kornfield <
> > > > > > > [email protected]> wrote:
> > > > > > >  
> > > > > > >> As an update, Alkis wrote up a nice summary of his thoughts
> > > > > > >> [1][2].
> > > > > > >>
> > > > > > >> I updated my PR
> > > > > > >> <https://github.com/apache/parquet-format/pull/250> [3] to be
> > > > > > >> more complete.  At a high level (for those that have already
> > > > > > >> reviewed):
> > > > > > >> 1. I converted more fields to use page-encoding (or added a
> > > > > > >> binary field for thrift-serialized encoding when they are
> > > > > > >> expected to be small).  This might be overdone (happy to debate
> > > > > > >> this).
> > > > > > >> 2.  I removed the concept of an external data page for the sake
> > > > > > >> of trying to remove design options (we should still benchmark
> > > > > > >> this). It also, I think, eases implementation burden (more on
> > > > > > >> this below).
> > > > > > >> 3.  Removed the new encoding.
> > > > > > >> 4.  I think this is still missing some of the exact changes from
> > > > > > >> other PRs; some of those might be in error (please highlight
> > > > > > >> them) and some are because I hope the individual PRs (e.g. the
> > > > > > >> statistics change that Alkis proposed) can get merged before any
> > > > > > >> proposal.
> > > > > > >>
> > > > > > >> Regarding PAR3 embedding, Alkis's doc [1] highlights another
> > > > > > >> option for doing this that might be more robust but slightly
> > > > > > >> more complicated.
> > > > > > >>
> > > > > > >> In terms of items already discussed, there is the question of
> > > > > > >> whether to try to reuse existing structures or use new
> > > > > > >> structures (Alkis is proposing going straight to flatbuffers in
> > > > > > >> this regard, IIUC, after some more tactical changes).  Another
> > > > > > >> point raised is that the problem with new structures is they
> > > > > > >> require implementations (e.g. DuckDB) that do not encapsulate
> > > > > > >> Thrift well to make potentially much larger structural changes.
> > > > > > >> The way I tried to approach it in my PR is that it should be
> > > > > > >> O(days) of work to take a PAR3 footer and convert it back to
> > > > > > >> PAR1, which will hopefully allow other Parquet parsers in the
> > > > > > >> ecosystem to at least get incorporated sooner even if no
> > > > > > >> performance benefits are seen.
> > > > > > >>
> > > > > > >> Quoting from a separate thread that Alkis started:
> > > > > > >>
> > > > > > >>> 3 is important if we strongly believe that we can get the best
> > > > > > >>> design through testing prototypes on real data and measuring
> > > > > > >>> the effects vs designing changes in PRs. Along the same lines,
> > > > > > >>> I am requesting that you ask through your contacts/customers (I
> > > > > > >>> will do the same) for scrubbed footers of particular interest
> > > > > > >>> (wide, deep, etc.) so that we can build a set of real footers
> > > > > > >>> on which we can run benchmarks and drive design decisions.
> > > > > > >>
> > > > > > >>
> > > > > > >> I agree with this sentiment. I think some others who have
> > > > > > >> volunteered to work on this have such data and I will see what I
> > > > > > >> can do on my end.  I think we should hold off on more drastic
> > > > > > >> changes/improvements until we can get better metrics.  But I
> > > > > > >> also don't think we should let the "best" be the enemy of the
> > > > > > >> "good".  I hope we can ship a PAR3 footer soon that gets us a
> > > > > > >> large improvement over the status quo and have it adopted fairly
> > > > > > >> widely, rather than waiting for an optimal design.  I also agree
> > > > > > >> leaving room for experimentation is a good idea (I think this
> > > > > > >> can probably be done by combining the methods for embedding that
> > > > > > >> have already been discussed to allow potentially 2 embedded
> > > > > > >> footers).
> > > > > > >>
> > > > > > >> I think another question that Alkis's proposals raised is how
> > > > > > >> policies on deprecation of fields should work (especially ones
> > > > > > >> that are currently required in PAR1).  I think this is probably
> > > > > > >> a better topic for another thread; I'll try to write a PR
> > > > > > >> formalizing a proposal on feature evolution.
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > > >> [1]
> > > > > > >> https://docs.google.com/document/d/1PQpY418LkIDHMFYCY8ne_G-CFpThK15LLpzWYbc7rFU/edit
> > > > > > >> [2]
> > > > > > >> https://lists.apache.org/thread/zdpswrd4yxrj845rmoopqozhk0vrm6vo
> > > > > > >> [3] https://github.com/apache/parquet-format/pull/250
> > > > > > >>
> > > > > > >> On Tue, May 28, 2024 at 10:56 AM Micah Kornfield <
> > > > > > >> [email protected]> wrote:
> > > > > > >>  
> > > > > > >>> Hi Antoine,
> > > > > > >>> Thanks for the great points.  Responses inline.
> > > > > > >>>
> > > > > > >>>  
> > > > > > >>>> I like your attempt to put the "new" file metadata after the
> > > > > > >>>> legacy one in
> > > > > > >>>> https://github.com/apache/parquet-format/pull/250, and I hope
> > > > > > >>>> it can actually be made to work (it requires current Parquet
> > > > > > >>>> readers to allow/ignore arbitrary padding at the end of the v1
> > > > > > >>>> Thrift metadata).
> > > > > > >>>
> > > > > > >>>
> > > > > > >>> Thanks (I hope so too).  I think the idea is originally from
> > > > > > >>> Alkis.  If it doesn't work then there is always the option of a
> > > > > > >>> slightly more involved process of making the footer look like
> > > > > > >>> an unknown binary field (an approach I know you have objections
> > > > > > >>> to).
> > > > > > >>>
> > > > > > >>>> I'm biased, but I find it much cleaner to define new Thrift
> > > > > > >>>>   structures (FileMetadataV3, etc.), rather than painstakingly
> > > > > > >>>>   document which fields are to be omitted in V3. That would
> > > > > > >>>>   achieve three goals: 1) make the spec easier to read (even
> > > > > > >>>>   though it would be physically longer); 2) make it easier to
> > > > > > >>>>   produce a conformant implementation (special rules increase
> > > > > > >>>>   the risks of misunderstandings and disagreements); 3) allow
> > > > > > >>>>   a later cleanup of the spec once we agree to get rid of V1
> > > > > > >>>>   structs.
> > > > > > >>>
> > > > > > >>> There are trade-offs here.  I agree with the benefits you
> > > > > > >>> listed.  The benefits of reusing existing structs are:
> > > > > > >>> 1. Lowers the amount of boilerplate code mapping from one to
> > > > > > >>> the other (i.e. simpler initial implementation), since I expect
> > > > > > >>> it will be a while before we have standalone PAR3 files.
> > > > > > >>> 2. Allows for a lower maintenance burden if there is useful new
> > > > > > >>> metadata that we would like to see added to both the original
> > > > > > >>> and "V3" structures.
> > > > > > >>>
> > > > > > >>>> - The new encoding in that PR seems like it should be moved
> > > > > > >>>>   to a separate PR and be discussed in the encodings thread?
> > > > > > >>>
> > > > > > >>>
> > > > > > >>> I'll cross-post on that thread.  The main reason I included it
> > > > > > >>> in my proposal is that I think it provides random access to
> > > > > > >>> members out of the box (as compared to the existing encodings).
> > > > > > >>> I think this mostly goes to your third point so I'll discuss it
> > > > > > >>> below.
> > > > > > >>>
> > > > > > >>>> - I'm a bit skeptical about moving Thrift lists into data
> > > > > > >>>>   pages, rather than, say, just embedding the corresponding
> > > > > > >>>>   Thrift serialization as binary fields for lazy
> > > > > > >>>>   deserialization.
> > > > > > >>>
> > > > > > >>> I think this falls into 2 different concerns:
> > > > > > >>> 1.  The format of how we serialize metadata.
> > > > > > >>> 2.  Where the serialized metadata lives.
> > > > > > >>>
> > > > > > >>> For concern #1, I think we should consider treating these
> > > > > > >>> lists as actual parquet data pages.  This allows users to tune
> > > > > > >>> this to their needs for size vs decoding speed, and to make use
> > > > > > >>> of any improvements to encoding that happen in the future
> > > > > > >>> without a spec change. I think this is likely fairly valuable
> > > > > > >>> given the number of systems that cache this data.  The reason I
> > > > > > >>> introduced the new encoding was to provide an option that could
> > > > > > >>> be as efficient as possible from a compute perspective.
> > > > > > >>>
> > > > > > >>> For concern #2, there is no reason encoding a page as a thrift
> > > > > > >>> Binary field would not work. The main reason I raised putting
> > > > > > >>> them outside of thrift is for greater control over
> > > > > > >>> deserialization (the main benefit being avoiding copies) for
> > > > > > >>> implementations that have a Thrift parser that doesn't allow
> > > > > > >>> these optimizations.  In terms of a path forward here, I think
> > > > > > >>> we need to understand the performance and memory
> > > > > > >>> characteristics of each approach.  I agree that if there aren't
> > > > > > >>> substantial savings from having them be outside the page, then
> > > > > > >>> it just adds complexity.
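The copy-avoidance point can be illustrated with a small, generic sketch (plain Python, nothing Parquet-specific): slicing a `bytes` object copies the metadata blob, while a `memoryview` over the footer buffer hands the decoder a zero-copy window into it.

```python
# Illustration of the copy-avoidance argument: a metadata blob embedded in a
# larger footer buffer can be handed to a decoder either by copying (a bytes
# slice) or as a zero-copy view (a memoryview slice).
footer = bytearray(b"HEADER" + b"serialized-metadata-page" + b"PAR1")
start, end = 6, 6 + len(b"serialized-metadata-page")

copied = bytes(footer[start:end])     # materializes a new buffer
view = memoryview(footer)[start:end]  # zero-copy window into the footer

assert bytes(view) == copied

# The view tracks the underlying buffer; the copy does not.
footer[start] = ord("S")
assert bytes(view) == b"Serialized-metadata-page"
assert copied == b"serialized-metadata-page"
```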
> > > > > > >>>
> > > > > > >>> Thanks,
> > > > > > >>> Micah
> > > > > > >>>
> > > > > > >>>
> > > > > > >>>
> > > > > > >>>
> > > > > > >>>
> > > > > > >>> On Tue, May 28, 2024 at 7:06 AM Antoine Pitrou <
> > > > > > >>> [email protected]> wrote:
> > > > > > >>>  
> > > > > > >>>>
> > > > > > >>>> Hello Micah,
> > > > > > >>>>
> > > > > > >>>> First, kudos for doing this!
> > > > > > >>>>
> > > > > > >>>> I like your attempt to put the "new" file metadata after the
> > > > > > >>>> legacy one in
> > > > > > >>>> https://github.com/apache/parquet-format/pull/250, and I hope
> > > > > > >>>> it can actually be made to work (it requires current Parquet
> > > > > > >>>> readers to allow/ignore arbitrary padding at the end of the v1
> > > > > > >>>> Thrift metadata).
> > > > > > >>>>
> > > > > > >>>> Some assorted comments on other changes that PR is doing:
> > > > > > >>>>
> > > > > > >>>> - I'm biased, but I find it much cleaner to define new Thrift
> > > > > > >>>>   structures (FileMetadataV3, etc.), rather than painstakingly
> > > > > > >>>>   document which fields are to be omitted in V3. That would
> > > > > > >>>>   achieve three goals: 1) make the spec easier to read (even
> > > > > > >>>>   though it would be physically longer); 2) make it easier to
> > > > > > >>>>   produce a conformant implementation (special rules increase
> > > > > > >>>>   the risks of misunderstandings and disagreements); 3) allow
> > > > > > >>>>   a later cleanup of the spec once we agree to get rid of V1
> > > > > > >>>>   structs.
> > > > > > >>>>
> > > > > > >>>> - The new encoding in that PR seems like it should be moved
> > > > > > >>>>   to a separate PR and be discussed in the encodings thread?
> > > > > > >>>>
> > > > > > >>>> - I'm a bit skeptical about moving Thrift lists into data
> > > > > > >>>>   pages, rather than, say, just embedding the corresponding
> > > > > > >>>>   Thrift serialization as binary fields for lazy
> > > > > > >>>>   deserialization.
> > > > > > >>>>
> > > > > > >>>> Regards
> > > > > > >>>>
> > > > > > >>>> Antoine.
> > > > > > >>>>
> > > > > > >>>>
> > > > > > >>>>
> > > > > > >>>> On Mon, 27 May 2024 23:06:37 -0700
> > > > > > >>>> Micah Kornfield
> > > > > > >>>> <emkornfield-re5jqeeqqe8avxtiumwx3w-xmd5yjdbdmrexy1tmh2...@public.gmane.org>
> > > > > > >>>> wrote:
> > > > > > >>>> > As a follow-up to the "V3" discussions [1][2] I wanted to
> > > > > > >>>> > start a thread on improvements to the footer metadata.
> > > > > > >>>> >
> > > > > > >>>> > Based on the conversation so far, there have been a few
> > > > > > >>>> > proposals [3][4][5] to help better support files with wide
> > > > > > >>>> > schemas and many row-groups.  I think there are a lot of
> > > > > > >>>> > interesting ideas in each. It would be good to get further
> > > > > > >>>> > feedback on these to make sure we aren't missing anything
> > > > > > >>>> > and define a minimal first iteration for doing experimental
> > > > > > >>>> > benchmarking to prove out an approach.
> > > > > > >>>> >
> > > > > > >>>> > I think the next steps would ideally be:
> > > > > > >>>> > 1.  Come to a consensus on the overall approach.
> > > > > > >>>> > 2.  Prototype and benchmark/test to validate the approaches
> > > > > > >>>> > defined (if we can't come to consensus in item #1, this
> > > > > > >>>> > might help choose a direction).
> > > > > > >>>> > 3.  Divide up any final approach into as fine-grained
> > > > > > >>>> > features as possible.
> > > > > > >>>> > 4.  Implement across parquet-java, parquet-cpp, parquet-rs
> > > > > > >>>> > (and any other implementations that we can get volunteers
> > > > > > >>>> > for).  Additionally, if new APIs are needed to make use of
> > > > > > >>>> > the new structure, it would be good to try to prototype
> > > > > > >>>> > against consumers of Parquet.
> > > > > > >>>> >
> > > > > > >>>> > Knowing that we have enough people interested in doing #3 is
> > > > > > >>>> > critical to success, so if you have time to devote, it would
> > > > > > >>>> > be helpful to chime in here (I know some people already
> > > > > > >>>> > noted they could help in the original thread).
> > > > > > >>>> >
> > > > > > >>>> > I think it is likely we will need either an in-person sync
> > > > > > >>>> > or another, more focused design document. I am happy to try
> > > > > > >>>> > to facilitate this (once we have a better sense of who wants
> > > > > > >>>> > to be involved and what time zones they are in, I can
> > > > > > >>>> > schedule a sync if necessary).
> > > > > > >>>> >
> > > > > > >>>> > Thanks,
> > > > > > >>>> > Micah
> > > > > > >>>> >
> > > > > > >>>> > [1]
> > > > > > >>>> > https://lists.apache.org/thread/5jyhzkwyrjk9z52g0b49g31ygnz73gxo
> > > > > > >>>> > [2]
> > > > > > >>>> > https://docs.google.com/document/d/19hQLYcU5_r5nJB7GtnjfODLlSDiNS24GXAtKg9b0_ls/edit
> > > > > > >>>> > [3] https://github.com/apache/parquet-format/pull/242
> > > > > > >>>> > [4] https://github.com/apache/parquet-format/pull/248
> > > > > > >>>> > [5] https://github.com/apache/parquet-format/pull/250
> 


