Re: [DISCUSS] Improvements to File Footer metadata (v3 discussion follow-up)

Alkis Evlogimenos Wed, 05 Jun 2024 13:12:34 -0700

Given this:
https://github.com/apache/parquet-format/pull/254/files#diff-b335630551682c19a781afebcf4d07bf978fb1f8ac04c6bf87428ed5106870f5R321-R331


A sophisticated reader reads the tail of a parquet file and does this:

Result<MetaData> ParseFooter(std::string_view tail) {
  auto sz = tail.size();
  if (sz < 8) return Error("Invalid footer");
  if (tail.substr(sz - 4) != "PAR1") return Error("Invalid footer");

  auto len = ReadLE32(&tail[sz - 8]);
  if (len > tail.size() - 8) return NotEnoughData();
  FileMetaData meta;
  if (!ParseThrift(tail.substr(sz - 8 - len, len), &meta))
    return Error("Invalid thrift");
  return ToMetaData(meta);
}

If the reader knows about the flatbuffer extension it would do the above
slightly differently:

Result<MetaData> ParseFooterExt(std::string_view ext) {
  auto sz = ext.size();
  // Need at least the 16-byte trailer: crc(len), len, crc(data), magic.
  if (sz < 16) return Error("Invalid extension");
  if (ext.substr(sz - 4) != std::string_view("DBR\0", 4))
    return Error("Invalid extension");
  auto crc_ext = ReadLE32(&ext[sz - 8]);
  auto len = ReadLE32(&ext[sz - 12]);
  auto crc_len = ReadLE32(&ext[sz - 16]);
  // Checksum the stored little-endian bytes of len, not the host integer.
  if (crc32(&ext[sz - 12], 4) != crc_len) return Error("Invalid extension");
  if (len > ext.size() - 16) return NotEnoughData();
  auto data = ext.substr(sz - 16 - len, len);
  if (crc32(&data[0], data.size()) != crc_ext)
    return Error("Invalid extension");
  return ToMetaData(ToFlatBuffer(data));
}

Result<MetaData> ParseFooter(std::string_view tail) {
   auto sz = tail.size();
   if (sz < 8) return Error("Invalid footer");
   if (tail.substr(sz - 4) != "PAR1") return Error("Invalid footer");
   // Let's check for DBR extension, a flat buffer footer.
   if (sz > 4 /* magic */ + 4 /* len */ + 16 /* extension trailer */) {
     auto ext = tail.substr(0, sz - 8);  // perhaps extension
     auto res = ParseFooterExt(ext);
     // res already holds a MetaData, so return it as is.
     if (res.ok()) return res;
   }
   auto len = ReadLE32(&tail[sz - 8]);
   if (len > tail.size() - 8) return NotEnoughData();
   FileMetaData meta;
   if (!ParseThrift(tail.substr(sz - 8 - len, len), &meta))
     return Error("Invalid thrift");
   return ToMetaData(meta);
}

Disclaimer: untested code, for exposition only
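
To make the trailer layout read by ParseFooterExt concrete, here is a small
round-trip sketch: a writer that produces [payload][crc(len)][len][crc(payload)]
["DBR\0"] and a reader that walks it backwards. Everything here is an
assumption for illustration: the names Crc32, WrapExtension and
UnwrapExtension are hypothetical, the checksum is plain CRC-32 (the actual
extension may use a different one), and "not enough data" is folded into a
generic failure to keep it short.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <string_view>

// Bitwise CRC-32 (reflected polynomial 0xEDB88320). Assumption: stands in
// for whatever checksum the extension trailer really uses.
inline uint32_t Crc32(std::string_view bytes) {
  uint32_t crc = 0xFFFFFFFFu;
  for (unsigned char b : bytes) {
    crc ^= b;
    for (int i = 0; i < 8; ++i)
      crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
  }
  return ~crc;
}

// Reads a 32-bit little-endian integer regardless of host byte order.
inline uint32_t ReadLE32(const char* p) {
  auto b = [p](int i) {
    return static_cast<uint32_t>(static_cast<unsigned char>(p[i]));
  };
  return b(0) | (b(1) << 8) | (b(2) << 16) | (b(3) << 24);
}

// Appends a 32-bit integer in little-endian byte order.
inline void AppendLE32(std::string& out, uint32_t v) {
  for (int i = 0; i < 4; ++i)
    out.push_back(static_cast<char>((v >> (8 * i)) & 0xFF));
}

constexpr std::string_view kExtMagic("DBR\0", 4);

// Hypothetical writer: payload followed by the 16-byte trailer.
inline std::string WrapExtension(std::string_view payload) {
  std::string len_le;
  AppendLE32(len_le, static_cast<uint32_t>(payload.size()));
  std::string out(payload);
  AppendLE32(out, Crc32(len_le));   // crc over the 4 length bytes
  out += len_le;                    // payload length
  AppendLE32(out, Crc32(payload));  // crc over the payload
  out += kExtMagic;                 // magic, including the trailing NUL
  return out;
}

// Hypothetical reader: parses backwards from the end of `ext`, so any bytes
// in front of the payload are ignored. Returns the payload or nullopt.
inline std::optional<std::string_view> UnwrapExtension(std::string_view ext) {
  const std::size_t sz = ext.size();
  if (sz < 16) return std::nullopt;
  if (ext.substr(sz - 4) != kExtMagic) return std::nullopt;
  const uint32_t crc_ext = ReadLE32(&ext[sz - 8]);
  const uint32_t len = ReadLE32(&ext[sz - 12]);
  const uint32_t crc_len = ReadLE32(&ext[sz - 16]);
  if (Crc32(ext.substr(sz - 12, 4)) != crc_len) return std::nullopt;
  if (len > sz - 16) return std::nullopt;
  const std::string_view payload = ext.substr(sz - 16 - len, len);
  if (Crc32(payload) != crc_ext) return std::nullopt;
  return payload;
}
```

Because the reader only looks at the end of the buffer, whatever precedes the
payload (for example the thrift bytes) never has to be understood.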

It is important that ParseFooterExt does not need to know anything about
thrift. Antoine put this forth as a requirement, and it makes evolving the
format much easier. In addition, the tail does not need to contain both the
old footer and the new one to be useful: we can parse it backwards and still
need only 1 fetch from the object store, as long as the tail we fetched
contains the full flatbuffer footer.
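
As a rough illustration of the "1 fetch" point, the following sketch computes
which extra byte range (if any) a reader would still have to request after a
speculative tail read. The names ByteRange and MissingFooterRange are
hypothetical, and the function assumes the caller has already validated that
the footer length fits in the file.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

struct ByteRange {
  uint64_t offset;  // absolute file offset
  uint64_t length;  // number of bytes
};

// The fetched tail covers [file_size - tail_size, file_size).
// The footer plus its trailer cover
// [file_size - trailer_size - footer_len, file_size).
// Returns the part of the footer that the tail does not cover, or nullopt
// when the tail already contains the whole footer (no second fetch needed).
inline std::optional<ByteRange> MissingFooterRange(uint64_t file_size,
                                                   uint64_t tail_size,
                                                   uint64_t footer_len,
                                                   uint64_t trailer_size) {
  assert(tail_size <= file_size);
  assert(footer_len <= file_size && trailer_size <= file_size - footer_len);
  const uint64_t needed = footer_len + trailer_size;
  if (needed <= tail_size) return std::nullopt;
  return ByteRange{file_size - needed, needed - tail_size};
}
```

For the flatbuffer case the trailer would be the 16-byte extension trailer
plus the 8 bytes of thrift length and "PAR1", so a tail that holds the
flatbuffer footer and those 24 bytes needs no second request.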


On Wed, Jun 5, 2024 at 9:38 PM Jan Finis <[email protected]> wrote:

> Got it, makes sense.
>
> How do we expect readers who want the FlatBuffer footer to get it? Would we
> use [1], storing information about the location of the FlatBuffer footer at
> the end of the file? Or would a reader just read the thrift footer and
> extract the FlatBuffer footer?
>
> Cheers,
> Jan
>
> [1]  https://github.com/apache/parquet-format/pull/250
>
> On Wed, Jun 5, 2024 at 21:12, Alkis Evlogimenos
> <[email protected]> wrote:
>
> > The thrift extension can allow us to append a flatbuffer footer to the
> > thrift FileMetaData as described here:
> >
> > https://github.com/apache/parquet-format/pull/254/files#diff-b335630551682c19a781afebcf4d07bf978fb1f8ac04c6bf87428ed5106870f5R300-R318
> >
> > Then all readers will skip the flatbuffer footer. The reader(s) that know
> > about the flatbuffer footer will use that in practice. The point of the
> > extension mechanism is to allow different organizations to do such
> > experiments without confusing other readers.
> >
> > On Wed, Jun 5, 2024 at 7:07 PM Jan Finis <[email protected]> wrote:
> >
> > > How would a thrift extension help if we'll be moving to FlatBuffers for
> > > metadata? How do the two things work together? How are we planning to
> > > extend FlatBuffers?
> > >
> > >
> > >
> > > On Wed, Jun 5, 2024 at 18:48, Micah Kornfield <
> > > [email protected]> wrote:
> > >
> > > > >
> > > > > 1. ratify https://github.com/apache/parquet-format/pull/254 as the
> > > > > extension mechanism for parquet. With this we can experiment on new footers
> > > > > without having to specify anything else.
> > > >
> > > >
> > > > I think we have probably reached a lazy consensus that is reasonable. I
> > > > think I misspoke earlier but we should at least have parquet-java and a
> > > > second implementation showing that we can write out the arbitrary bytes
> > > > without too much issue (and also read a file that is written in this
> > > > format).  Alkis would you be able to do this?
> > > >
> > > >
> > > > > 3. collaborate on a couple of prototypes, test them in production and come
> > > > > up with a report advocating for their inclusion to parquet proper. With (1)
> > > > > in place these experiments/prototypes can be done in parallel and tested by
> > > > > different organizations without coupling
> > > >
> > > >
> > > > I think it makes sense to timebox.  Did you have any thoughts on the
> > > > duration of experimentation?
> > > >
> > > > > 4. decide which candidate is made official and announce the migration path
> > > > > (deprecate old footers and give timeline for stopping the emission of dual
> > > > > footers)
> > > >
> > > > I hope the strawman proposal on feature releases [1] can be refined
> and
> > > > applied to this case.
> > > >
> > > >
> > > > Thanks,
> > > > Micah
> > > >
> > > > [1] https://github.com/apache/parquet-format/pull/258
> > > >
> > > >
> > > >
> > > > On Wed, Jun 5, 2024 at 2:47 AM Antoine Pitrou <[email protected]>
> > > wrote:
> > > >
> > > > >
> > > > > Google docs tend to get lost very quickly. My experience with the
> > > > > Python PEP process leads me to a preference for a .md file in the
> > repo,
> > > > > that can be collectively owned and rely on regular GH-based review
> > > > > tools.
> > > > >
> > > > > Regards
> > > > >
> > > > > Antoine.
> > > > >
> > > > >
> > > > >
> > > > > On Tue, 4 Jun 2024 18:52:42 -0700
> > > > > Julien Le Dem <[email protected]> wrote:
> > > > > > I agree that flatbuffer is a good option if we are happy with the perf and
> > > > > > it lets us access column metadata in O(1) without reading other columns.
> > > > > > If we're going to make an incompatible metadata change, let's make it once
> > > > > > with a transition path to easily move from PAR1 to PAR3 letting them
> > > > > > coexist in a backward compatible phase for a while.
> > > > > >
> > > > > > I think that before voting on this, we should summarize in a doc
> > the
> > > > > whole
> > > > > > PAR3 footer metadata discussion:
> > > > > > 1) Goals: (O(1) random access, extensibility, ...)
> > > > > > 2) preferred option
> > > > > > 3) migration path.
> > > > > > 4) mention other options we considered and why we didn't pick
> them
> > > > (this
> > > > > > doesn't have to be extensive)
> > > > > >
> > > > > > That will make it easier for people who are impacted but haven't
> > > > actively
> > > > > > contributed to this discussion so far to review and chime in.
> > > > > > This is a big change with large potential impact here.
> > > > > >
> > > > > > Do people prefer google doc or a PR with a .md for this? I
> > personally
> > > > > like
> > > > > > google docs (we can copy it in the repo after approval)
> > > > > >
> > > > > > Julien
> > > > > >
> > > > > >
> > > > > > On Tue, Jun 4, 2024 at 1:53 AM Alkis Evlogimenos
> > > > > > <[email protected]> wrote:
> > > > > >
> > > > > > > >
> > > > > > > > Finally, one point I wanted to highlight here (I also mentioned it in the
> > > > > > > > PR): If we want random access, we have to abolish the concept that the data
> > > > > > > > in the columns array is in a different order than in the schema. Your PR
> > > > > > > > [1] even added a new field schema_index for matching between
> > > > > > > > ColumnMetaData and schema position, but this kills random access. If I want
> > > > > > > > to read the third column in the schema, then do an O(1) random access into
> > > > > > > > the third column chunk only to notice that its schema index is totally
> > > > > > > > different and therefore I need a full exhaustive search to find the column
> > > > > > > > that actually belongs to the third column in the schema, then all our
> > > > > > > > random access efforts are in vain.
> > > > > > >
> > > > > > >
> > > > > > > `schema_index` is useful to implement
> > > > > > > https://issues.apache.org/jira/browse/PARQUET-183 which is
> more
> > > and
> > > > > more
> > > > > > > prevalent as schemata become wider.
> > > > > > >
> > > > > > > On Mon, Jun 3, 2024 at 5:54 PM Micah Kornfield <
> > > > [email protected]>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Thanks everyone for chiming in.  Some responses inline:
> > > > > > > >
> > > > > > > > >
> > > > > > > > > The thrift
> > > > > > > > > decoder just has to be invoked recursively whenever such a
> > lazy
> > > > > field
> > > > > > > is
> > > > > > > > > required. This is nice, but since it doesn't give us random
> > > > access
> > > > > into
> > > > > > > > > lists, it's also only partially helpful.
> > > > > > > >
> > > > > > > > This point is moot if we move to flatbuffers but I think part
> > of
> > > > the
> > > > > > > > proposals are either using list<binary> or providing
> arrow-like
> > > > > offsets
> > > > > > > > into the serialized binary to support random access of
> > elements.
> > > > > > > >
> > > > > > > >
> > > > > > > > > I don't fully understand this point, can you elaborate on
> it.
> > > It
> > > > > feels
> > > > > > > > like
> > > > > > > > > a non-issue or a super edge case to me. Is this just a
> DuckDB
> > > > > issue? If
> > > > > > > > so,
> > > > > > > > > I am very sure they're happy to change this, as they're
> quite
> > > > > active
> > > > > > > and
> > > > > > > > > also strive for simplicity and I would argue that exposing
> > > thrift
> > > > > > > > directly
> > > > > > > > > isn't that.
> > > > > > > >
> > > > > > > > IIUC, I don't think Thrift is public from an end-user
> > > perspective.
> > > > > It is
> > > > > > > > however public in the fact that internally DuckDB exposes the
> > > > Thrift
> > > > > > > > structs directly to consuming code.
> > > > > > > >
> > > > > > > > * I don't think there is value in providing a 1-to-1 mapping
> > from
> > > > > the
> > > > > > > > >   old footer encoding to the new encoding. On the contrary,
> > > this
> > > > > is the
> > > > > > > > >   opportunity to clean up and correct some of the oddities
> > that
> > > > > have
> > > > > > > > >   accumulated in the past.
> > > > > > > >
> > > > > > > > I think I should clarify this, as I see a few distinct cases
> > > here:
> > > > > > > >
> > > > > > > > 1.  Removing duplication/redundancy that accumulated over the
> > > years
> > > > > for
> > > > > > > > backwards compatibility.
> > > > > > > > 2.  Removing fields that were never used in practice.
> > > > > > > > 3.  Changing the layout of fields (e.g. moving from array of
> > > > structs
> > > > > to
> > > > > > > > struct of arrays) for performance considerations.
> > > > > > > > 4.  Writing potentially less metadata (e.g. summarization of
> > > > metadata
> > > > > > > > today).
> > > > > > > >
> > > > > > > > IMO, I think we should be doing 1,2, and 3.  I don't think we
> > > > should
> > > > > be
> > > > > > > > doing 4 (e.g. as a concrete example, see the discussion on
> > > > > > > > PageEncodingStats [1]).
> > > > > > > >
> > > > > > > > If we want random access, we have to abolish the concept that
> > the
> > > > > data
> > > > > > > > > in the columns array is in a different order than in the
> > > schema.
> > > > > Your
> > > > > > > PR
> > > > > > > > > [1] even added a new field schema_index for matching
> between
> > > > > > > > ColumnMetaData
> > > > > > > > > and schema position, but this kills random access.
> > > > > > > >
> > > > > > > >
> > > > > > > > I think this is a larger discussion that should be split off,
> > as
> > > I
> > > > > don't
> > > > > > > > think it should block the core work here.  This was adapted
> > from
> > > > > another
> > > > > > > > proposal, that I think had different ideas on how possible
> > rework
> > > > > column
> > > > > > > > selection (it seems this would be on a per RowGroup basis).
> > > > > > > >
> > > > > > > > [1] https://github.com/apache/parquet-format/pull/250/files#r1620984136
> > > > > > > >
> > > > > > > >
> > > > > > > > On Mon, Jun 3, 2024 at 8:20 AM Antoine Pitrou <
> > > [email protected]>
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > >
> > > > > > > > > Everything Jan said below aligns closely with my opinion.
> > > > > > > > >
> > > > > > > > > * +1 for going directly to Flatbuffers for the new footer
> > > format
> > > > > *if*
> > > > > > > > >   there is a general agreement that moving to Flatbuffers
> at
> > > some
> > > > > point
> > > > > > > > >   is desirable (including from a software ecosystem point
> of
> > > > view).
> > > > > > > > >
> > > > > > > > > * I don't think there is value in providing a 1-to-1
> mapping
> > > from
> > > > > the
> > > > > > > > >   old footer encoding to the new encoding. On the contrary,
> > > this
> > > > > is the
> > > > > > > > >   opportunity to clean up and correct some of the oddities
> > that
> > > > > have
> > > > > > > > >   accumulated in the past.
> > > > > > > > >
> > > > > > > > > Regards
> > > > > > > > >
> > > > > > > > > Antoine.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Mon, 3 Jun 2024 15:58:40 +0200
> > > > > > > > > Jan Finis <[email protected]> wrote:
> > > > > > > > > > Interesting discussion so far, thanks for driving this
> > > Micah! A
> > > > > few
> > > > > > > > > points
> > > > > > > > > > from my side:
> > > > > > > > > >
> > > > > > > > > > When considering flatbuffers vs. lazy "binary" nested
> > thrift,
> > > > > vs. own
> > > > > > > > > > MetaDataPage format, let's also keep architectural
> > simplicity
> > > > > in
> > > > > > > mind.
> > > > > > > > > >
> > > > > > > > > > For example, introducing flatbuffers might sound like a
> big
> > > > > change at
> > > > > > > > > > first, but at least it is then *one format* for
> everything.
> > > In
> > > > > > > > contrast,
> > > > > > > > > > thrift + custom MetaDataPage is two formats. My gut
> feeling
> > > > > estimate
> > > > > > > > > > would be that it is probably easier to just introduce a
> > > > > flatbuffers
> > > > > > > > > reader
> > > > > > > > > > instead of special casing some thrift to instead need a
> > > custom
> > > > > > > > > MetaDataPage
> > > > > > > > > > reader.
> > > > > > > > > >
> > > > > > > > > > The lazy thrift "hack" is something in between the two.
> It
> > is
> > > > > > > probably
> > > > > > > > > the
> > > > > > > > > > easiest to adopt, as no new reading logic needs to be
> > > written.
> > > > > The
> > > > > > > > thrift
> > > > > > > > > > decoder just has to be invoked recursively whenever such
> a
> > > lazy
> > > > > field
> > > > > > > > is
> > > > > > > > > > required. This is nice, but since it doesn't give us
> random
> > > > > access
> > > > > > > into
> > > > > > > > > > lists, it's also only partially helpful.
> > > > > > > > > >
> > > > > > > > > > Given all this, from the implementation / architectural
> > > > > cleanliness
> > > > > > > > > side, I
> > > > > > > > > > guess I would prefer just using flatbuffers, unless we
> find
> > > big
> > > > > > > > > > disadvantages with this. This also brings us closer to
> > Arrow,
> > > > > > > although
> > > > > > > > > > that's not too important here.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > 1.  I think for an initial revision of metadata we
> should
> > > > make
> > > > > it
> > > > > > > > > possible
> > > > > > > > > > > to have a 1:1 mapping between PAR1 footers and whatever
> > is
> > > > > included
> > > > > > > > in
> > > > > > > > > the
> > > > > > > > > > > new footer.  The rationale for this is to let
> > > implementations
> > > > > that
> > > > > > > > > haven't
> > > > > > > > > > > abstracted out thrift structures an easy path to
> > > > incorporating
> > > > > the
> > > > > > > > new
> > > > > > > > > > > footer (i.e. just do translation at the boundaries).
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > I don't fully understand this point, can you elaborate on
> > it.
> > > > > It
> > > > > > > feels
> > > > > > > > > like
> > > > > > > > > > a non-issue or a super edge case to me. Is this just a
> > DuckDB
> > > > > issue?
> > > > > > > If
> > > > > > > > > so,
> > > > > > > > > > I am very sure they're happy to change this, as they're
> > quite
> > > > > active
> > > > > > > > and
> > > > > > > > > > also strive for simplicity and I would argue that
> exposing
> > > > > thrift
> > > > > > > > > directly
> > > > > > > > > > isn't that. Our database also allows metadata access in
> > SQL,
> > > > but
> > > > > we
> > > > > > > > > > transcode the thrift into JSON. Given that JSON is pretty
> > > > > standard in
> > > > > > > > > > databases while thrift isn't, I'm sure DuckDB devs will
> see
> > > it
> > > > > the
> > > > > > > > same.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Finally, one point I wanted to highlight here (I also mentioned it in the
> > > > > > > > > > PR): If we want random access, we have to abolish the concept that the data
> > > > > > > > > > in the columns array is in a different order than in the schema. Your PR
> > > > > > > > > > [1] even added a new field schema_index for matching between ColumnMetaData
> > > > > > > > > > and schema position, but this kills random access. If I want to read the
> > > > > > > > > > third column in the schema, then do an O(1) random access into the third
> > > > > > > > > > column chunk only to notice that its schema index is totally different and
> > > > > > > > > > therefore I need a full exhaustive search to find the column that actually
> > > > > > > > > > belongs to the third column in the schema, then all our random access
> > > > > > > > > > efforts are in vain.
> > > > > > > > > >
> > > > > > > > > > Therefore, the only possible way to make random access
> > useful
> > > > is
> > > > > to
> > > > > > > > > mandate
> > > > > > > > > > that ColumnMetaData in the columns list has to be in
> > exactly
> > > > the
> > > > > same
> > > > > > > > > order
> > > > > > > > > > in which the columns appear in the schema.
> > > > > > > > > >
> > > > > > > > > > Cheers,
> > > > > > > > > > Jan
> > > > > > > > > >
> > > > > > > > > > [1] https://github.com/apache/parquet-format/pull/250
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Sat, Jun 1, 2024 at 10:38, Micah Kornfield <
> > > > > > > > > > [email protected]> wrote:
> > > > > > > > > >
> > > > > > > > > > > As an update here/some responses.  Alkis [3] is making
> > > > > considerable
> > > > > > > > > > > progress on a Flatbuffer alternative that shows good
> > > > > performance
> > > > > > > > > benchmarks
> > > > > > > > > > > on some real sample footers (and hopefully soon some
> > > > synthetic
> > > > > data
> > > > > > > > > from
> > > > > > > > > > > Rok).
> > > > > > > > > > >
> > > > > > > > > > > The approaches that currently have public PRs [1][2]
> IIUC
> > > > > mostly
> > > > > > > save
> > > > > > > > > time
> > > > > > > > > > > by lazily decompressing thrift metadata (some of the
> > > details
> > > > > differ
> > > > > > > > > but it
> > > > > > > > > > > is effectively the same mechanism).  This helps for
> cases
> > > > > when
> > > > > > > only a
> > > > > > > > > few
> > > > > > > > > > > row groups/columns are needed but in the limit has the
> > same
> > > > > > > > theoretical
> > > > > > > > > > > performance penalties for full table reads.
> > > > > > > > > > >
> > > > > > > > > > > I would like to get people's take on two points:
> > > > > > > > > > > 1.  I think for an initial revision of metadata we
> should
> > > > make
> > > > > it
> > > > > > > > > possible
> > > > > > > > > > > to have a 1:1 mapping between PAR1 footers and whatever
> > is
> > > > > included
> > > > > > > > in
> > > > > > > > > the
> > > > > > > > > > > new footer.  The rationale for this is to let
> > > implementations
> > > > > that
> > > > > > > > > haven't
> > > > > > > > > > > abstracted out thrift structures an easy path to
> > > > incorporating
> > > > > the
> > > > > > > > new
> > > > > > > > > > > footer (i.e. just do translation at the boundaries).
> > > > > > > > > > > 2.  Do people see value in trying to do a Thrift only
> > > > > iteration
> > > > > > > which
> > > > > > > > > > > addresses the use-case of scanning only a select number
> > of
> > > > row
> > > > > > > > > > > groups/columns?  Or if Flatbuffers offer an overall
> > better
> > > > > > > > performance
> > > > > > > > > > > should we jump to using it?
> > > > > > > > > > >
> > > > > > > > > > > After processing the comments I think we might want to
> > > > discuss
> > > > > the
> > > > > > > > > > > > extension point
> > > > > > > https://github.com/apache/parquet-format/pull/254
> > > > > > > > > > > >  separately.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > I think this is already getting reviewed (I also think
> we
> > > > > touched
> > > > > > > on
> > > > > > > > > it in
> > > > > > > > > > > the extensibility thread).  Since this is really just
> > > > defining
> > > > > how
> > > > > > > we
> > > > > > > > > can
> > > > > > > > > > > encapsulate data and doesn't involve any upfront work,
> I
> > > > think
> > > > > once
> > > > > > > > > > > everyone has had a chance to comment on it we can
> > hopefully
> > > > > hold a
> > > > > > > > > vote on
> > > > > > > > > > > it (hopefully in the next week or 2).  I think the only
> > > other
> > > > > > > viable
> > > > > > > > > > > alternative is what is proposed in [2] which doesn't
> > > involve
> > > > > any
> > > > > > > > > mucking
> > > > > > > > > > > with Thrift bytes but poses a slightly larger
> > compatibility
> > > > > risk.
> > > > > > > > > > >
> > > > > > > > > > > Thanks,
> > > > > > > > > > > Micah
> > > > > > > > > > >
> > > > > > > > > > > [1] https://github.com/apache/parquet-format/pull/242
> > > > > > > > > > > [2] https://github.com/apache/parquet-format/pull/250
> > > > > > > > > > > [3] https://github.com/apache/parquet-format/pull/250#pullrequestreview-2091174869
> > > > > > > > > > >
> > > > > > > > > > > On Thu, May 30, 2024 at 7:21 AM Alkis Evlogimenos <
> > > > > > > > > > > [email protected]> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Thank you for summarizing Micah and thanks to
> everyone
> > > > > commenting
> > > > > > > > on
> > > > > > > > > the
> > > > > > > > > > > > proposal and PRs.
> > > > > > > > > > > >
> > > > > > > > > > > > After processing the comments I think we might want
> to
> > > > > discuss
> > > > > > > the
> > > > > > > > > > > > extension point
> > > > > > > https://github.com/apache/parquet-format/pull/254
> > > > > > > > > > > > separately.
> > > > > > > > > > > >
> > > > > > > > > > > > The extension point will allow vendors to experiment
> on
> > > > > different
> > > > > > > > > > > metadata
> > > > > > > > > > > > (be it FileMetaData, or ColumnMetaData etc) and when
> a
> > > > > design is
> > > > > > > > > ready
> > > > > > > > > > > and
> > > > > > > > > > > > validated in large scale, it can be discussed for
> > > inclusion
> > > > > to
> > > > > > > the
> > > > > > > > > > > official
> > > > > > > > > > > > specification.
> > > > > > > > > > > >
> > > > > > > > > > > > On Thu, May 30, 2024 at 9:37 AM Micah Kornfield <
> > > > > > > > > [email protected]>
> > > > > > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > >> As an update Alkis wrote up a nice summary of his
> > > thoughts
> > > > > > > [1][2].
> > > > > > > > > > > >>
> > > > > > > > > > > >> I updated my PR <
> > > > > > > > https://github.com/apache/parquet-format/pull/250>
> > > > > > > > > > > >> [3] to be more complete.  At a high-level (for those
> > > that
> > > > > have
> > > > > > > > > already
> > > > > > > > > > > >> reviewed):
> > > > > > > > > > > >> 1. I converted more fields to use page-encoding (or
> > > added
> > > > > a
> > > > > > > binary
> > > > > > > > > field
> > > > > > > > > > > >> for thrift serialized encoding when they are
> expected
> > to
> > > > > be
> > > > > > > > small).
> > > > > > > > > > > >> This might be overdone (happy for this feedback to
> > > > debate).
> > > > > > > > > > > >> 2.  I removed the concept of an external data page
> for
> > > the
> > > > > sake
> > > > > > > of
> > > > > > > > > > > trying
> > > > > > > > > > > >> to remove design options (we should still benchmark
> > > this).
> > > > > It
> > > > > > > also
> > > > > > > > > I
> > > > > > > > > > > think
> > > > > > > > > > > >> eases implementation burden (more on this below).
> > > > > > > > > > > >> 3.  Removed the new encoding.
> > > > > > > > > > > >> 4.  I think this is still missing some of the exact
> > > > changes
> > > > > from
> > > > > > > > > other
> > > > > > > > > > > >> PRs, some of those might be in error (please
> highlight
> > > > > them) and
> > > > > > > > > some
> > > > > > > > > > > are
> > > > > > > > > > > >> because I hope the individual PRs (i.e. the
> statistics
> > > > > change
> > > > > > > that
> > > > > > > > > Alkis
> > > > > > > > > > > >> proposed can get merged before any proposal)
> > > > > > > > > > > >>
> > > > > > > > > > > > >> Regarding PAR3 embedding, Alkis's doc [1] highlights another
> > > > > > > > > > > > >> option for doing this that might be more robust but slightly more
> > > > > > > > > > > > >> complicated.
> > > > > > > > > > > >>
> > > > > > > > > > > >> I think in terms of items already discussed, whether
> > to
> > > > try
> > > > > to
> > > > > > > > reuse
> > > > > > > > > > > >> existing structures or use new structures (Alkis is
> > > > > proposing
> > > > > > > > going
> > > > > > > > > > > >> straight to flatbuffers in this regard IIUC after
> some
> > > > > more
> > > > > > > > tactical
> > > > > > > > > > > >> changes).  I think another point raised is the
> problem
> > > > with
> > > > > new
> > > > > > > > > > > structures
> > > > > > > > > > > >> is they require implementations (e.g. DuckDB) that
> do
> > > not
> > > > > > > > > encapsulate
> > > > > > > > > > > >> Thrift well to make potentially much larger
> structural
> > > > > changes.
> > > > > > > > > The
> > > > > > > > > > > way I
> > > > > > > > > > > >> tried to approach it in my PR is it should be
> O(days)
> > > work
> > > > > to
> > > > > > > take
> > > > > > > > > a
> > > > > > > > > > > PAR3
> > > > > > > > > > > >> footer and convert it back to PAR1, which will
> > hopefully
> > > > > allow
> > > > > > > > other
> > > > > > > > > > > >> Parquet parsers in the ecosystems to at least get
> > > > > incorporated
> > > > > > > > > sooner
> > > > > > > > > > > even
> > > > > > > > > > > >> if no performance benefits are seen.
> > > > > > > > > > > >>
> > > > > > > > > > > >> Quoting from a separate thread that Alkis Started:
> > > > > > > > > > > >>
> > > > > > > > > > > >> 3 is important if we strongly believe that we can
> get
> > > the
> > > > > best
> > > > > > > > > design
> > > > > > > > > > > >>> through testing prototypes on real data and
> measuring
> > > the
> > > > > > > effects
> > > > > > > > > vs
> > > > > > > > > > > >>> designing changes in PRs. Along the same lines, I
> am
> > > > > requesting
> > > > > > > > > that
> > > > > > > > > > > you
> > > > > > > > > > > >>> ask through your contacts/customers (I will do the
> > > same)
> > > > > for
> > > > > > > > > scrubbed
> > > > > > > > > > > >>> footers of particular interest (wide, deep, etc) so
> > > that
> > > > > we can
> > > > > > > > > build a
> > > > > > > > > > > >>> set
> > > > > > > > > > > >>> of real footers on which we can run benchmarks and
> > > drive
> > > > > design
> > > > > > > > > > > >>> decisions.
> > > > > > > > > > > >>
> > > > > > > > > > > >>
> > > > > > > > > > > >> I agree with this sentiment. I think some others who
> > > have
> > > > > > > > > volunteered to
> > > > > > > > > > > >> work on this have such data and I will see what I
> can
> > do
> > > > on
> > > > > my
> > > > > > > > > end.  I
> > > > > > > > > > > >> think we should hold off more drastic
> > > changes/improvements
> > > > > until
> > > > > > > > we
> > > > > > > > > can
> > > > > > > > > > > get
> > > > > > > > > > > >> better metrics.  But I also don't think we should
> let
> > > the
> > > > > "best"
> > > > > > > > be
> > > > > > > > > the
> > > > > > > > > > > >> enemy of the "good".  I hope we can ship a PAR3
> footer
> > > > > sooner
> > > > > > > that
> > > > > > > > > gets
> > > > > > > > > > > us
> > > > > > > > > > > >> a large improvement over the status quo and have it
> > > > > adopted
> > > > > > > fairly
> > > > > > > > > > > widely
> > > > > > > > > > > >> sooner rather than waiting for an optimal design.  I
> > > also
> > > > > agree
> > > > > > > > > leaving
> > > > > > > > > > > >> room for experimentation is a good idea (I think
> this
> > > can
> > > > > > > probably
> > > > > > > > > be
> > > > > > > > > > > done
> > > > > > > > > > > >> by combining the methods for embedding that have
> > already
> > > > > been
> > > > > > > > > discussed
> > > > > > > > > > > to
> > > > > > > > > > > >> allow potentially 2 embedded footers).
> > > > > > > > > > > >>
> > > > > > > > > > > >> I think another question that Alkis's proposals
> raised
> > > is
> > > > > how
> > > > > > > > > policies
> > > > > > > > > > > on
> > > > > > > > > > > >> deprecation of fields (especially ones that are
> > > currently
> > > > > > > required
> > > > > > > > > in
> > > > > > > > > > > >> PAR1).  I think this is probably a better topic for
> > > > > another
> > > > > > > > thread,
> > > > > > > > > I'll
> > > > > > > > > > > >> try to write a PR formalizing a proposal on feature
> > > > > evolution.
> > > > > > > > > > > >>
> > > > > > > > > > > >>
> > > > > > > > > > > >>
> > > > > > > > > > > > >> [1] https://docs.google.com/document/d/1PQpY418LkIDHMFYCY8ne_G-CFpThK15LLpzWYbc7rFU/edit
> > > > > > > > > > > > >> [2] https://lists.apache.org/thread/zdpswrd4yxrj845rmoopqozhk0vrm6vo
> > > > > > > > > > > > >> [3] https://github.com/apache/parquet-format/pull/250
> > > > > > > > > > > >>
> > > > > > > > > > > >> On Tue, May 28, 2024 at 10:56 AM Micah Kornfield <
> > > > > > > > > [email protected]
> > > > > > > > > > > >
> > > > > > > > > > > >> wrote:
> > > > > > > > > > > >>
> > > > > > > > > > > >>> Hi Antoine,
> > > > > > > > > > > >>> Thanks for the great points.  Responses inline.
> > > > > > > > > > > >>>
> > > > > > > > > > > >>>
> > > > > > > > > > > > >>>> I like your attempt to put the "new" file metadata after the legacy
> > > > > > > > > > > > >>>> one in https://github.com/apache/parquet-format/pull/250, and I hope it
> > > > > > > > > > > > >>>> can actually be made to work (it requires current Parquet readers to
> > > > > > > > > > > > >>>> allow/ignore arbitrary padding at the end of the v1 Thrift metadata).
> > > > > > > > > > > >>>
> > > > > > > > > > > >>>
> > > > > > > > > > > > >>> Thanks (I hope so too).  I think the idea is originally from Alkis.  If
> > > > > > > > > > > > >>> it doesn't work then there is always an option of doing a little more
> > > > > > > > > > > > >>> involved process of making the footer look like an unknown binary field (an
> > > > > > > > > > > > >>> approach I know you have objections to).
> > > > > > > > > > > >>>
> > > > > > > > > > > > >>>> - I'm biased, but I find it much cleaner to define new Thrift
> > > > > > > > > > > > >>>>   structures (FileMetadataV3, etc.), rather than painstakingly document
> > > > > > > > > > > > >>>>   which fields are to be omitted in V3. That would achieve three goals:
> > > > > > > > > > > > >>>>   1) make the spec easier to read (even though it would be physically
> > > > > > > > > > > > >>>>   longer); 2) make it easier to produce a conformant implementation
> > > > > > > > > > > > >>>>   (special rules increase the risks of misunderstandings and
> > > > > > > > > > > > >>>>   disagreements); 3) allow a later cleanup of the spec once we agree to
> > > > > > > > > > > > >>>>   get rid of V1 structs.
> > > > > > > > > > > >>>
> > > > > > > > > > > > >>> There are trade-offs here.  I agree with the benefits you listed here.
> > > > > > > > > > > > >>> The benefits of reusing existing structs are:
> > > > > > > > > > > > >>> 1. Lowers the amount of boilerplate code mapping from one to the other
> > > > > > > > > > > > >>> (i.e. simpler initial implementation), since I expect it will be a while
> > > > > > > > > > > > >>> before we have standalone PAR3 files.
> > > > > > > > > > > > >>> 2. Allows for lower maintenance burden if there is useful new metadata
> > > > > > > > > > > > >>> that we would like to see added to both the original and "V3"
> > > > > > > > > > > > >>> structures.
> > > > > > > > > > > >>>
> > > > > > > > > > > > >>>> - The new encoding in that PR seems like it should be moved to a
> > > > > > > > > > > > >>>>   separate PR and be discussed in the encodings thread?
> > > > > > > > > > > >>>
> > > > > > > > > > > >>>
> > > > > > > > > > > > >>> I'll cross post on that thread.  The main reason I included it in my
> > > > > > > > > > > > >>> proposal is I think it provides random access for members out of the box
> > > > > > > > > > > > >>> (as compared to the existing encodings).  I think this mostly goes to your
> > > > > > > > > > > > >>> third point so I'll discuss below.
> > > > > > > > > > > >>>
> > > > > > > > > > > > >>>> - I'm a bit skeptical about moving Thrift lists into data pages, rather
> > > > > > > > > > > > >>>>   than, say, just embed the corresponding Thrift serialization as
> > > > > > > > > > > > >>>>   binary fields for lazy deserialization.
> > > > > > > > > > > >>>
> > > > > > > > > > > >>> I think this falls into 2 different concerns:
> > > > > > > > > > > >>> 1.  The format of how we serialize metadata.
> > > > > > > > > > > >>> 2.  Where the serialized metadata lives.
> > > > > > > > > > > >>>
> > > > > > > > > > > > >>> For concern #1, I think we should be considering treating these lists as
> > > > > > > > > > > > >>> actual parquet data pages.  This allows users to tune this to their needs
> > > > > > > > > > > > >>> for size vs decoding speed, and make use of any improvements to encoding
> > > > > > > > > > > > >>> that happen in the future without a spec change. I think this is likely
> > > > > > > > > > > > >>> fairly valuable given the number of systems that cache this data.  The
> > > > > > > > > > > > >>> reason I introduced the new encoding was to provide an option that could be
> > > > > > > > > > > > >>> as efficient as possible from a compute perspective.
> > > > > > > > > > > >>>
> > > > > > > > > > > > >>> For concern #2, there is no reason encoding a page as a thrift Binary
> > > > > > > > > > > > >>> field would not work. The main reason I raised putting them outside of
> > > > > > > > > > > > >>> thrift is for greater control on deserialization (the main benefit being
> > > > > > > > > > > > >>> avoiding copies) for implementations that have a Thrift parser that doesn't
> > > > > > > > > > > > >>> allow these optimizations.  In terms of a path forward here, I think we
> > > > > > > > > > > > >>> need to understand the performance and memory characteristics of each
> > > > > > > > > > > > >>> approach.  I agree, if there aren't substantial savings from having them be
> > > > > > > > > > > > >>> outside the page, then it just adds complexity.
> > > > > > > > > > > >>>
> > > > > > > > > > > >>> Thanks,
> > > > > > > > > > > >>> Micah
> > > > > > > > > > > >>>
> > > > > > > > > > > >>>
> > > > > > > > > > > >>>
> > > > > > > > > > > >>>
> > > > > > > > > > > >>>
> > > > > > > > > > > > >>> On Tue, May 28, 2024 at 7:06 AM Antoine Pitrou <[email protected]> wrote:
> > > > > > > > > > > >>>
> > > > > > > > > > > >>>>
> > > > > > > > > > > >>>> Hello Micah,
> > > > > > > > > > > >>>>
> > > > > > > > > > > >>>> First, kudos for doing this!
> > > > > > > > > > > >>>>
> > > > > > > > > > > > >>>> I like your attempt to put the "new" file metadata after the legacy
> > > > > > > > > > > > >>>> one in https://github.com/apache/parquet-format/pull/250, and I hope it
> > > > > > > > > > > > >>>> can actually be made to work (it requires current Parquet readers to
> > > > > > > > > > > > >>>> allow/ignore arbitrary padding at the end of the v1 Thrift metadata).
> > > > > > > > > > > >>>>
> > > > > > > > > > > > >>>> Some assorted comments on other changes that PR is doing:
> > > > > > > > > > > >>>>
> > > > > > > > > > > > >>>> - I'm biased, but I find it much cleaner to define new Thrift
> > > > > > > > > > > > >>>>   structures (FileMetadataV3, etc.), rather than painstakingly document
> > > > > > > > > > > > >>>>   which fields are to be omitted in V3. That would achieve three goals:
> > > > > > > > > > > > >>>>   1) make the spec easier to read (even though it would be physically
> > > > > > > > > > > > >>>>   longer); 2) make it easier to produce a conformant implementation
> > > > > > > > > > > > >>>>   (special rules increase the risks of misunderstandings and
> > > > > > > > > > > > >>>>   disagreements); 3) allow a later cleanup of the spec once we agree to
> > > > > > > > > > > > >>>>   get rid of V1 structs.
> > > > > > > > > > > >>>>
> > > > > > > > > > > > >>>> - The new encoding in that PR seems like it should be moved to a
> > > > > > > > > > > > >>>>   separate PR and be discussed in the encodings thread?
> > > > > > > > > > > >>>>
> > > > > > > > > > > > >>>> - I'm a bit skeptical about moving Thrift lists into data pages, rather
> > > > > > > > > > > > >>>>   than, say, just embed the corresponding Thrift serialization as
> > > > > > > > > > > > >>>>   binary fields for lazy deserialization.
> > > > > > > > > > > >>>>
> > > > > > > > > > > >>>> Regards
> > > > > > > > > > > >>>>
> > > > > > > > > > > >>>> Antoine.
> > > > > > > > > > > >>>>
> > > > > > > > > > > >>>>
> > > > > > > > > > > >>>>
> > > > > > > > > > > > >>>> On Mon, 27 May 2024 23:06:37 -0700
> > > > > > > > > > > > >>>> Micah Kornfield <emkornfield-re5jqeeqqe8avxtiumwx3w-xmd5yjdbdmrexy1tmh2...@public.gmane.org>
> > > > > > > > > > > > >>>> wrote:
> > > > > > > > > > > > >>>> > As a follow-up to the "V3" Discussions [1][2] I wanted to start a thread on
> > > > > > > > > > > > >>>> > improvements to the footer metadata.
> > > > > > > > > > > >>>> >
> > > > > > > > > > > > >>>> > Based on conversation so far, there have been a few proposals [3][4][5] to
> > > > > > > > > > > > >>>> > help better support files with wide schemas and many row-groups.  I think
> > > > > > > > > > > > >>>> > there are a lot of interesting ideas in each. It would be good to get
> > > > > > > > > > > > >>>> > further feedback on these to make sure we aren't missing anything and
> > > > > > > > > > > > >>>> > define a minimal first iteration for doing experimental benchmarking to
> > > > > > > > > > > > >>>> > prove out an approach.
> > > > > > > > > > > >>>> >
> > > > > > > > > > > > >>>> > I think the next steps would ideally be:
> > > > > > > > > > > > >>>> > 1.  Come to a consensus on the overall approach.
> > > > > > > > > > > > >>>> > 2.  Prototypes to Benchmark/test to validate the approaches defined (if we
> > > > > > > > > > > > >>>> > can't come to consensus in item #1, this might help choose a direction).
> > > > > > > > > > > > >>>> > 3.  Divide up any final approach into as fine-grained features as possible.
> > > > > > > > > > > > >>>> > 4.  Implement across parquet-java, parquet-cpp, parquet-rs (and any other
> > > > > > > > > > > > >>>> > implementations that we can get volunteers for).  Additionally, if new APIs
> > > > > > > > > > > > >>>> > are needed to make use of the new structure, it would be good to try to
> > > > > > > > > > > > >>>> > prototype against consumers of Parquet.
> > > > > > > > > > > >>>> >
> > > > > > > > > > > > >>>> > Knowing that we have enough people interested in doing #3 is critical to
> > > > > > > > > > > > >>>> > success, so if you have time to devote, it would be helpful to chime in
> > > > > > > > > > > > >>>> > here (I know some people already noted they could help in the original
> > > > > > > > > > > > >>>> > thread).
> > > > > > > > > > > >>>> >
> > > > > > > > > > > > >>>> > I think it is likely we will need either an in-person sync or another, more
> > > > > > > > > > > > >>>> > focused design document. I am happy to try to facilitate this
> > > > > > > > > > > > >>>> > (once we have a better sense of who wants to be involved and what time
> > > > > > > > > > > > >>>> > zones they are in I can schedule a sync if necessary).
> > > > > > > > > > > >>>> >
> > > > > > > > > > > >>>> > Thanks,
> > > > > > > > > > > >>>> > Micah
> > > > > > > > > > > >>>> >
> > > > > > > > > > > >>>> > [1]
> > > > > > > > > > >
> > > > > https://lists.apache.org/thread/5jyhzkwyrjk9z52g0b49g31ygnz73gxo
> > > > > > > > > > > >>>> > [2]
> > > > > > > > > > > >>>> >
> > > > > > > > > > > >>>>
> > > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > >
> > > >
> > >
> >
> https://docs.google.com/document/d/19hQLYcU5_r5nJB7GtnjfODLlSDiNS24GXAtKg9b0_ls/edit
> > > > >
> > > > > > > > >
> > > > > > > > > > > >>>> > [3]
> > > https://github.com/apache/parquet-format/pull/242
> > > > > > > > > > > >>>> > [4]
> > > https://github.com/apache/parquet-format/pull/248
> > > > > > > > > > > >>>> > [5]
> > > https://github.com/apache/parquet-format/pull/250
> > > > > > > > > > > >>>> >
> > > > > > > > > > > >>>>
> > > > > > > > > > > >>>>
> > > > > > > > > > > >>>>
> > > > > > > > > > > >>>>
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > >
> >
>

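For reference, the "binary fields for lazy deserialization" idea from the quoted discussion above could look roughly like the sketch below. This is only an illustration under stated assumptions: LazyList and DecodeLE32List are hypothetical names, and the little-endian uint32 payload merely stands in for whatever serialization (Thrift, a Parquet page) would really be embedded. The point is just that the footer parser keeps a view of the raw bytes and pays the decoding cost on first access.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string_view>
#include <vector>

// Hypothetical decoder: reads a run of little-endian uint32 values.
// It stands in for the real decoder of the embedded payload.
inline std::vector<uint32_t> DecodeLE32List(std::string_view bytes) {
  std::vector<uint32_t> out;
  out.reserve(bytes.size() / 4);
  for (std::size_t i = 0; i + 4 <= bytes.size(); i += 4) {
    uint32_t v = 0;
    for (int b = 0; b < 4; ++b) {
      v |= static_cast<uint32_t>(static_cast<unsigned char>(bytes[i + b]))
           << (8 * b);
    }
    out.push_back(v);
  }
  return out;
}

// Hypothetical wrapper: holds the raw bytes of one embedded list and
// decodes them only when Get() is first called.
class LazyList {
 public:
  explicit LazyList(std::string_view raw) : raw_(raw) {}

  // True once the payload has been decoded and cached.
  bool decoded() const { return cache_.has_value(); }

  // Decodes on first call, then returns the cached result.
  const std::vector<uint32_t>& Get() {
    if (!cache_) cache_ = DecodeLE32List(raw_);
    return *cache_;
  }

 private:
  std::string_view raw_;  // not owned; the footer buffer must outlive this
  std::optional<std::vector<uint32_t>> cache_;
};
```

A reader that never touches a given list never decodes it, and holding a string_view avoids copying the bytes out of the footer buffer. Whether that saving is substantial is exactly what the benchmarking mentioned in the thread would have to show.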