Created PARQUET-1950 <https://issues.apache.org/jira/browse/PARQUET-1950> to track this. Feel free to comment in the jira or wait for the draft PR.
On Tue, Dec 8, 2020 at 8:06 PM Tim Armstrong <tarmstr...@cloudera.com.invalid> wrote:

I agree with Gabor's idea - I love getting into the nitty-gritty, but this probably isn't the right place.

On Tue, Dec 8, 2020 at 4:26 AM Gabor Szadovszky <ga...@apache.org> wrote:

Without discussing further what should be included in the list of core features, I would propose a framing of this idea.

Instead of commenting on the different objects in the thrift file (or in the other related documentation) as being "experimental" or so, I think it is clearer to have a separate doc in the parquet-format repo that defines the core features. (The thrift file is more about listing all the features we are able to use.) This new documentation is versioned by the parquet-format releases, so we have the exact list of core features for each release.

We can also introduce a new (optional) field in the thrift file (e.g. complianceLevel) where any implementation can write the parquet-format release whose core features it implements. (Using a separate field instead of an additional key-value makes it easier to reference the core features documentation from the thrift file.)

If you like this idea I am happy to create a jira and a PR where we can start/continue the discussion about the core features themselves.
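(For illustration only, a minimal sketch of how a reader might use the complianceLevel field proposed above. Neither the field nor this helper exists in parquet-format today; all names below are invented for the example.)

    import java.util.Set;

    public final class ComplianceCheck {
        // parquet-format releases whose core-feature lists this reader fully
        // implements (an assumption for the example).
        private static final Set<String> SUPPORTED_LEVELS = Set.of("2.8.0", "2.9.0");

        // True only if the writer declared a compliance level this reader
        // knows it can handle end to end.
        static boolean guaranteedReadable(String writerComplianceLevel) {
            // An absent field would mean the writer predates the proposal, so
            // the reader falls back to probing individual features as today.
            return writerComplianceLevel != null
                && SUPPORTED_LEVELS.contains(writerComplianceLevel);
        }

        public static void main(String[] args) {
            System.out.println(guaranteedReadable("2.9.0")); // true
            System.out.println(guaranteedReadable(null));    // false: pre-proposal file
        }
    }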
On Mon, Dec 7, 2020 at 9:00 PM Tim Armstrong <tarmstr...@cloudera.com.invalid> wrote:

> Introducing new logical types as "experimental" is a bit tricky.

Maybe experimental is a bad term. I think mostly new features in the format do need to be backwards compatible and not buggy, because data lasts a long time once it's written. Maybe "incubating" or "preview" is a better term. I guess it's important that a) they are not automatically enabled, so users don't accidentally write out data other tools can't read, and b) the format is clear that readers implementing version x don't need to be able to read data that uses incubating features from version x in order to be compliant.

> One more thing about the core features. If we finally agree on a set of these how do we put new features in after a while? I mean how do we "version" the list of core features? Semantic versioning can be an obvious choice and we can even use the version of parquet-format. In this case we enforce the implementers to implement all the new core features if they want to keep track of the versions. Is that what we want?

I think that would be an improvement from my point of view if we released a new version of parquet-format with fewer core features but clear wording that the core features *must* be implemented. That could add a new stick for implementors - they are not in compliance with the standard if they can't read all core features or if they write non-compliant data. But there's also a carrot - you can claim to be in compliance with the standard and you don't need to deal with as many edge cases from other writers.

E.g. for Impala it would be great for us to know exactly what our gaps are in reading, instead of having to add new support when it comes up. But I think we'd have to do some things to be better Parquet citizens, e.g. writing int64 instead of int96 timestamps by default.

> Could you expand on this?

I was probably overstating this. I took another look and it seems like it's probably still a defensible decision, but I'm not sure that it's the ideal choice. It gets decent compression, it's fast to encode/decode with SIMD and it's simpler than some alternatives. It was made 8 years ago and I'm not sure of the exact rationale - https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-encoding-delta_binary_packed--5

The paper it's taken from actually shows that some other encodings that handle exceptional values via patching (their PFOR schemes) achieve better compression on realistic data. https://people.csail.mit.edu/jshun/6886-s19/lectures/lecture19-1.pdf has a more accessible summary of this. I thought there had been some more improvements in the literature since 2012, but I wasn't able to find much, so I may have been misremembering. I did come across this paper about implementing the Parquet delta decoding more efficiently, but that didn't require changes to the format - https://newtraell.cs.uchicago.edu/files/ms_paper/hajiang.pdf
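(To make the outlier point concrete, a minimal sketch - not the real parquet-mr encoder, which zigzag-encodes the minimum delta and packs several miniblocks per block - showing how a single exceptional value widens the packed bit width for its whole block; this is exactly the case PFOR-style patching handles better.)

    public final class DeltaWidthDemo {
        // Bits needed to represent maxValue (at least 1 for this demo).
        static int bitWidth(long maxValue) {
            return 64 - Long.numberOfLeadingZeros(maxValue | 1);
        }

        public static void main(String[] args) {
            long[] values = {100, 101, 102, 103, 1_000_000, 105, 106, 107};

            // DELTA_BINARY_PACKED stores the minimum delta once, then
            // bit-packs (delta - minDelta) for the block.
            long minDelta = Long.MAX_VALUE;
            long[] deltas = new long[values.length - 1];
            for (int i = 1; i < values.length; i++) {
                deltas[i - 1] = values[i] - values[i - 1];
                minDelta = Math.min(minDelta, deltas[i - 1]);
            }
            long maxAdjusted = 0;
            for (long delta : deltas) {
                maxAdjusted = Math.max(maxAdjusted, delta - minDelta);
            }

            // Without the 1_000_000 outlier every adjusted delta is 0 and
            // packs into ~1 bit; with it, all seven deltas pack at 21 bits.
            System.out.println("bits per value: " + bitWidth(maxAdjusted)); // 21
        }
    }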
On Mon, Dec 7, 2020 at 7:57 AM Gabor Szadovszky <ga...@apache.org> wrote:

I agree on separating out the features that are not widely used, to make the life of the implementers easier and to improve compatibility between these implementations. Meanwhile, it is not always clear how to define the core features. For example, the encryption feature will be released soon in parquet-mr and I guess parquet-cpp as well (if not released yet). Would it be part of the core features already?

Another thing is compression. Which codecs would be part of the core features and which would we like to drop (moving them to experimental)?

We can skip features like column indexes from the core ones, as they are not required to read the data written in the file. Meanwhile, if they are never part of the core features we might generate a lot of files without the related metadata, hurting the read performance of the implementations where the feature is implemented. (BTW, for column indexes we already have the two cross-tested implementations.)

Introducing new logical types as "experimental" is a bit tricky. In parquet-mr we do not give too much help to our consumers for logical types. For example, for a UUID logical type we still give the user a Binary object instead of a Java UUID. So waiting for other implementations does not really make sense. I also don't think it would be a good idea to keep these logical types "experimental" until some of our consumers start using them. Nobody wants to start using and invest in experimental features.

I think encodings are an easier topic, as they aren't visible to the outside (like the file schema is). But we still have to list the ones we would like to have in the "core features". From the parquet-mr point of view the "V1" encodings are used widely while the "V2" ones are not, so I would move the latter (RLE = 3, DELTA_BYTE_ARRAY = 7, DELTA_BINARY_PACKED = 5, RLE_DICTIONARY = 8) to the experimental category. (NOTE: For the parquet-mr implementation I would suggest supporting the experimental encodings by adding a property where the user can set the encodings per column, accepting the risk of being incompatible with other implementations.)

One more thing about the core features. If we finally agree on a set of these, how do we put new features in after a while? I mean, how do we "version" the list of core features? Semantic versioning can be an obvious choice and we can even use the version of parquet-format. In this case we enforce the implementers to implement all the new core features if they want to keep track of the versions. Is that what we want?
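(A sketch of what the per-column opt-in in Gabor's NOTE above could look like. The property name is invented - parquet-mr has no such setting - and is shown only to illustrate the idea.)

    import org.apache.hadoop.conf.Configuration;

    public final class ExperimentalEncodingOptIn {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Hypothetical property: opt a single column into an experimental
            // encoding while the rest of the file keeps the widely-read V1 set.
            conf.set("parquet.writer.encoding-override.event_ts", "DELTA_BINARY_PACKED");
            // A writer honouring this would use DELTA_BINARY_PACKED for column
            // "event_ts" only, accepting that some readers may reject the file.
            System.out.println(conf.get("parquet.writer.encoding-override.event_ts"));
        }
    }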
On Mon, Dec 7, 2020 at 6:16 AM Micah Kornfield <emkornfi...@gmail.com> wrote:

I tried to get some level of clarification in a PR [1]. It kind of stalled because we had further conversation on a sync call a while ago that I have not had a chance to follow up on. I'm happy to revise if we can come up with some sort of consensus for experimental/non-experimental.

The things that were decided on the sync call were:
1. Does it make sense to deprecate data page v2? It sounded like this isn't widely used and the main benefits are either unused [uncompressed rep/def levels], or are already required by other features (e.g. record-level alignment within a data page is required with indices, IIUC).
2. Potentially accepting that all encodings in V1 and V2 are acceptable for V1 data pages.

I think there is a lot more conversation to have here. But I agree with everything said on this thread: we need more clarity on what is required/not required for implementations.

> e.g. I don't think the delta encoding reflects current best practices in the literature.

Could you expand on this?

[1] https://github.com/apache/parquet-format/pull/163

On Fri, Dec 4, 2020 at 3:57 PM Antoine Pitrou <solip...@pitrou.net> wrote:

On Fri, 4 Dec 2020 11:21:58 -0800 Tim Armstrong <tarmstr...@cloudera.com.INVALID> wrote:
> I probably didn't say it very clearly, but my opinion as a consumer of the Parquet spec is that the format needs a reset where encodings, logical types and other metadata that are not widely adopted are removed from the core spec and put in an "experimental" category from which we can later promote things based on community consensus.

I'm not a participant in this community, but I'm one of the people who maintain parquet-cpp as part of the Arrow C++ codebase, and your proposal sounds like a very good idea to me.

Even the core parts of Parquet can be tricky to implement efficiently; adding more and more features and encodings is IMHO very dangerous for the format's portability.

Regards
Antoine.

On Fri, Dec 4, 2020 at 11:14 AM Tim Armstrong <tarmstr...@cloudera.com> wrote:

I think it would be good for the project to define a core set of features that a Parquet implementation must support to be able to correctly read all files written by another compliant writer with the same version.

There are then additional extensions, like page indices, that are not required to actually be able to read/write the files correctly.

My understanding of how we got here is that the Parquet 2.0 spec defined a number of new features and encodings that were only implemented in parquet-mr. Many of these have not gotten widely implemented, and I'm also not sure that the encodings are particularly well designed, e.g. I don't think the delta encoding reflects current best practices in the literature.

E.g. in Impala we did not implement many of the new Parquet 2.0 encodings because it was a significant amount of work and we had a lot of competing priorities. It's now a chicken-and-egg problem because there isn't much data encoded that way out there and thus not much motivation to invest in the work. We've had some similar headaches with logical types and encodings, e.g. where the standard allows several degrees of freedom in how to encode DECIMALs, including ones that don't make much sense. We've wasted significant time adding support for DECIMALs that were written in strange ways by another writer.

In the meantime we did actually need to add other new features to Parquet, like page indices, so we moved forward with those, which is part of how it's turned into a bit of a matrix.

My suggestion is to go through the encodings and features and flag the ones that are not universally implemented as "EXPERIMENTAL" or something similar, and make it clear that they are optional and may be deprecated if they don't get adopted. I think that would give implementations a cleaner baseline to compare against, then we can start adding core features on top of that.

IMO we shouldn't promote features to being part of the core spec until there are at least 2 reference implementations that work with each other.
That way we can use the reference implementations to clarify any bugs in the spec.

This is all kinda hard and a little thankless, so I don't blame anyone for not taking it on yet!

> * The thrift definition says that this is "Byte offset in file_path to the ColumnMetaData" but this doesn't make too much sense, as the ColumnMetaData is contained inside the ColumnChunk itself (and therefore, you cannot even know its offset when writing the column chunk).

It probably makes sense to update the spec to reflect what actual implementations do.

> * This is defined to be int16_t, which can quickly overflow for very large parquet files or smaller row groups. What should I do if I anticipate that my library will be used to write files where this will overflow? Just use it for the first 2^15 row groups and then leave it out? Or don't write it at all for any row group?

16k+ row groups in a file seems goofy to me. I can't think of a way that you could hit that limit with a reasonable file layout. Row groups are meant to be big enough that you amortise the metadata overhead and get good compression over a column chunk. I'd expect row groups to be 10s of MB at least, typically. Maybe this would be something that should be clarified in the spec or other docs.
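(Back-of-envelope arithmetic for the point above; the 64 MB row-group size is an assumption for the example, not a spec value.)

    public final class OrdinalOverflow {
        public static void main(String[] args) {
            long rowGroupBytes = 64L * 1024 * 1024; // assumed typical row-group size
            long maxOrdinal = Short.MAX_VALUE;      // 32767, the int16 limit
            double fileTerabytes = rowGroupBytes * maxOrdinal / 1e12;
            // With reasonably sized row groups, the ordinal only overflows in
            // a single file of roughly this size:
            System.out.printf("int16 ordinal overflows after ~%.1f TB%n", fileTerabytes);
        }
    }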
On Fri, Oct 16, 2020 at 3:46 PM Micah Kornfield <emkornfi...@gmail.com> wrote:

> IMHO, shouldn't the spec mention - quite precisely - what versions exist and what features can be used in which version, so an implementation can say "yes, I can fully write these versions" or "no, I can't", instead of having a fuzzy set of features where some are described to "not work on most readers"? Even if such a precise mapping isn't defined today, shouldn't it be defined at least in retrospect, so that implementations can start checking and documenting what versions they can write?

Just an FYI, I've already raised these concerns recently [1]. But I completely agree this is needed. I'd offer to update the docs, but I don't really have the historical context to do it.

[1] https://mail-archives.apache.org/mod_mbox/parquet-dev/202010.mbox/%3CCAK7Z5T8c4Zpwj_yWZ9Kr4gTueGrMq%2B0NAqADJ7rnkBaUqtPwuQ%40mail.gmail.com%3E

On Fri, Oct 16, 2020 at 3:27 PM Jan Finis <jfi...@tableau.com.invalid> wrote:

Hey folks,

First of all, thanks for this great project!

I am currently writing a library for reading/writing parquet, and I am a bit confused by some points, which I would like to discuss here. I think they will be relevant to anyone wanting to write their own parquet reading/writing logic in a new language.

MetaData:

There are some fields in the metadata whose semantics are unclear. Can you clarify these?

ColumnChunk.file_offset:

* The thrift definition says that this is "Byte offset in file_path to the ColumnMetaData", but this doesn't make too much sense, as the ColumnMetaData is contained inside the ColumnChunk itself (and therefore, you cannot even know its offset when writing the column chunk). parquet-mr seems to just write the offset of the first data/dict page into this field, which doesn't seem to comply with what the spec says (but is at least possible). What is this field supposed to be used for? My reader just ignores it, but my writer should make sure to write the most sensible value in here, in case some reader relies on it.

RowGroup.ordinal:

* This seems to be just the ordinal of the row group. However, this information seems redundant, as the row groups' metadata is already stored in a specific order in the footer. Should the ordinal just be equal to that order, or can it differ?
* Will any reader rely on this, since it's optional?
* This is defined to be int16_t, which can quickly overflow for very large parquet files or smaller row groups. What should I do if I anticipate that my library will be used to write files where this will overflow? Just use it for the first 2^15 row groups and then leave it out? Or don't write it at all for any row group?
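(For the file_offset question above, a defensive reader-side sketch, assuming the thrift-generated org.apache.parquet.format classes: derive the chunk's start from ColumnMetaData's page offsets instead of trusting ColumnChunk.file_offset, which is roughly what readers that ignore the field end up doing.)

    import org.apache.parquet.format.ColumnChunk;
    import org.apache.parquet.format.ColumnMetaData;

    public final class ChunkStart {
        // Where the column chunk's bytes actually begin: the dictionary page
        // if present, otherwise the first data page. ColumnChunk.file_offset
        // is ignored on purpose, since writers disagree about its meaning.
        static long chunkStart(ColumnChunk chunk) {
            ColumnMetaData md = chunk.getMeta_data();
            long start = md.getData_page_offset();
            if (md.isSetDictionary_page_offset()) {
                start = Math.min(start, md.getDictionary_page_offset());
            }
            return start;
        }
    }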
Compatibility:

1. For writing: What compatibility versions are there? The spec talks a lot about compatibility and features that not all readers can read, but it never specifies things like "this encoding is version X.Y and upwards". So, when writing a parquet file, I have some problems choosing features. E.g.,
* should I write DataPageV1 or DataPageV2?
* Should I use DELTA_BYTE_ARRAY/DELTA_BINARY_PACKED?
* Should I use BYTE_STREAM_SPLIT?

2. For reading: I have implemented all encodings except BIT_PACKED, which seems to have been deprecated for a long time (and would require all the bitunpack-on-big-endian logic, which would be a lot of work). How safe can I be that this encoding is no longer used? When was it last used? Since when has it been deprecated?

IMHO, shouldn't the spec mention - quite precisely - what versions exist and what features can be used in which version, so an implementation can say "yes, I can fully write these versions" or "no, I can't", instead of having a fuzzy set of features where some are described to "not work on most readers"? Even if such a precise mapping isn't defined today, shouldn't it be defined at least in retrospect, so that implementations can start checking and documenting what versions they can write?

To give you an example where I already encountered this: parquet-cpp doesn't seem to be able to read DELTA_BINARY_PACKED yet, so any software built on top of it (e.g., pyarrow) cannot read such files. I created such a file with parquet-mr and was very surprised when pyarrow couldn't read it.

Source of truth:

In general, what is the agreed-upon source of truth for parquet? Is it the documentation in parquet-format, or is it the implementation in parquet-mr? These differ sometimes, so which one should I adhere to if they do?

Thanks in advance for any answers.
Cheers,
Jan
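(One conservative reading of this thread's advice on the writing question, as a sketch: until a core-features doc exists, default to V1 data pages and the V1 encodings. The whitelist below is this sketch's assumption, not a Parquet ruling; Encoding is the thrift-generated enum from parquet-format.)

    import org.apache.parquet.format.Encoding;
    import java.util.EnumSet;
    import java.util.Set;

    public final class ConservativeDefaults {
        // Encodings a cautious writer might emit by default (an assumption
        // drawn from this thread, not from the spec).
        static final Set<Encoding> WIDELY_READ = EnumSet.of(
            Encoding.PLAIN,              // baseline; every reader handles it
            Encoding.RLE,                // needed anyway for rep/def levels
            Encoding.PLAIN_DICTIONARY);  // the V1 dictionary encoding

        static boolean safeToWrite(Encoding e) {
            return WIDELY_READ.contains(e);
        }

        public static void main(String[] args) {
            System.out.println(safeToWrite(Encoding.DELTA_BINARY_PACKED)); // false
        }
    }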