Created PARQUET-1950 <https://issues.apache.org/jira/browse/PARQUET-1950> to track this. Feel free to comment in the jira or wait for the draft PR.
On Tue, Dec 8, 2020 at 8:06 PM Tim Armstrong <tarmstr...@cloudera.com.invalid> wrote:

I agree with Gabor's idea - I love getting into the nitty-gritty, but this probably isn't the right place.

On Tue, Dec 8, 2020 at 4:26 AM Gabor Szadovszky <ga...@apache.org> wrote:

Without discussing further what should be included in the list of core features, I would propose a framing of this idea.

Instead of commenting on the different objects in the thrift file (or in the other related documentation) as being "experimental" or so, I think it is clearer to have a separate doc in the parquet-format repo that defines the core features. (The thrift file is more about listing all the features we are able to use.) This new documentation is versioned by the parquet-format releases, so we have the exact list of core features for each release.

We can also introduce a new (optional) field in the thrift file (e.g. complianceLevel) where any implementation can write the parquet-format release whose core features it implements. (Using a separate field instead of an additional key-value makes it easier to reference the core features documentation from the thrift file.)

If you like this idea I am happy to create a jira and a PR where we can start/continue the discussion about the core features themselves.
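(For illustration only, a minimal sketch of how a reader might use the complianceLevel field proposed above. Neither the field nor this helper exists in parquet-format today; all names below are invented for the example.)

    import java.util.Set;

    public final class ComplianceCheck {
        // parquet-format releases whose core-feature lists this reader fully
        // implements (an assumption for the example).
        private static final Set<String> SUPPORTED_LEVELS = Set.of("2.8.0", "2.9.0");

        // True only if the writer declared a compliance level this reader
        // knows it can handle end to end.
        static boolean guaranteedReadable(String writerComplianceLevel) {
            // An absent field would mean the writer predates the proposal, so
            // the reader falls back to probing individual features as today.
            return writerComplianceLevel != null
                && SUPPORTED_LEVELS.contains(writerComplianceLevel);
        }

        public static void main(String[] args) {
            System.out.println(guaranteedReadable("2.9.0")); // true
            System.out.println(guaranteedReadable(null));    // false: pre-proposal file
        }
    }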
On Mon, Dec 7, 2020 at 9:00 PM Tim Armstrong <tarmstr...@cloudera.com.invalid> wrote:

> Introducing new logical types as "experimental" is a bit tricky.

Maybe experimental is a bad term. I think mostly new features in the format do need to be backwards compatible and not buggy, because data lasts a long time once it's written. Maybe "incubating" or "preview" is a better term. I guess it's important that a) they are not automatically enabled, so users don't accidentally write out data other tools can't read, and b) the format is clear that readers implementing version x don't need to be able to read data that uses incubating features from version x in order to be compliant.

> One more thing about the core features. If we finally agree on a set of these how do we put new features in after a while? I mean how do we "version" the list of core features? Semantic versioning can be an obvious choice and we can even use the version of parquet-format. In this case we enforce the implementers to implement all the new core features if they want to keep track of the versions. Is that what we want?

I think that would be an improvement from my point of view if we released a new version of parquet-format with fewer core features but clear wording that the core features *must* be implemented. That could add a new stick for implementors - they are not in compliance with the standard if they can't read all core features or if they write non-compliant data. But there's also a carrot - you can claim to be in compliance with the standard and you don't need to deal with as many edge cases from other writers.

E.g. for Impala it would be great for us to know exactly what our gaps are in reading, instead of having to add new support when it comes up. But I think we'd have to do some things to be better Parquet citizens, e.g. writing int64 instead of int96 timestamps by default.

> Could you expand on this?

I was probably overstating this. I took another look and it seems like it's probably still a defensible decision, but I'm not sure that it's the ideal choice. It gets decent compression, it's fast to encode/decode with SIMD and it's simpler than some alternatives. It was made 8 years ago and I'm not sure of the exact rationale - https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-encoding-delta_binary_packed--5

The paper it's taken from actually shows that some other encodings that handle exceptional values via patching (their PFOR schemes) achieve better compression on realistic data. https://people.csail.mit.edu/jshun/6886-s19/lectures/lecture19-1.pdf has a more accessible summary of this. I thought there had been some more improvements in the literature since 2012, but I wasn't able to find much, so I may have been misremembering. I did come across this paper about implementing the Parquet delta decoding more efficiently, but that didn't require changes to the format - https://newtraell.cs.uchicago.edu/files/ms_paper/hajiang.pdf
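(To make the outlier point concrete, a minimal sketch - not the real parquet-mr encoder, which zigzag-encodes the minimum delta and packs several miniblocks per block - showing how a single exceptional value widens the packed bit width for its whole block; this is exactly the case PFOR-style patching handles better.)

    public final class DeltaWidthDemo {
        // Bits needed to represent maxValue (at least 1 for this demo).
        static int bitWidth(long maxValue) {
            return 64 - Long.numberOfLeadingZeros(maxValue | 1);
        }

        public static void main(String[] args) {
            long[] values = {100, 101, 102, 103, 1_000_000, 105, 106, 107};

            // DELTA_BINARY_PACKED stores the minimum delta once, then
            // bit-packs (delta - minDelta) for the block.
            long minDelta = Long.MAX_VALUE;
            long[] deltas = new long[values.length - 1];
            for (int i = 1; i < values.length; i++) {
                deltas[i - 1] = values[i] - values[i - 1];
                minDelta = Math.min(minDelta, deltas[i - 1]);
            }
            long maxAdjusted = 0;
            for (long delta : deltas) {
                maxAdjusted = Math.max(maxAdjusted, delta - minDelta);
            }

            // Without the 1_000_000 outlier every adjusted delta is 0 and
            // packs into ~1 bit; with it, all seven deltas pack at 21 bits.
            System.out.println("bits per value: " + bitWidth(maxAdjusted)); // 21
        }
    }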
On Mon, Dec 7, 2020 at 7:57 AM Gabor Szadovszky <ga...@apache.org> wrote:

I agree on separating out the features that are not widely used, to make the life of the implementers easier and to improve compatibility between these implementations. Meanwhile, it is not always clear how to define the core features. For example, the encryption feature will be released soon in parquet-mr and I guess parquet-cpp as well (if not released yet). Would it be part of the core features already?

Another thing is compression. Which codecs would be part of the core features and which would we like to drop (moving them to experimental)?

We can skip features like column indexes from the core ones, as they are not required to read the data written in the file. Meanwhile, if they are never part of the core features we might generate a lot of files without the related metadata, hurting the read performance of the implementations where the feature is implemented. (BTW, for column indexes we already have the two cross-tested implementations.)

Introducing new logical types as "experimental" is a bit tricky. In parquet-mr we do not give too much help to our consumers for logical types. For example, for a UUID logical type we still give the user a Binary object instead of a Java UUID. So waiting for other implementations does not really make sense. I also don't think it would be a good idea to keep these logical types "experimental" until some of our consumers start using them. Nobody wants to start using and invest in experimental features.

I think encodings are an easier topic, as they aren't visible to the outside (like the file schema is). But we still have to list the ones we would like to have in the "core features". From the parquet-mr point of view the "V1" encodings are used widely while the "V2" ones are not, so I would move the latter (RLE = 3, DELTA_BYTE_ARRAY = 7, DELTA_BINARY_PACKED = 5, RLE_DICTIONARY = 8) to the experimental category. (NOTE: For the parquet-mr implementation I would suggest supporting the experimental encodings by adding a property where the user can set the encodings per column, accepting the risk of being incompatible with other implementations.)

One more thing about the core features. If we finally agree on a set of these, how do we put new features in after a while? I mean, how do we "version" the list of core features? Semantic versioning can be an obvious choice and we can even use the version of parquet-format. In this case we enforce the implementers to implement all the new core features if they want to keep track of the versions. Is that what we want?
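(A sketch of what the per-column opt-in in Gabor's NOTE above could look like. The property name is invented - parquet-mr has no such setting - and is shown only to illustrate the idea.)

    import org.apache.hadoop.conf.Configuration;

    public final class ExperimentalEncodingOptIn {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Hypothetical property: opt a single column into an experimental
            // encoding while the rest of the file keeps the widely-read V1 set.
            conf.set("parquet.writer.encoding-override.event_ts", "DELTA_BINARY_PACKED");
            // A writer honouring this would use DELTA_BINARY_PACKED for column
            // "event_ts" only, accepting that some readers may reject the file.
            System.out.println(conf.get("parquet.writer.encoding-override.event_ts"));
        }
    }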
On Mon, Dec 7, 2020 at 6:16 AM Micah Kornfield <emkornfi...@gmail.com> wrote:

I tried to get some level of clarification in a PR [1]. It kind of stalled because we had further conversation on a sync call a while ago that I have not had a chance to follow up on. I'm happy to revise if we can come up with some sort of consensus for experimental/non-experimental.

The things that were decided on the sync call were:
1. Does it make sense to deprecate data page v2? It sounded like this isn't widely used and the main benefits are either unused [uncompressed rep/def levels], or are already required by other features (e.g. record-level alignment within a data page is required with indices, IIUC).
2. Potentially accepting that all encodings in V1 and V2 are acceptable for V1 data pages.

I think there is a lot more conversation to have here. But I agree with everything said on this thread: we need more clarity on what is required/not required for implementations.

> e.g. I don't think the delta encoding reflects current best practices in the literature.

Could you expand on this?

[1] https://github.com/apache/parquet-format/pull/163

On Fri, Dec 4, 2020 at 3:57 PM Antoine Pitrou <solip...@pitrou.net> wrote:

On Fri, 4 Dec 2020 11:21:58 -0800 Tim Armstrong <tarmstr...@cloudera.com.INVALID> wrote:
> I probably didn't say it very clearly, but my opinion as a consumer of the Parquet spec is that the format needs a reset where encodings, logical types and other metadata that are not widely adopted are removed from the core spec and put in an "experimental" category from which we can later promote things based on community consensus.

I'm not a participant in this community, but I'm one of the people who maintain parquet-cpp as part of the Arrow C++ codebase, and your proposal sounds like a very good idea to me.

Even the core parts of Parquet can be tricky to implement efficiently; adding more and more features and encodings is IMHO very dangerous for the format's portability.

Regards
Antoine.

On Fri, Dec 4, 2020 at 11:14 AM Tim Armstrong <tarmstr...@cloudera.com> wrote:

I think it would be good for the project to define a core set of features that a Parquet implementation must support to be able to correctly read all files written by another compliant writer with the same version.

There are then additional extensions, like page indices, that are not required to actually be able to read/write the files correctly.

My understanding of how we got here is that the Parquet 2.0 spec defined a number of new features and encodings that were only implemented in parquet-mr. Many of these have not gotten widely implemented, and I'm also not sure that the encodings are particularly well designed, e.g. I don't think the delta encoding reflects current best practices in the literature.

E.g. in Impala we did not implement many of the new Parquet 2.0 encodings because it was a significant amount of work and we had a lot of competing priorities. It's now a chicken-and-egg problem because there isn't much data encoded that way out there and thus not much motivation to invest in the work. We've had some similar headaches with logical types and encodings, e.g. where the standard allows several degrees of freedom in how to encode DECIMALs, including ones that don't make much sense. We've wasted significant time adding support for DECIMALs that were written in strange ways by another writer.

In the meantime we did actually need to add other new features to Parquet, like page indices, so we moved forward with those, which is part of how it's turned into a bit of a matrix.

My suggestion is to go through the encodings and features and flag the ones that are not universally implemented as "EXPERIMENTAL" or something similar, and make it clear that they are optional and may be deprecated if they don't get adopted. I think that would give implementations a cleaner baseline to compare against, then we can start adding core features on top of that.

IMO we shouldn't promote features to being part of the core spec until there are at least 2 reference implementations that work with each other.
That way we can use the reference implementations to clarify any bugs in the spec.

This is all kinda hard and a little thankless, so I don't blame anyone for not taking it on yet!

> * The thrift definition says that this is "Byte offset in file_path to the ColumnMetaData" but this doesn't make too much sense, as the ColumnMetaData is contained inside the ColumnChunk itself (and therefore, you cannot even know its offset when writing the column chunk).

It probably makes sense to update the spec to reflect what actual implementations do.

> * This is defined to be int16_t, which can quickly overflow for very large parquet files or smaller row groups. What should I do if I anticipate that my library will be used to write files where this will overflow? Just use it for the first 2^15 row groups and then leave it out? Or don't write it at all for any row group?

16k+ row groups in a file seems goofy to me. I can't think of a way that you could hit that limit with a reasonable file layout. Row groups are meant to be big enough that you amortise the metadata overhead and get good compression over a column chunk. I'd expect row groups to be 10s of MB at least, typically. Maybe this would be something that should be clarified in the spec or other docs.
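(Back-of-envelope arithmetic for the point above; the 64 MB row-group size is an assumption for the example, not a spec value.)

    public final class OrdinalOverflow {
        public static void main(String[] args) {
            long rowGroupBytes = 64L * 1024 * 1024; // assumed typical row-group size
            long maxOrdinal = Short.MAX_VALUE;      // 32767, the int16 limit
            double fileTerabytes = rowGroupBytes * maxOrdinal / 1e12;
            // With reasonably sized row groups, the ordinal only overflows in
            // a single file of roughly this size:
            System.out.printf("int16 ordinal overflows after ~%.1f TB%n", fileTerabytes);
        }
    }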
On Fri, Oct 16, 2020 at 3:46 PM Micah Kornfield <emkornfi...@gmail.com> wrote:

> IMHO, shouldn't the spec mention - quite precisely - what versions exist and what features can be used in which version, so an implementation can say "yes, I can fully write these versions" or "no, I can't", instead of having a fuzzy set of features where some are described to "not work on most readers"? Even if such a precise mapping isn't defined today, shouldn't it be defined at least in retrospect, so that implementations can start checking and documenting what versions they can write?

Just an FYI, I've already raised these concerns recently [1]. But I completely agree this is needed. I'd offer to update the docs, but I don't really have the historical context to do it.

[1] https://mail-archives.apache.org/mod_mbox/parquet-dev/202010.mbox/%3CCAK7Z5T8c4Zpwj_yWZ9Kr4gTueGrMq%2B0NAqADJ7rnkBaUqtPwuQ%40mail.gmail.com%3E

On Fri, Oct 16, 2020 at 3:27 PM Jan Finis <jfi...@tableau.com.invalid> wrote:

Hey folks,

First of all, thanks for this great project!

I am currently writing a library for reading/writing parquet, and I am a bit confused by some points, which I would like to discuss here. I think they will be relevant to anyone wanting to write their own parquet reading/writing logic in a new language.

MetaData:

There are some fields in the metadata whose semantics are unclear. Can you clarify these?

ColumnChunk.file_offset:

* The thrift definition says that this is "Byte offset in file_path to the ColumnMetaData", but this doesn't make too much sense, as the ColumnMetaData is contained inside the ColumnChunk itself (and therefore, you cannot even know its offset when writing the column chunk). parquet-mr seems to just write the offset of the first data/dict page into this field, which doesn't seem to comply with what the spec says (but is at least possible). What is this field supposed to be used for? My reader just ignores it, but my writer should make sure to write the most sensible value in here, in case some reader relies on it.

RowGroup.ordinal:

* This seems to be just the ordinal of the row group. However, this information seems redundant, as the row groups' metadata is already stored in a specific order in the footer. Should the ordinal just be equal to that order, or can it differ?
* Will any reader rely on this, since it's optional?
* This is defined to be int16_t, which can quickly overflow for very large parquet files or smaller row groups. What should I do if I anticipate that my library will be used to write files where this will overflow? Just use it for the first 2^15 row groups and then leave it out? Or don't write it at all for any row group?
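(For the file_offset question above, a defensive reader-side sketch, assuming the thrift-generated org.apache.parquet.format classes: derive the chunk's start from ColumnMetaData's page offsets instead of trusting ColumnChunk.file_offset, which is roughly what readers that ignore the field end up doing.)

    import org.apache.parquet.format.ColumnChunk;
    import org.apache.parquet.format.ColumnMetaData;

    public final class ChunkStart {
        // Where the column chunk's bytes actually begin: the dictionary page
        // if present, otherwise the first data page. ColumnChunk.file_offset
        // is ignored on purpose, since writers disagree about its meaning.
        static long chunkStart(ColumnChunk chunk) {
            ColumnMetaData md = chunk.getMeta_data();
            long start = md.getData_page_offset();
            if (md.isSetDictionary_page_offset()) {
                start = Math.min(start, md.getDictionary_page_offset());
            }
            return start;
        }
    }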
Compatibility:

1. For writing: What compatibility versions are there? The spec talks a lot about compatibility and features that not all readers can read, but it never specifies things like "this encoding is version X.Y and upwards". So, when writing a parquet file, I have some problems choosing features. E.g.,
* should I write DataPageV1 or DataPageV2?
* Should I use DELTA_BYTE_ARRAY/DELTA_BINARY_PACKED?
* Should I use BYTE_STREAM_SPLIT?

2. For reading: I have implemented all encodings except BIT_PACKED, which seems to have been deprecated for a long time (and would require all the bitunpack-on-big-endian logic, which would be a lot of work). How safe can I be that this encoding is no longer used? When was it last used? Since when has it been deprecated?

IMHO, shouldn't the spec mention - quite precisely - what versions exist and what features can be used in which version, so an implementation can say "yes, I can fully write these versions" or "no, I can't", instead of having a fuzzy set of features where some are described to "not work on most readers"? Even if such a precise mapping isn't defined today, shouldn't it be defined at least in retrospect, so that implementations can start checking and documenting what versions they can write?

To give you an example where I already encountered this: parquet-cpp doesn't seem to be able to read DELTA_BINARY_PACKED yet, so any software built on top of it (e.g., pyarrow) cannot read such files. I created such a file with parquet-mr and was very surprised when pyarrow couldn't read it.

Source of truth:

In general, what is the agreed-upon source of truth for parquet? Is it the documentation in parquet-format, or is it the implementation in parquet-mr? These differ sometimes, so which one should I adhere to if they do?

Thanks in advance for any answers.
Cheers,
Jan
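(One conservative reading of this thread's advice on the writing question, as a sketch: until a core-features doc exists, default to V1 data pages and the V1 encodings. The whitelist below is this sketch's assumption, not a Parquet ruling; Encoding is the thrift-generated enum from parquet-format.)

    import org.apache.parquet.format.Encoding;
    import java.util.EnumSet;
    import java.util.Set;

    public final class ConservativeDefaults {
        // Encodings a cautious writer might emit by default (an assumption
        // drawn from this thread, not from the spec).
        static final Set<Encoding> WIDELY_READ = EnumSet.of(
            Encoding.PLAIN,              // baseline; every reader handles it
            Encoding.RLE,                // needed anyway for rep/def levels
            Encoding.PLAIN_DICTIONARY);  // the V1 dictionary encoding

        static boolean safeToWrite(Encoding e) {
            return WIDELY_READ.contains(e);
        }

        public static void main(String[] args) {
            System.out.println(safeToWrite(Encoding.DELTA_BINARY_PACKED)); // false
        }
    }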