(2) would take unduly long only if the metadata decoder is not performant enough. The speed of the decoder strongly depends on the choice of encoding. If we choose flatbuffers, 100'000 columns would parse in a few ms with verification, or in far less time without it.
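To make that concrete, here is a minimal C++ sketch of the two paths, assuming a hypothetical flatbuffers footer schema with a root table FileMetadata that holds a vector of ColumnMetaData. The GetFileMetadata/VerifyFileMetadataBuffer helpers follow flatc's generated naming; none of this is the actual Parquet footer today, it is only an illustration of how verified vs. trusted access would behave.

#include <cstdint>
#include "flatbuffers/flatbuffers.h"
#include "file_metadata_generated.h"  // hypothetical flatc output for the proposed footer schema

// Verified path: one linear pass over the buffer checking offsets and bounds,
// roughly proportional to footer size (the "few ms" case for 100'000 columns).
// Trusted path: no pass at all; the root is just a pointer into the buffer and
// each ColumnMetaData is only decoded when it is accessed.
const ColumnMetaData* GetColumn(const uint8_t* buf, size_t len, int i, bool verify) {
  if (verify) {
    flatbuffers::Verifier verifier(buf, len);
    if (!VerifyFileMetadataBuffer(verifier)) return nullptr;  // corrupt footer
  }
  const FileMetadata* md = GetFileMetadata(buf);  // O(1), zero-copy
  return md->columns()->Get(i);                   // O(1) random access; nothing else is parsed
}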
Given the 4 steps above, the design space guides us towards:
- optimize for one fetch from object store
- zero/low overhead parsing of metadata
- zero/low post processing (optional)

If we stick to thrift, you are correct: we can't expect to parse all the metadata in a reasonable time just to resolve a handful of columns. We would need solutions like the ones proposed by Antoine/Micah where per-column metadata is parsed lazily. Those solutions have a drawback: when a lot of columns participate, they are still slow. This makes the design parameters I listed above superior, because they optimize for that case as well.

On Wed, Jun 5, 2024 at 9:30 PM Jan Finis <[email protected]> wrote:

> I see your point, but isn't it circular reasoning?
>
> The case that we are trying to solve is one where we have a lot of columns (say 100'000), making the decoding of the per-column metadata very expensive. In your enumeration, this means that (2) would take unduly long. (1) does not depend too much on the footer size, as it's mostly latency that you also pay for a small footer, so (2) becomes the dominating step. The PR proposes how we can get (2) fast by not even parsing the per-column metadata but keeping it unparsed and only parsing it for columns we're interested in. With this proposal, we couldn't even do (3) anymore, as the metadata isn't parsed.
>
> So yes, I agree with you that post processing after parsing is negligible, but we want to skip the parsing altogether and make it lazy, so we cannot even do post processing.
>
> Does that make sense, or did I misunderstand your point?
>
> Cheers
> Jan
>
> Am Mi., 5. Juni 2024 um 21:09 Uhr schrieb Alkis Evlogimenos <[email protected]>:
>
> > > We're aiming for O(1) random access here. If we required some post processing, then we would fail this goal.
> >
> > In practice what we want is things to be performant. Sometimes O(1) matters, sometimes not. Looking at some real numbers:
> >
> > Before we start decoding columns we need to do some metadata processing. To do this an engine typically does:
> > 1. fetch footer: smart engines do a speculative fetch and do this in 1 go. Cost from popular object stores? 50ms
> > 2. parse thrift footer: wide footers take 30-40ms.
> > 3. (optional: some engines do not do this at all) resolve thrift footer/transform to internal data structures: typically takes 100us
> > 4. match schema on resolved metadata: typically takes 100ns per column if O(1)
> >
> > (3) doing a pass over the metadata to guarantee (4) is O(1) does not fail the goal of being fast as long as the cost of doing (3) is a lot smaller than (1) + (2). In a future version, we would shrink footers by 2x and speed up parsing by 100x. Then the above would look like this:
> >
> > 1. 30ms
> > 2. 50us
> > 3. 100us
> > 4. 100ns/col
> >
> > It still doesn't matter if we do some lightweight postprocessing (3) given that fetching is so slow.
> >
> > On Wed, Jun 5, 2024 at 7:14 PM Jan Finis <[email protected]> wrote:
> >
> > > > That said, if we assume there is some post-processing done after deserializing FileMetaData, one can build the forward indexing from schema to column metadata *per rowgroup* with one pass. Such processing is very cheap: it shouldn't take more than a couple dozen micros.
> > >
> > > We're aiming for O(1) random access here. If we required some post processing, then we would fail this goal.
> > >
> > > > Even better if we leave the representation dense, plus make empty columns serialize to all default values of ColumnMetaData. For most serializers (flatbuffers, protobufs) this means zero cost on the wire. If we want to avoid data pages, smart parquet writers can write one page of all nulls and point all empty columns to the same page?
> > >
> > > Agree, this works. Then again, we wouldn't need schema_index anymore, so we're back to my initial proposal :).
> > >
> > > Avoiding writing of data pages for fully empty chunks is a different issue and I agree we can do this. Whether it saves so much is questionable though. I would guess an empty data page is not much larger than the `ColumnMetaData` representation in FlatBuffers, so we wouldn't save a lot. Parquet is already quite good at compressing an only-nulls page into just a few bytes. Leaving out these few bytes I would mostly consider a micro optimization and I would love to first see a benchmark showing that this optimization is really necessary.
> > >
> > > I guess with an O(1) random access format, a lot of the problems are already gone. Many empty columns were a problem in 2015 (the date from the mentioned issue [1]) when there were neither size statistics nor an O(1) random access metadata format.
> > >
> > > Cheers,
> > > Jan
> > >
> > > [1] https://issues.apache.org/jira/browse/PARQUET-183
> > >
> > > Am Mi., 5. Juni 2024 um 19:04 Uhr schrieb Micah Kornfield <[email protected]>:
> > >
> > > > > For most serializers (flatbuffers, protobufs) this means zero cost on the wire
> > > >
> > > > It is not quite zero size on the wire, but it is worth pointing out the SizeStatistics [1] contains all the information necessary to determine if a column is all nulls. Combined with statistics if they are exact, it also allows one to determine if a column is entirely a single value. If the size overhead is reasonable for those two elements, then I think the main consideration is whether we should be changing the spec at some point to make writing these columns entirely optional?
> > > >
> > > > Thanks,
> > > > Micah
> > > >
> > > > [1] https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L845
> > > >
> > > > On Tue, Jun 4, 2024 at 9:58 AM Alkis Evlogimenos <[email protected]> wrote:
> > > >
> > > > > The drawback with having the reverse mapping is that only empty in all row groups columns can be elided. Columns that are empty in some row groups can't. I do not have good stats to decide either way.
> > > > >
> > > > > That said, if we assume there is some post-processing done after deserializing FileMetaData, one can build the forward indexing from schema to column metadata *per rowgroup* with one pass. Such processing is very cheap: it shouldn't take more than a couple dozen micros.
> > > > >
> > > > > Even better if we leave the representation dense, plus make empty columns serialize to all default values of ColumnMetaData. For most serializers (flatbuffers, protobufs) this means zero cost on the wire. If we want to avoid data pages, smart parquet writers can write one page of all nulls and point all empty columns to the same page?
> > > > >
> > > > > On Tue, Jun 4, 2024 at 3:48 PM Jan Finis <[email protected]> wrote:
> > > > >
> > > > > > I would agree that at least for our use cases, this trade off would not be favorable, so we would rather always write some metadata for "empty" columns and therefore get random I/O into the columns array.
> > > > > >
> > > > > > If I understand the use case correctly though, then this is mostly meant for completely empty columns. Thus, storing this information per row group seems unnecessary.
> > > > > >
> > > > > > *So what about this alternative proposal that actually combines the advantages of both:*
> > > > > >
> > > > > > How about just turning things around: Instead of having a schema_index in the ColumnMetadata, we could have a column_metadata_index in the schema. If that index is missing/-1, then this signifies that the column is empty, so no metadata will be present for it. With this, we would get the best of both worlds: We would always have O(1) random I/O even in case of such empty columns (as we would use the column_metadata_index for the lookup) and we would not need to store any ColumnMetadata for empty columns.
> > > > > >
> > > > > > After given this a second thought, this also makes more sense in general. As the navigation direction is usually always from schema to metadata (not vice versa!), the schema should point us to the correct metadata instead of the metadata pointing us to the correct schema entry.
> > > > > >
> > > > > > (I'll post this suggestion also into the PR for reference)
> > > > > >
> > > > > > Cheers,
> > > > > > Jan
> > > > > >
> > > > > > Am Di., 4. Juni 2024 um 14:20 Uhr schrieb Antoine Pitrou <[email protected]>:
> > > > > >
> > > > > > > On Tue, 4 Jun 2024 10:52:54 +0200 Alkis Evlogimenos <[email protected]> wrote:
> > > > > > >
> > > > > > > > > Finally, one point I wanted to highlight here (I also mentioned it in the PR): If we want random access, we have to abolish the concept that the data in the columns array is in a different order than in the schema. Your PR [1] even added a new field schema_index for matching between ColumnMetaData and schema position, but this kills random access. If I want to read the third column in the schema, then do a O(1) random access into the third column chunk only to notice that it's schema index is totally different and therefore I need a full exhaustive search to find the column that actually belongs to the third column in the schema, then all our random access efforts are in vain.
> > > > > > > >
> > > > > > > > `schema_index` is useful to implement https://issues.apache.org/jira/browse/PARQUET-183 which is more and more prevalent as schemata become wider.
> > > > > > >
> > > > > > > But this means of scan of all column chunk metadata in a row group is required to know if a particular column exists there? Or am I missing something?
> > > > > > >
> > > > > > > Regards
> > > > > > >
> > > > > > > Antoine.
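To make the indexing question above concrete, here is a small sketch with simplified, hypothetical structs (not the actual Thrift definitions): a reverse schema_index stored on each chunk forces a scan of the row group's chunks to locate a given schema column, while a column_metadata_index stored on the schema element gives the O(1) lookup and lets empty columns be elided.

#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical, simplified stand-ins for the two proposals discussed above.
struct ColumnMetaData {
  int32_t schema_index;  // PR proposal: each chunk points back at its schema column
  // ... offsets, sizes, statistics ...
};
struct SchemaElement {
  int32_t column_metadata_index;  // counter-proposal: schema points at the chunk;
                                  // -1 means the column is empty in this row group
};

// With only schema_index, finding the chunk for schema column s needs a scan: O(#columns).
std::optional<size_t> FindChunkByScan(const std::vector<ColumnMetaData>& chunks, int32_t s) {
  for (size_t i = 0; i < chunks.size(); ++i) {
    if (chunks[i].schema_index == s) return i;
  }
  return std::nullopt;  // no chunk found: the column is absent (e.g. empty) in this row group
}

// With column_metadata_index, the lookup is O(1) and empty columns need no ColumnMetaData at all.
std::optional<size_t> FindChunkDirect(const std::vector<SchemaElement>& schema, int32_t s) {
  int32_t idx = schema[s].column_metadata_index;
  if (idx < 0) return std::nullopt;  // empty column: no metadata written
  return static_cast<size_t>(idx);
}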
