I see your point, but isn't that circular reasoning? The case we are trying to solve is one where we have a lot of columns (say 100,000), making the decoding of the per-column metadata very expensive. In your enumeration, this means that (2) would take unduly long. (1) does not depend much on the footer size, as it is mostly latency that you also pay for a small footer, so (2) becomes the dominating step. The PR proposes how we can make (2) fast by not even parsing the per-column metadata, but rather keeping it unparsed and only parsing it for the columns we're interested in. With this proposal, we couldn't even do (3) anymore, as the metadata isn't parsed.
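(To make the lazy-parsing idea concrete, here is a minimal Python sketch. `LazyFooter` and the fixed-size blob encoding are made up for illustration; the real footer is Thrift/FlatBuffers, not this toy format. The point is only that decoding cost is paid per accessed column, not per column in the file:)

```python
import struct

class LazyFooter:
    """Toy footer: per-column metadata blobs stay as raw bytes and are
    only decoded when a specific column is requested (then memoized)."""

    def __init__(self, raw_blobs):
        # raw_blobs[i] is the still-serialized metadata of column i
        self._raw = raw_blobs
        self._parsed = {}  # column index -> decoded metadata

    def column(self, i):
        # O(1) jump to the raw blob; decode on first access only
        if i not in self._parsed:
            num_values, total_size = struct.unpack("<qq", self._raw[i])
            self._parsed[i] = {"num_values": num_values,
                               "total_compressed_size": total_size}
        return self._parsed[i]

# 100,000 columns, but we only pay decoding cost for the one we touch
blobs = [struct.pack("<qq", n, n * 8) for n in range(100_000)]
footer = LazyFooter(blobs)
print(footer.column(42))  # only column 42 is ever decoded
```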
So yes, I agree with you that post-processing after parsing is negligible, but we want to skip the parsing altogether and make it lazy, so we cannot do any post-processing either. Does that make sense, or did I misunderstand your point?

Cheers,
Jan

On Wed, Jun 5, 2024 at 9:09 PM Alkis Evlogimenos <[email protected]> wrote:

> > We're aiming for O(1) random access here. If we required some post
> > processing, then we would fail this goal.
>
> In practice what we want is for things to be performant. Sometimes O(1)
> matters, sometimes not. Looking at some real numbers:
>
> Before we start decoding columns we need to do some metadata processing.
> To do this, an engine typically does:
> 1. fetch footer: smart engines do a speculative fetch and do this in one
> go. Cost from popular object stores? ~50ms
> 2. parse thrift footer: wide footers take 30-40ms.
> 3. (optional: some engines do not do this at all) resolve the thrift
> footer / transform it to internal data structures: typically takes 100us
> 4. match schema on resolved metadata: typically takes 100ns per column if
> O(1)
>
> Doing a pass over the metadata in (3) to guarantee that (4) is O(1) does
> not fail the goal of being fast, as long as the cost of doing (3) is a
> lot smaller than (1) + (2). In a future version, we would shrink footers
> by 2x and speed up parsing by 100x. Then the above would look like this:
>
> 1. 30ms
> 2. 50us
> 3. 100us
> 4. 100ns/col
>
> It still doesn't matter if we do some lightweight postprocessing (3),
> given that fetching is so slow.
>
> On Wed, Jun 5, 2024 at 7:14 PM Jan Finis <[email protected]> wrote:
>
> > > That said, if we assume there is some post-processing done after
> > > deserializing FileMetaData, one can build the forward indexing from
> > > schema to column metadata *per rowgroup* with one pass. Such
> > > processing is very cheap: it shouldn't take more than a couple dozen
> > > micros.
> >
> > We're aiming for O(1) random access here.
> > If we required some post processing, then we would fail this goal.
> >
> > > Even better if we leave the representation dense, plus make empty
> > > columns serialize to all default values of ColumnMetaData. For most
> > > serializers (flatbuffers, protobufs) this means zero cost on the
> > > wire. If we want to avoid data pages, smart parquet writers can
> > > write one page of all nulls and point all empty columns to the same
> > > page?
> >
> > Agreed, this works. Then again, we wouldn't need schema_index anymore,
> > so we're back to my initial proposal :).
> >
> > Avoiding the writing of data pages for fully empty chunks is a
> > different issue, and I agree we can do this. Whether it saves much is
> > questionable, though. I would guess an empty data page is not much
> > larger than the `ColumnMetaData` representation in FlatBuffers, so we
> > wouldn't save a lot. Parquet is already quite good at compressing an
> > only-nulls page into just a few bytes. Leaving out these few bytes I
> > would mostly consider a micro-optimization, and I would love to first
> > see a benchmark showing that this optimization is really necessary.
> >
> > I guess with an O(1) random access format, a lot of the problems are
> > already gone. Many empty columns were a problem in 2015 (the date of
> > the mentioned issue [1]), when there were neither size statistics nor
> > an O(1) random access metadata format.
> >
> > Cheers,
> > Jan
> >
> > [1] https://issues.apache.org/jira/browse/PARQUET-183
> >
> > On Wed, Jun 5, 2024 at 7:04 PM Micah Kornfield <[email protected]>
> > wrote:
> >
> > > > For most serializers (flatbuffers, protobufs) this means zero cost
> > > > on the wire
> > >
> > > It is not quite zero size on the wire, but it is worth pointing out
> > > that the SizeStatistics [1] contains all the information necessary
> > > to determine if a column is all nulls.
> > > Combined with statistics, if they are exact, it also allows one to
> > > determine if a column is entirely a single value. If the size
> > > overhead is reasonable for those two elements, then I think the main
> > > consideration is whether we should change the spec at some point to
> > > make writing these columns entirely optional.
> > >
> > > Thanks,
> > > Micah
> > >
> > > [1]
> > > https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L845
> > >
> > > On Tue, Jun 4, 2024 at 9:58 AM Alkis Evlogimenos
> > > <[email protected]> wrote:
> > >
> > > > The drawback of having the reverse mapping is that only columns
> > > > that are empty in all row groups can be elided; columns that are
> > > > empty in only some row groups can't. I do not have good stats to
> > > > decide either way.
> > > >
> > > > That said, if we assume there is some post-processing done after
> > > > deserializing FileMetaData, one can build the forward indexing
> > > > from schema to column metadata *per rowgroup* with one pass. Such
> > > > processing is very cheap: it shouldn't take more than a couple
> > > > dozen micros.
> > > >
> > > > Even better if we leave the representation dense, plus make empty
> > > > columns serialize to all default values of ColumnMetaData. For
> > > > most serializers (flatbuffers, protobufs) this means zero cost on
> > > > the wire. If we want to avoid data pages, smart parquet writers
> > > > can write one page of all nulls and point all empty columns to the
> > > > same page?
> > > >
> > > > On Tue, Jun 4, 2024 at 3:48 PM Jan Finis <[email protected]> wrote:
> > > >
> > > > > I would agree that, at least for our use cases, this trade-off
> > > > > would not be favorable, so we would rather always write some
> > > > > metadata for "empty" columns and therefore get random I/O into
> > > > > the columns array.
> > > > > If I understand the use case correctly, though, then this is
> > > > > mostly meant for completely empty columns. Thus, storing this
> > > > > information per row group seems unnecessary.
> > > > >
> > > > > *So what about this alternative proposal that actually combines
> > > > > the advantages of both:*
> > > > >
> > > > > How about just turning things around: instead of having a
> > > > > schema_index in the ColumnMetadata, we could have a
> > > > > column_metadata_index in the schema. If that index is
> > > > > missing/-1, then this signifies that the column is empty, so no
> > > > > metadata will be present for it. With this, we would get the
> > > > > best of both worlds: we would always have O(1) random I/O, even
> > > > > in the case of such empty columns (as we would use the
> > > > > column_metadata_index for the lookup), and we would not need to
> > > > > store any ColumnMetadata for empty columns.
> > > > >
> > > > > After giving this a second thought, this also makes more sense
> > > > > in general. As the navigation direction is usually from schema
> > > > > to metadata (not vice versa!), the schema should point us to the
> > > > > correct metadata instead of the metadata pointing us to the
> > > > > correct schema entry.
> > > > >
> > > > > (I'll post this suggestion into the PR as well, for reference.)
> > > > >
> > > > > Cheers,
> > > > > Jan
> > > > >
> > > > > On Tue, Jun 4, 2024 at 2:20 PM Antoine Pitrou
> > > > > <[email protected]> wrote:
> > > > >
> > > > > > On Tue, 4 Jun 2024 10:52:54 +0200
> > > > > > Alkis Evlogimenos <[email protected]> wrote:
> > > > > >
> > > > > > > > Finally, one point I wanted to highlight here (I also
> > > > > > > > mentioned it in the PR): if we want random access, we have
> > > > > > > > to abolish the concept that the data in the columns array
> > > > > > > > is in a different order than in the schema. Your PR [1]
> > > > > > > > even added a new field schema_index for matching between
> > > > > > > > ColumnMetaData and schema position, but this kills random
> > > > > > > > access. If I want to read the third column in the schema,
> > > > > > > > then do an O(1) random access into the third column chunk,
> > > > > > > > only to notice that its schema index is totally different,
> > > > > > > > and therefore I need a full exhaustive search to find the
> > > > > > > > column that actually belongs to the third column in the
> > > > > > > > schema, then all our random access efforts are in vain.
> > > > > > >
> > > > > > > `schema_index` is useful to implement
> > > > > > > https://issues.apache.org/jira/browse/PARQUET-183, which is
> > > > > > > becoming more and more prevalent as schemata become wider.
> > > > > >
> > > > > > But this means a scan of all column chunk metadata in a row
> > > > > > group is required to know if a particular column exists there?
> > > > > > Or am I missing something?
> > > > > >
> > > > > > Regards
> > > > > >
> > > > > > Antoine.
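(To make the column_metadata_index proposal from the thread concrete, here is a toy Python sketch. The field name comes from the proposal itself, but the dict-based layout and `metadata_for` helper are made up for illustration; the real structures are Thrift/FlatBuffers. It shows how the schema-to-metadata lookup stays O(1) while empty columns carry no metadata at all:)

```python
# Toy model: each schema entry carries a column_metadata_index;
# -1 means "empty column, no metadata stored for it".
schema = [
    {"name": "a", "column_metadata_index": 0},
    {"name": "b", "column_metadata_index": -1},  # entirely empty column
    {"name": "c", "column_metadata_index": 1},
]

# Dense metadata array: only non-empty columns have an entry.
column_metadata = [
    {"num_values": 10, "total_compressed_size": 123},
    {"num_values": 10, "total_compressed_size": 456},
]

def metadata_for(schema_pos):
    """O(1): schema position -> metadata (or None for an empty column)."""
    idx = schema[schema_pos]["column_metadata_index"]
    return None if idx < 0 else column_metadata[idx]

print(metadata_for(1))  # empty column: no scan of all chunks needed
print(metadata_for(2))  # direct jump to the right metadata entry
```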
