Re: [DISCUSS] schema_index

Alkis Evlogimenos Wed, 05 Jun 2024 12:10:22 -0700

>
> We're aiming for O(1) random access here. If we required some post
> processing, then we would fail this goal.



In practice what we want is things to be performant. Sometimes O(1)
matters, sometimes not. Looking at some real numbers:

Before we start decoding columns we need to do some metadata processing. To
do this an engine typically does:
1. fetch footer: smart engines do a speculative fetch and do this in 1 go.
Cost from popular object stores? 50ms
2. parse thrift footer: Wide footers take 30-40ms.
3. (optional: some engines do not do this at all) resolve thrift
footer/transform to internal data structures: typically takes 100us
4. match schema on resolved metadata: typically takes 100ns per column if
O(1)

(3) doing a pass over the metadata to guarantee (4) is O(1) does not fail
the goal of being fast as long as the cost of doing (3) is a lot smaller
than (1) + (2). In a future version, we would shrink footers by 2x and
speed up parsing by 100x. Then the above would look like this:

1. 30ms
2. 50us
3. 100us
4.  100ns/col

It still doesn't matter if we do some lightweight postprocessing (3) given
that fetching is so slow.


On Wed, Jun 5, 2024 at 7:14 PM Jan Finis <[email protected]> wrote:

> > That said, if we assume there is some post-processing done after
> > deserializing FileMetaData, one can build the forward indexing from
> schema
> > to column metadata *per rowgroup* with one pass. Such processing is very
> > cheap: it shouldn't take more than a couple dozen micros.
>
>
> We're aiming for O(1) random access here. If we required some post
> processing, then we would fail this goal.
>
> Even better if we leave the representation dense, plus make empty
> > columns serialize to all default values of ColumnMetaData. For most
> > serializers (flatbuffers, protobufs) this means zero cost on the wire. If
> > we want to avoid data pages, smart parquet writers can write one page of
> > all nulls and point all empty columns to the same page?
>
>
> Agree, this works. Then again, we wouldn't need schema_index anymore, so
> we're back to my initial proposal :).
>
> Avoiding writing of data pages for fully empty chunks is a different issue
> and I agree we can do this. Whether it saves so much is questionable
> though. I would guess an empty data page is not much larger than the
> `ColumnMetaData` representation in FlatBuffers, so we wouldn't save a lot.
> Parquet is already quite good at compressing an only-nulls page into just a
> few bytes. Leaving out these few bytes I would mostly consider a micro
> optimization and I would love to first see a benchmark showing that this
> optimization is really necessary.
>
> I guess with an O(1) random access format, a lot of the problems are
> already gone. Many empty columns were a problem in 2015 (the date from the
> mentioned issue [1]) when there were neither size statistics nor an O(1)
> random access metadata format.
>
> Cheers,
> Jan
>
> [1] https://issues.apache.org/jira/browse/PARQUET-183
>
> Am Mi., 5. Juni 2024 um 19:04 Uhr schrieb Micah Kornfield <
> [email protected]>:
>
> > >
> > > For most
> > > serializers (flatbuffers, protobufs) this means zero cost on the wire
> >
> >
> > It is not quite zero size on the wire, but it is worth pointing out the
> > SizeStatistics [1] contains all the information necessary to determine
> if a
> > column is all nulls.  Combined with statistics if they are exact, it also
> > allows one to determine if a column is entirely a single value.  If the
> > size overhead is reasonable for those two elements, then I think the main
> > consideration is whether we should be changing the spec at some point to
> > make writing these columns entirely optional?
> >
> > Thanks,
> > Micah
> >
> > [1]
> >
> >
> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L845
> >
> > On Tue, Jun 4, 2024 at 9:58 AM Alkis Evlogimenos
> > <[email protected]> wrote:
> >
> > > The drawback with having the reverse mapping is that only empty in all
> > row
> > > groups columns can be elided. Columns that are empty in some row groups
> > > can't. I do not have good stats to decide either way.
> > >
> > > That said, if we assume there is some post-processing done after
> > > deserializing FileMetaData, one can build the forward indexing from
> > schema
> > > to column metadata *per rowgroup* with one pass. Such processing is
> very
> > > cheap: it shouldn't take more than a couple dozen micros.
> > >
> > >
> > > Even better if we leave the representation dense, plus make empty
> > > columns serialize to all default values of ColumnMetaData. For most
> > > serializers (flatbuffers, protobufs) this means zero cost on the wire.
> If
> > > we want to avoid data pages, smart parquet writers can write one page
> of
> > > all nulls and point all empty columns to the same page?
> > >
> > > On Tue, Jun 4, 2024 at 3:48 PM Jan Finis <[email protected]> wrote:
> > >
> > > > I would agree that at least for our use cases, this trade off would
> not
> > > be
> > > > favorable, so we would rather always write some metadata for "empty"
> > > > columns and therefore get random I/O into the columns array.
> > > >
> > > > If I understand the use case correctly though, then this is mostly
> > meant
> > > > for completely empty columns. Thus, storing this information per row
> > > group
> > > > seems unnecessary.
> > > >
> > > > *So what about this alternative proposal that actually combines the
> > > > advantages of both:*
> > > >
> > > > How about just turning things around: Instead of having a
> schema_index
> > in
> > > > the ColumnMetadata, we could have a column_metadata_index in the
> > schema.
> > > If
> > > > that index is missing/-1, then this signifies that the column is
> empty,
> > > so
> > > > no metadata will be present for it. With this, we would get the best
> of
> > > > both worlds: We would always have O(1) random I/O even in case of
> such
> > > > empty columns (as we would use the column_metadata_index for the
> > lookup)
> > > > and we would not need to store any ColumnMetadata for empty columns.
> > > >
> > > > After given this a second thought, this also makes more sense in
> > general.
> > > > As the navigation direction is usually always from schema to metadata
> > > (not
> > > > vice versa!), the schema should point us to the correct metadata
> > instead
> > > of
> > > > the metadata pointing us to the correct schema entry.
> > > >
> > > > (I'll post this suggestion also into the PR for reference)
> > > >
> > > > Cheers,
> > > > Jan
> > > >
> > > >
> > > >
> > > >
> > > > Am Di., 4. Juni 2024 um 14:20 Uhr schrieb Antoine Pitrou <
> > > > [email protected]
> > > > >:
> > > >
> > > > > On Tue, 4 Jun 2024 10:52:54 +0200
> > > > > Alkis Evlogimenos
> > > > > <[email protected]>
> > > > > wrote:
> > > > > > >
> > > > > > > Finally, one point I wanted to highlight here (I also mentioned
> > it
> > > in
> > > > > the
> > > > > > > PR): If we want random access, we have to abolish the concept
> > that
> > > > the
> > > > > data
> > > > > > > in the columns array is in a different order than in the
> schema.
> > > Your
> > > > > PR
> > > > > > > [1] even added a new field schema_index for matching between
> > > > > > > ColumnMetaData and schema position, but this kills random
> access.
> > > If
> > > > I
> > > > > want
> > > > > > > to read the third column in the schema, then do a O(1) random
> > > access
> > > > > into
> > > > > > > the third column chunk only to notice that it's schema index is
> > > > totally
> > > > > > > different and therefore I need a full exhaustive search to find
> > the
> > > > > column
> > > > > > > that actually belongs to the third column in the schema, then
> all
> > > our
> > > > > > > random access efforts are in vain.
> > > > > >
> > > > > > `schema_index` is useful to implement
> > > > > > https://issues.apache.org/jira/browse/PARQUET-183 which is more
> > and
> > > > more
> > > > > > prevalent as schemata become wider.
> > > > >
> > > > > But this means of scan of all column chunk metadata in a row group
> is
> > > > > required to know if a particular column exists there? Or am I
> missing
> > > > > something?
> > > > >
> > > > > Regards
> > > > >
> > > > > Antoine.
> > > > >
> > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] schema_index

Reply via email to