Re: [DISCUSS] schema_index

Jan Finis Wed, 05 Jun 2024 10:14:43 -0700

> That said, if we assume there is some post-processing done after
> deserializing FileMetaData, one can build the forward indexing from schema
> to column metadata *per rowgroup* with one pass. Such processing is very
> cheap: it shouldn't take more than a couple dozen micros.



We're aiming for O(1) random access here. If we required some post
processing, then we would fail this goal.
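
To make the difference concrete, here is a minimal sketch. It is not taken from the PR: `ColumnChunk`, `build_forward_index`, and the list-based layout are hypothetical names used only for illustration. The post-processing variant has to touch every column chunk of a row group before the first lookup, whereas a dense, schema-ordered array can be indexed directly:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ColumnChunk:
    # Hypothetical stand-in for ColumnMetaData carrying a schema_index.
    schema_index: int
    num_values: int


def build_forward_index(chunks: List[ColumnChunk],
                        num_leaves: int) -> List[Optional[int]]:
    """One pass over a row group's chunks: schema position -> position in `chunks`.

    Costs O(number of chunks) per row group before any lookup can happen.
    """
    forward: List[Optional[int]] = [None] * num_leaves
    for pos, chunk in enumerate(chunks):
        forward[chunk.schema_index] = pos
    return forward


def lookup_with_index(chunks, forward, schema_pos):
    """Lookup after post-processing; None means the column was elided."""
    pos = forward[schema_pos]
    return None if pos is None else chunks[pos]


def lookup_dense(chunks, schema_pos):
    """Dense, schema-ordered layout: direct O(1) access, no preprocessing."""
    return chunks[schema_pos]
```

The one-pass build is cheap per row group, but it is still work proportional to the number of chunks, which is exactly what an O(1) format is meant to avoid.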

> Even better if we leave the representation dense, plus make empty
> columns serialize to all default values of ColumnMetaData. For most
> serializers (flatbuffers, protobufs) this means zero cost on the wire. If
> we want to avoid data pages, smart parquet writers can write one page of
> all nulls and point all empty columns to the same page?


Agree, this works. Then again, we wouldn't need schema_index anymore, so
we're back to my initial proposal :).
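
A rough sketch of what the dense variant could look like follows. The reduced `ColumnMetaData` field set, the `EMPTY` sentinel, and the helper names are assumptions made for this example, not the actual Thrift or FlatBuffers definitions:

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ColumnMetaData:
    # Hypothetical, heavily reduced field set. Every field has a default, so
    # an "empty" entry is simply ColumnMetaData().
    num_values: int = 0
    total_compressed_size: int = 0
    data_page_offset: int = 0


EMPTY = ColumnMetaData()


def dense_row_group(num_leaves: int,
                    present: Dict[int, ColumnMetaData]) -> List[ColumnMetaData]:
    """Build a dense columns array in schema order.

    `present` maps schema position -> metadata for non-empty columns; all
    other positions get the all-defaults entry.
    """
    return [present.get(i, EMPTY) for i in range(num_leaves)]


def is_empty(columns: List[ColumnMetaData], schema_pos: int) -> bool:
    """O(1): index by schema position, compare against the default entry."""
    return columns[schema_pos] == EMPTY
```

Because position i in the array always belongs to leaf i of the schema, no schema_index field is needed in either direction.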

Avoiding writing of data pages for fully empty chunks is a different issue
and I agree we can do this. Whether it saves much is questionable
though. I would guess an empty data page is not much larger than the
`ColumnMetaData` representation in FlatBuffers, so we wouldn't save a lot.
Parquet is already quite good at compressing an only-nulls page into just a
few bytes. Leaving out these few bytes I would mostly consider a micro
optimization and I would love to first see a benchmark showing that this
optimization is really necessary.

I guess with an O(1) random access format, a lot of the problems are
already gone. Many empty columns were a problem in 2015 (the date from the
mentioned issue [1]) when there were neither size statistics nor an O(1)
random access metadata format.
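
As a hedged sketch of how size statistics help here: as far as I understand the SizeStatistics definition, entry i of the definition level histogram counts the values with definition level i, and only values at the maximum definition level are non-null. The function name and the plain-list representation below are assumptions for illustration only:

```python
from typing import List, Optional


def is_all_null(definition_level_histogram: Optional[List[int]],
                max_definition_level: int) -> Optional[bool]:
    """Return True/False if decidable from the histogram, None if it is absent.

    A chunk is all nulls when no value reaches the maximum definition level.
    """
    if not definition_level_histogram:
        # Writers may omit the histogram; the caller then has to fall back
        # to other statistics such as the null count.
        return None
    return definition_level_histogram[max_definition_level] == 0
```

For a flat optional column (maximum definition level 1), a histogram of `[5, 0]` describes five nulls and no values, while `[2, 3]` describes two nulls and three values.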

Cheers,
Jan

[1] https://issues.apache.org/jira/browse/PARQUET-183

On Wed, Jun 5, 2024 at 7:04 PM Micah Kornfield <
[email protected]> wrote:

> >
> > For most
> > serializers (flatbuffers, protobufs) this means zero cost on the wire
>
>
> It is not quite zero size on the wire, but it is worth pointing out the
> SizeStatistics [1] contains all the information necessary to determine if a
> column is all nulls.  Combined with statistics if they are exact, it also
> allows one to determine if a column is entirely a single value.  If the
> size overhead is reasonable for those two elements, then I think the main
> consideration is whether we should be changing the spec at some point to
> make writing these columns entirely optional?
>
> Thanks,
> Micah
>
> [1]
>
> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L845
>
> On Tue, Jun 4, 2024 at 9:58 AM Alkis Evlogimenos
> <[email protected]> wrote:
>
> > The drawback with having the reverse mapping is that only columns that
> > are empty in all row groups can be elided. Columns that are empty in only
> > some row groups can't. I do not have good stats to decide either way.
> >
> > That said, if we assume there is some post-processing done after
> > deserializing FileMetaData, one can build the forward indexing from
> schema
> > to column metadata *per rowgroup* with one pass. Such processing is very
> > cheap: it shouldn't take more than a couple dozen micros.
> >
> >
> > Even better if we leave the representation dense, plus make empty
> > columns serialize to all default values of ColumnMetaData. For most
> > serializers (flatbuffers, protobufs) this means zero cost on the wire. If
> > we want to avoid data pages, smart parquet writers can write one page of
> > all nulls and point all empty columns to the same page?
> >
> > On Tue, Jun 4, 2024 at 3:48 PM Jan Finis <[email protected]> wrote:
> >
> > > I would agree that at least for our use cases, this trade off would not
> > be
> > > favorable, so we would rather always write some metadata for "empty"
> > > columns and therefore get random I/O into the columns array.
> > >
> > > If I understand the use case correctly though, then this is mostly
> meant
> > > for completely empty columns. Thus, storing this information per row
> > group
> > > seems unnecessary.
> > >
> > > *So what about this alternative proposal that actually combines the
> > > advantages of both:*
> > >
> > > How about just turning things around: Instead of having a schema_index
> in
> > > the ColumnMetadata, we could have a column_metadata_index in the
> schema.
> > If
> > > that index is missing/-1, then this signifies that the column is empty,
> > so
> > > no metadata will be present for it. With this, we would get the best of
> > > both worlds: We would always have O(1) random I/O even in case of such
> > > empty columns (as we would use the column_metadata_index for the
> lookup)
> > > and we would not need to store any ColumnMetadata for empty columns.
> > >
> > > After giving this a second thought, this also makes more sense in
> general.
> > > As the navigation direction is usually always from schema to metadata
> > (not
> > > vice versa!), the schema should point us to the correct metadata
> instead
> > of
> > > the metadata pointing us to the correct schema entry.
> > >
> > > (I'll post this suggestion also into the PR for reference)
> > >
> > > Cheers,
> > > Jan
> > >
> > >
> > >
> > >
> > > On Tue, Jun 4, 2024 at 2:20 PM Antoine Pitrou <[email protected]>
> > > wrote:
> > >
> > > > On Tue, 4 Jun 2024 10:52:54 +0200
> > > > Alkis Evlogimenos
> > > > <[email protected]>
> > > > wrote:
> > > > > >
> > > > > > Finally, one point I wanted to highlight here (I also mentioned
> it
> > in
> > > > the
> > > > > > PR): If we want random access, we have to abolish the concept
> that
> > > the
> > > > data
> > > > > > in the columns array is in a different order than in the schema.
> > Your
> > > > PR
> > > > > > [1] even added a new field schema_index for matching between
> > > > > > ColumnMetaData and schema position, but this kills random access.
> > If
> > > I
> > > > want
> > > > > > to read the third column in the schema, then do a O(1) random
> > access
> > > > into
> > > > > > the third column chunk only to notice that its schema index is
> > > totally
> > > > > > different and therefore I need a full exhaustive search to find
> the
> > > > column
> > > > > > that actually belongs to the third column in the schema, then all
> > our
> > > > > > random access efforts are in vain.
> > > > >
> > > > > `schema_index` is useful to implement
> > > > > https://issues.apache.org/jira/browse/PARQUET-183 which is more
> and
> > > more
> > > > > prevalent as schemata become wider.
> > > >
> > > > But this means a scan of all column chunk metadata in a row group is
> > > > required to know if a particular column exists there? Or am I missing
> > > > something?
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> > > >
> > > >
> > >
> >
>
