Re: [DISCUSS] schema_index

Alkis Evlogimenos Wed, 05 Jun 2024 13:14:29 -0700

I am with you. I much prefer to not have the index and have dense encoding
as long as I do not have to pay for it :-)


On Wed, Jun 5, 2024 at 10:04 PM Jan Finis <[email protected]> wrote:

> Okay, got it, you've already assumed low overhead parsing. Agree that it's
> then probably not a problem.
>
> Still, the "empty row group, but some other row group is not empty, and it
> is important to save the memory for the empty group" strikes me as quite an
> edge case. The issue from 2015 also doesn't mention this case, but rather
> talks about completely empty columns (in all row groups), if I see it
> correctly. I would favor a simpler design with an index into the right
> direction instead expecting readers to create that index on the fly. But I
> don't have a strong opinion here then, if we can get a footer encoding that
> is super low overhead, so this isn't a problem in practice anymore.
>
> Cheers,
> Jan
>
> Am Mi., 5. Juni 2024 um 21:43 Uhr schrieb Alkis Evlogimenos
> <[email protected]>:
>
> > (2) would take unduly long - if the metadata decoder is not performant
> > enough. The speed of the decoder strongly depends on the encoding of
> > choice. If we choose flatbuffers, 100'000 columns would parse in a few ms
> > (with verification) or some much less significant time without.
> >
> > Given the 4 steps above, the design space guides us towards:
> > - optimize for one fetch from object store
> > - zero/low overhead parsing of metadata
> > - zero/low post processing (optional)
> >
> > If we stick to thrift, you are correct: we can't expect to parse all the
> > metadata in reasonable time to do resolution for a handful of columns. We
> > would need solutions like the ones proposed by Antoine/Micah where
> > per-column metadata is parsed lazily. Those solutions have a drawback:
> when
> > a lot of columns participate they are still slow. This makes the design
> > parameters I listed above superior, because they optimize for that case
> as
> > well.
> >
> >
> > On Wed, Jun 5, 2024 at 9:30 PM Jan Finis <[email protected]> wrote:
> >
> > > I see your point, but isn't it circular reasoning?:
> > >
> > > The case that we are trying to solve is one where we have a lot of
> > columns
> > > (say 100'000), making the decoding of the per-column metadata very
> > > expensive. In your enumeration, this means that (2) would take unduly
> > long.
> > > (1) does not depend too much on the footer size, as it's mostly latency
> > > that you also pay for a small footer, so (2) becomes the dominating
> step.
> > > The PR proposes how we can get (2) fast by not even parsing the
> > per-column
> > > metadata but keeping it unparsed and only parsing it for columns we're
> > > interested in. With this proposal, we couldn't even do (3) anymore, as
> > the
> > > metadata isn't parsed.
> > >
> > > So yes, I agree with you that post processing after parsing is
> > negligible,
> > > but we want to skip the parsing altogether and make it lazy, so we
> cannot
> > > even do post processing.
> > >
> > > Does that make sense, or did I misunderstand your point?
> > >
> > > Cheers
> > > Jan
> > >
> > > Am Mi., 5. Juni 2024 um 21:09 Uhr schrieb Alkis Evlogimenos
> > > <[email protected]>:
> > >
> > > > >
> > > > > We're aiming for O(1) random access here. If we required some post
> > > > > processing, then we would fail this goal.
> > > >
> > > >
> > > > In practice what we want is things to be performant. Sometimes O(1)
> > > > matters, sometimes not. Looking at some real numbers:
> > > >
> > > > Before we start decoding columns we need to do some metadata
> > processing.
> > > To
> > > > do this an engine typically does:
> > > > 1. fetch footer: smart engines do a speculative fetch and do this in
> 1
> > > go.
> > > > Cost from popular object stores? 50ms
> > > > 2. parse thrift footer: Wide footers take 30-40ms.
> > > > 3. (optional: some engines do not do this at all) resolve thrift
> > > > footer/transform to internal data structures: typically takes 100us
> > > > 4. match schema on resolved metadata: typically takes 100ns per
> column
> > if
> > > > O(1)
> > > >
> > > > (3) doing a pass over the metadata to guarantee (4) is O(1) does not
> > fail
> > > > the goal of being fast as long as the cost of doing (3) is a lot
> > smaller
> > > > than (1) + (2). In a future version, we would shrink footers by 2x
> and
> > > > speed up parsing by 100x. Then the above would look like this:
> > > >
> > > > 1. 30ms
> > > > 2. 50us
> > > > 3. 100us
> > > > 4.  100ns/col
> > > >
> > > > It still doesn't matter if we do some lightweight postprocessing (3)
> > > given
> > > > that fetching is so slow.
> > > >
> > > >
> > > > On Wed, Jun 5, 2024 at 7:14 PM Jan Finis <[email protected]> wrote:
> > > >
> > > > > > That said, if we assume there is some post-processing done after
> > > > > > deserializing FileMetaData, one can build the forward indexing
> from
> > > > > schema
> > > > > > to column metadata *per rowgroup* with one pass. Such processing
> is
> > > > very
> > > > > > cheap: it shouldn't take more than a couple dozen micros.
> > > > >
> > > > >
> > > > > We're aiming for O(1) random access here. If we required some post
> > > > > processing, then we would fail this goal.
> > > > >
> > > > > Even better if we leave the representation dense, plus make empty
> > > > > > columns serialize to all default values of ColumnMetaData. For
> most
> > > > > > serializers (flatbuffers, protobufs) this means zero cost on the
> > > wire.
> > > > If
> > > > > > we want to avoid data pages, smart parquet writers can write one
> > page
> > > > of
> > > > > > all nulls and point all empty columns to the same page?
> > > > >
> > > > >
> > > > > Agree, this works. Then again, we wouldn't need schema_index
> anymore,
> > > so
> > > > > we're back to my initial proposal :).
> > > > >
> > > > > Avoiding writing of data pages for fully empty chunks is a
> different
> > > > issue
> > > > > and I agree we can do this. Whether it saves so much is
> questionable
> > > > > though. I would guess an empty data page is not much larger than
> the
> > > > > `ColumnMetaData` representation in FlatBuffers, so we wouldn't
> save a
> > > > lot.
> > > > > Parquet is already quite good at compressing an only-nulls page
> into
> > > > just a
> > > > > few bytes. Leaving out these few bytes I would mostly consider a
> > micro
> > > > > optimization and I would love to first see a benchmark showing that
> > > this
> > > > > optimization is really necessary.
> > > > >
> > > > > I guess with an O(1) random access format, a lot of the problems
> are
> > > > > already gone. Many empty columns were a problem in 2015 (the date
> > from
> > > > the
> > > > > mentioned issue [1]) when there were neither size statistics nor an
> > > O(1)
> > > > > random access metadata format.
> > > > >
> > > > > Cheers,
> > > > > Jan
> > > > >
> > > > > [1] https://issues.apache.org/jira/browse/PARQUET-183
> > > > >
> > > > > Am Mi., 5. Juni 2024 um 19:04 Uhr schrieb Micah Kornfield <
> > > > > [email protected]>:
> > > > >
> > > > > > >
> > > > > > > For most
> > > > > > > serializers (flatbuffers, protobufs) this means zero cost on
> the
> > > wire
> > > > > >
> > > > > >
> > > > > > It is not quite zero size on the wire, but it is worth pointing
> out
> > > the
> > > > > > SizeStatistics [1] contains all the information necessary to
> > > determine
> > > > > if a
> > > > > > column is all nulls.  Combined with statistics if they are exact,
> > it
> > > > also
> > > > > > allows one to determine if a column is entirely a single value.
> If
> > > the
> > > > > > size overhead is reasonable for those two elements, then I think
> > the
> > > > main
> > > > > > consideration is whether we should be changing the spec at some
> > point
> > > > to
> > > > > > make writing these columns entirely optional?
> > > > > >
> > > > > > Thanks,
> > > > > > Micah
> > > > > >
> > > > > > [1]
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L845
> > > > > >
> > > > > > On Tue, Jun 4, 2024 at 9:58 AM Alkis Evlogimenos
> > > > > > <[email protected]> wrote:
> > > > > >
> > > > > > > The drawback with having the reverse mapping is that only empty
> > in
> > > > all
> > > > > > row
> > > > > > > groups columns can be elided. Columns that are empty in some
> row
> > > > groups
> > > > > > > can't. I do not have good stats to decide either way.
> > > > > > >
> > > > > > > That said, if we assume there is some post-processing done
> after
> > > > > > > deserializing FileMetaData, one can build the forward indexing
> > from
> > > > > > schema
> > > > > > > to column metadata *per rowgroup* with one pass. Such
> processing
> > is
> > > > > very
> > > > > > > cheap: it shouldn't take more than a couple dozen micros.
> > > > > > >
> > > > > > >
> > > > > > > Even better if we leave the representation dense, plus make
> empty
> > > > > > > columns serialize to all default values of ColumnMetaData. For
> > most
> > > > > > > serializers (flatbuffers, protobufs) this means zero cost on
> the
> > > > wire.
> > > > > If
> > > > > > > we want to avoid data pages, smart parquet writers can write
> one
> > > page
> > > > > of
> > > > > > > all nulls and point all empty columns to the same page?
> > > > > > >
> > > > > > > On Tue, Jun 4, 2024 at 3:48 PM Jan Finis <[email protected]>
> > > wrote:
> > > > > > >
> > > > > > > > I would agree that at least for our use cases, this trade off
> > > would
> > > > > not
> > > > > > > be
> > > > > > > > favorable, so we would rather always write some metadata for
> > > > "empty"
> > > > > > > > columns and therefore get random I/O into the columns array.
> > > > > > > >
> > > > > > > > If I understand the use case correctly though, then this is
> > > mostly
> > > > > > meant
> > > > > > > > for completely empty columns. Thus, storing this information
> > per
> > > > row
> > > > > > > group
> > > > > > > > seems unnecessary.
> > > > > > > >
> > > > > > > > *So what about this alternative proposal that actually
> combines
> > > the
> > > > > > > > advantages of both:*
> > > > > > > >
> > > > > > > > How about just turning things around: Instead of having a
> > > > > schema_index
> > > > > > in
> > > > > > > > the ColumnMetadata, we could have a column_metadata_index in
> > the
> > > > > > schema.
> > > > > > > If
> > > > > > > > that index is missing/-1, then this signifies that the column
> > is
> > > > > empty,
> > > > > > > so
> > > > > > > > no metadata will be present for it. With this, we would get
> the
> > > > best
> > > > > of
> > > > > > > > both worlds: We would always have O(1) random I/O even in
> case
> > of
> > > > > such
> > > > > > > > empty columns (as we would use the column_metadata_index for
> > the
> > > > > > lookup)
> > > > > > > > and we would not need to store any ColumnMetadata for empty
> > > > columns.
> > > > > > > >
> > > > > > > > After given this a second thought, this also makes more sense
> > in
> > > > > > general.
> > > > > > > > As the navigation direction is usually always from schema to
> > > > metadata
> > > > > > > (not
> > > > > > > > vice versa!), the schema should point us to the correct
> > metadata
> > > > > > instead
> > > > > > > of
> > > > > > > > the metadata pointing us to the correct schema entry.
> > > > > > > >
> > > > > > > > (I'll post this suggestion also into the PR for reference)
> > > > > > > >
> > > > > > > > Cheers,
> > > > > > > > Jan
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > Am Di., 4. Juni 2024 um 14:20 Uhr schrieb Antoine Pitrou <
> > > > > > > > [email protected]
> > > > > > > > >:
> > > > > > > >
> > > > > > > > > On Tue, 4 Jun 2024 10:52:54 +0200
> > > > > > > > > Alkis Evlogimenos
> > > > > > > > > <[email protected]>
> > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > Finally, one point I wanted to highlight here (I also
> > > > mentioned
> > > > > > it
> > > > > > > in
> > > > > > > > > the
> > > > > > > > > > > PR): If we want random access, we have to abolish the
> > > concept
> > > > > > that
> > > > > > > > the
> > > > > > > > > data
> > > > > > > > > > > in the columns array is in a different order than in
> the
> > > > > schema.
> > > > > > > Your
> > > > > > > > > PR
> > > > > > > > > > > [1] even added a new field schema_index for matching
> > > between
> > > > > > > > > > > ColumnMetaData and schema position, but this kills
> random
> > > > > access.
> > > > > > > If
> > > > > > > > I
> > > > > > > > > want
> > > > > > > > > > > to read the third column in the schema, then do a O(1)
> > > random
> > > > > > > access
> > > > > > > > > into
> > > > > > > > > > > the third column chunk only to notice that it's schema
> > > index
> > > > is
> > > > > > > > totally
> > > > > > > > > > > different and therefore I need a full exhaustive search
> > to
> > > > find
> > > > > > the
> > > > > > > > > column
> > > > > > > > > > > that actually belongs to the third column in the
> schema,
> > > then
> > > > > all
> > > > > > > our
> > > > > > > > > > > random access efforts are in vain.
> > > > > > > > > >
> > > > > > > > > > `schema_index` is useful to implement
> > > > > > > > > > https://issues.apache.org/jira/browse/PARQUET-183 which
> is
> > > > more
> > > > > > and
> > > > > > > > more
> > > > > > > > > > prevalent as schemata become wider.
> > > > > > > > >
> > > > > > > > > But this means of scan of all column chunk metadata in a
> row
> > > > group
> > > > > is
> > > > > > > > > required to know if a particular column exists there? Or
> am I
> > > > > missing
> > > > > > > > > something?
> > > > > > > > >
> > > > > > > > > Regards
> > > > > > > > >
> > > > > > > > > Antoine.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] schema_index

Reply via email to