Re: [DISCUSS] schema_index

Jan Finis Wed, 05 Jun 2024 13:04:18 -0700

Okay, got it, you've already assumed low overhead parsing. Agree that it's
then probably not a problem.


Still, the "empty row group, but some other row group is not empty, and it
is important to save the memory for the empty group" strikes me as quite an
edge case. The issue from 2015 also doesn't mention this case, but rather
talks about completely empty columns (in all row groups), if I see it
correctly. I would favor a simpler design with an index into the right
direction instead expecting readers to create that index on the fly. But I
don't have a strong opinion here then, if we can get a footer encoding that
is super low overhead, so this isn't a problem in practice anymore.

Cheers,
Jan

Am Mi., 5. Juni 2024 um 21:43 Uhr schrieb Alkis Evlogimenos
<[email protected]>:

> (2) would take unduly long - if the metadata decoder is not performant
> enough. The speed of the decoder strongly depends on the encoding of
> choice. If we choose flatbuffers, 100'000 columns would parse in a few ms
> (with verification) or some much less significant time without.
>
> Given the 4 steps above, the design space guides us towards:
> - optimize for one fetch from object store
> - zero/low overhead parsing of metadata
> - zero/low post processing (optional)
>
> If we stick to thrift, you are correct: we can't expect to parse all the
> metadata in reasonable time to do resolution for a handful of columns. We
> would need solutions like the ones proposed by Antoine/Micah where
> per-column metadata is parsed lazily. Those solutions have a drawback: when
> a lot of columns participate they are still slow. This makes the design
> parameters I listed above superior, because they optimize for that case as
> well.
>
>
> On Wed, Jun 5, 2024 at 9:30 PM Jan Finis <[email protected]> wrote:
>
> > I see your point, but isn't it circular reasoning?:
> >
> > The case that we are trying to solve is one where we have a lot of
> columns
> > (say 100'000), making the decoding of the per-column metadata very
> > expensive. In your enumeration, this means that (2) would take unduly
> long.
> > (1) does not depend too much on the footer size, as it's mostly latency
> > that you also pay for a small footer, so (2) becomes the dominating step.
> > The PR proposes how we can get (2) fast by not even parsing the
> per-column
> > metadata but keeping it unparsed and only parsing it for columns we're
> > interested in. With this proposal, we couldn't even do (3) anymore, as
> the
> > metadata isn't parsed.
> >
> > So yes, I agree with you that post processing after parsing is
> negligible,
> > but we want to skip the parsing altogether and make it lazy, so we cannot
> > even do post processing.
> >
> > Does that make sense, or did I misunderstand your point?
> >
> > Cheers
> > Jan
> >
> > Am Mi., 5. Juni 2024 um 21:09 Uhr schrieb Alkis Evlogimenos
> > <[email protected]>:
> >
> > > >
> > > > We're aiming for O(1) random access here. If we required some post
> > > > processing, then we would fail this goal.
> > >
> > >
> > > In practice what we want is things to be performant. Sometimes O(1)
> > > matters, sometimes not. Looking at some real numbers:
> > >
> > > Before we start decoding columns we need to do some metadata
> processing.
> > To
> > > do this an engine typically does:
> > > 1. fetch footer: smart engines do a speculative fetch and do this in 1
> > go.
> > > Cost from popular object stores? 50ms
> > > 2. parse thrift footer: Wide footers take 30-40ms.
> > > 3. (optional: some engines do not do this at all) resolve thrift
> > > footer/transform to internal data structures: typically takes 100us
> > > 4. match schema on resolved metadata: typically takes 100ns per column
> if
> > > O(1)
> > >
> > > (3) doing a pass over the metadata to guarantee (4) is O(1) does not
> fail
> > > the goal of being fast as long as the cost of doing (3) is a lot
> smaller
> > > than (1) + (2). In a future version, we would shrink footers by 2x and
> > > speed up parsing by 100x. Then the above would look like this:
> > >
> > > 1. 30ms
> > > 2. 50us
> > > 3. 100us
> > > 4.  100ns/col
> > >
> > > It still doesn't matter if we do some lightweight postprocessing (3)
> > given
> > > that fetching is so slow.
> > >
> > >
> > > On Wed, Jun 5, 2024 at 7:14 PM Jan Finis <[email protected]> wrote:
> > >
> > > > > That said, if we assume there is some post-processing done after
> > > > > deserializing FileMetaData, one can build the forward indexing from
> > > > schema
> > > > > to column metadata *per rowgroup* with one pass. Such processing is
> > > very
> > > > > cheap: it shouldn't take more than a couple dozen micros.
> > > >
> > > >
> > > > We're aiming for O(1) random access here. If we required some post
> > > > processing, then we would fail this goal.
> > > >
> > > > Even better if we leave the representation dense, plus make empty
> > > > > columns serialize to all default values of ColumnMetaData. For most
> > > > > serializers (flatbuffers, protobufs) this means zero cost on the
> > wire.
> > > If
> > > > > we want to avoid data pages, smart parquet writers can write one
> page
> > > of
> > > > > all nulls and point all empty columns to the same page?
> > > >
> > > >
> > > > Agree, this works. Then again, we wouldn't need schema_index anymore,
> > so
> > > > we're back to my initial proposal :).
> > > >
> > > > Avoiding writing of data pages for fully empty chunks is a different
> > > issue
> > > > and I agree we can do this. Whether it saves so much is questionable
> > > > though. I would guess an empty data page is not much larger than the
> > > > `ColumnMetaData` representation in FlatBuffers, so we wouldn't save a
> > > lot.
> > > > Parquet is already quite good at compressing an only-nulls page into
> > > just a
> > > > few bytes. Leaving out these few bytes I would mostly consider a
> micro
> > > > optimization and I would love to first see a benchmark showing that
> > this
> > > > optimization is really necessary.
> > > >
> > > > I guess with an O(1) random access format, a lot of the problems are
> > > > already gone. Many empty columns were a problem in 2015 (the date
> from
> > > the
> > > > mentioned issue [1]) when there were neither size statistics nor an
> > O(1)
> > > > random access metadata format.
> > > >
> > > > Cheers,
> > > > Jan
> > > >
> > > > [1] https://issues.apache.org/jira/browse/PARQUET-183
> > > >
> > > > Am Mi., 5. Juni 2024 um 19:04 Uhr schrieb Micah Kornfield <
> > > > [email protected]>:
> > > >
> > > > > >
> > > > > > For most
> > > > > > serializers (flatbuffers, protobufs) this means zero cost on the
> > wire
> > > > >
> > > > >
> > > > > It is not quite zero size on the wire, but it is worth pointing out
> > the
> > > > > SizeStatistics [1] contains all the information necessary to
> > determine
> > > > if a
> > > > > column is all nulls.  Combined with statistics if they are exact,
> it
> > > also
> > > > > allows one to determine if a column is entirely a single value.  If
> > the
> > > > > size overhead is reasonable for those two elements, then I think
> the
> > > main
> > > > > consideration is whether we should be changing the spec at some
> point
> > > to
> > > > > make writing these columns entirely optional?
> > > > >
> > > > > Thanks,
> > > > > Micah
> > > > >
> > > > > [1]
> > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L845
> > > > >
> > > > > On Tue, Jun 4, 2024 at 9:58 AM Alkis Evlogimenos
> > > > > <[email protected]> wrote:
> > > > >
> > > > > > The drawback with having the reverse mapping is that only empty
> in
> > > all
> > > > > row
> > > > > > groups columns can be elided. Columns that are empty in some row
> > > groups
> > > > > > can't. I do not have good stats to decide either way.
> > > > > >
> > > > > > That said, if we assume there is some post-processing done after
> > > > > > deserializing FileMetaData, one can build the forward indexing
> from
> > > > > schema
> > > > > > to column metadata *per rowgroup* with one pass. Such processing
> is
> > > > very
> > > > > > cheap: it shouldn't take more than a couple dozen micros.
> > > > > >
> > > > > >
> > > > > > Even better if we leave the representation dense, plus make empty
> > > > > > columns serialize to all default values of ColumnMetaData. For
> most
> > > > > > serializers (flatbuffers, protobufs) this means zero cost on the
> > > wire.
> > > > If
> > > > > > we want to avoid data pages, smart parquet writers can write one
> > page
> > > > of
> > > > > > all nulls and point all empty columns to the same page?
> > > > > >
> > > > > > On Tue, Jun 4, 2024 at 3:48 PM Jan Finis <[email protected]>
> > wrote:
> > > > > >
> > > > > > > I would agree that at least for our use cases, this trade off
> > would
> > > > not
> > > > > > be
> > > > > > > favorable, so we would rather always write some metadata for
> > > "empty"
> > > > > > > columns and therefore get random I/O into the columns array.
> > > > > > >
> > > > > > > If I understand the use case correctly though, then this is
> > mostly
> > > > > meant
> > > > > > > for completely empty columns. Thus, storing this information
> per
> > > row
> > > > > > group
> > > > > > > seems unnecessary.
> > > > > > >
> > > > > > > *So what about this alternative proposal that actually combines
> > the
> > > > > > > advantages of both:*
> > > > > > >
> > > > > > > How about just turning things around: Instead of having a
> > > > schema_index
> > > > > in
> > > > > > > the ColumnMetadata, we could have a column_metadata_index in
> the
> > > > > schema.
> > > > > > If
> > > > > > > that index is missing/-1, then this signifies that the column
> is
> > > > empty,
> > > > > > so
> > > > > > > no metadata will be present for it. With this, we would get the
> > > best
> > > > of
> > > > > > > both worlds: We would always have O(1) random I/O even in case
> of
> > > > such
> > > > > > > empty columns (as we would use the column_metadata_index for
> the
> > > > > lookup)
> > > > > > > and we would not need to store any ColumnMetadata for empty
> > > columns.
> > > > > > >
> > > > > > > After given this a second thought, this also makes more sense
> in
> > > > > general.
> > > > > > > As the navigation direction is usually always from schema to
> > > metadata
> > > > > > (not
> > > > > > > vice versa!), the schema should point us to the correct
> metadata
> > > > > instead
> > > > > > of
> > > > > > > the metadata pointing us to the correct schema entry.
> > > > > > >
> > > > > > > (I'll post this suggestion also into the PR for reference)
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Jan
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Am Di., 4. Juni 2024 um 14:20 Uhr schrieb Antoine Pitrou <
> > > > > > > [email protected]
> > > > > > > >:
> > > > > > >
> > > > > > > > On Tue, 4 Jun 2024 10:52:54 +0200
> > > > > > > > Alkis Evlogimenos
> > > > > > > > <[email protected]>
> > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > Finally, one point I wanted to highlight here (I also
> > > mentioned
> > > > > it
> > > > > > in
> > > > > > > > the
> > > > > > > > > > PR): If we want random access, we have to abolish the
> > concept
> > > > > that
> > > > > > > the
> > > > > > > > data
> > > > > > > > > > in the columns array is in a different order than in the
> > > > schema.
> > > > > > Your
> > > > > > > > PR
> > > > > > > > > > [1] even added a new field schema_index for matching
> > between
> > > > > > > > > > ColumnMetaData and schema position, but this kills random
> > > > access.
> > > > > > If
> > > > > > > I
> > > > > > > > want
> > > > > > > > > > to read the third column in the schema, then do a O(1)
> > random
> > > > > > access
> > > > > > > > into
> > > > > > > > > > the third column chunk only to notice that it's schema
> > index
> > > is
> > > > > > > totally
> > > > > > > > > > different and therefore I need a full exhaustive search
> to
> > > find
> > > > > the
> > > > > > > > column
> > > > > > > > > > that actually belongs to the third column in the schema,
> > then
> > > > all
> > > > > > our
> > > > > > > > > > random access efforts are in vain.
> > > > > > > > >
> > > > > > > > > `schema_index` is useful to implement
> > > > > > > > > https://issues.apache.org/jira/browse/PARQUET-183 which is
> > > more
> > > > > and
> > > > > > > more
> > > > > > > > > prevalent as schemata become wider.
> > > > > > > >
> > > > > > > > But this means of scan of all column chunk metadata in a row
> > > group
> > > > is
> > > > > > > > required to know if a particular column exists there? Or am I
> > > > missing
> > > > > > > > something?
> > > > > > > >
> > > > > > > > Regards
> > > > > > > >
> > > > > > > > Antoine.
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] schema_index

Reply via email to