So it seems there are at least three ways forward:
1. Leave this email chain in the archives (nothing more)
2. Incorporate some of the content of these emails into the spec as a
"Background" and "Recommendation" section
3. Write a separate blog post / other content

I would personally be inclined toward #2 (and if Jan and the maintainers
agree I could take a shot at drafting it)

Thank you,
Andrew

On Mon, Jun 3, 2024 at 9:33 AM Jan Finis <[email protected]> wrote:

> Thanks a lot for the answer; that clarifies it for me.
>
> Let me elaborate on this a bit more. Bloom filters are a topic where a
> lot of theory is necessary to make them work effectively, and the fact
> that most implementations today leave the hard part (getting the NDV
> count right) to client code makes them hard to use effectively and easy
> to build in a way that won't work or will waste space.
>
> Given that the spec is also meant to help implementations get this
> right, I agree that we should put considerably more of the theoretical
> basics into the spec, so not everyone has to reinvent the wheel and
> figure this out for themselves.
>
> > The blog is referring to the Rust parquet writer settings [2] which do
> > let you vary the NDV parameter used to create the Bloom filters. There
> > is nothing in the spec to my knowledge that requires the NDV used to
> > match the actual NDV in the data.
>
>
> Got it! That's where my disconnect stems from, then. I was arguing
> from the theory side, while your experience is with the Rust
> implementation.
>
> I would argue that it is implied in the spec that the NDV count is the real
> NDV count, as the spec only talks about theory, not about implementation
> parameters. The spec does not talk about how to obtain such an NDV count
> though. The fact that this is a parameter in the implementation is probably
> an artifact of the implementation not tracking the NDV count itself and
> therefore expecting the user to track it and input it as a parameter. I
> would still argue that the chosen NDV count should match the real one;
> otherwise your FPP will be totally off. On the other hand, if the NDV
> count is correct, you are essentially guaranteed to get the FPP you
> expect.
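>
> (For reference, the classic approximation for the false positive
> probability of a bloom filter with m bits, k hash functions, and n
> distinct inserted values is FPP ≈ (1 - e^(-k*n/m))^k. Parquet's
> split-block bloom filters use k = 8 and track this formula closely,
> though they are slightly worse in practice; I'll use this approximation
> for the back-of-the-envelope numbers below.)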
>
> > One reason we did an empirical study was to see how to match the
> > theory to practice. All the code used is open source [1] and we would
> > welcome either corrections to the code or help understanding why your
> > theory based argument doesn't seem to match what we measured in
> > practice.
>
>
> As we found out together, I guess the code itself is fine. We might
> just have drawn imprecise conclusions from it. It was argued that
> Parquet would suggest a ~2MB bloom filter in the 1M cardinality case,
> but it turned out that the 1M cardinality case was likely really a
> 10'000 NDV count per row group case, for which the Parquet spec would
> rather suggest a ~12kB bloom filter. As the data in the benchmark
> showed clustering, the assumption "cardinality ~= row group NDV count"
> was incorrect, and all further discrepancy followed from that.
>
>
> > 2. We were trying to find settings that would permit us to effectively
> > prune row groups without having to know the exact shape of the data up
> > front.
> >
>
> Let me put on my database theorist's hat here and be frank: this is
> impossible in general. NDV counts can be anything from 1 to millions,
> so trying to find a ballpark estimate that always works is bound to
> fail. Sizing a bloom filter without knowing the NDV count will almost
> certainly lead to a filter that is either way too big (wasting space)
> or way too small (useless due to false positives). A badly sized bloom
> filter has severe disadvantages: readers will still use it, wasting I/O
> and potentially doing an expensive check that is useless because a
> too-small filter just always returns true. There is so much to lose
> that you're probably better off not writing a bloom filter at all in
> this case. A bloom filter simply needs knowledge of the (real) NDV
> count to be effective.
>
> And as you can see in the table in the Parquet spec, bloom filter
> quality is *very* dependent on good NDV estimates. E.g., the difference
> in size between 10% FPP and 0.1% FPP is only a factor of 2.81. So if
> you get your NDV guesstimate wrong by just 3x, you might get a bloom
> filter with 10% FPP even though you expected one with 0.1% FPP. Or with
> the numbers of the benchmark: the bloom filter for the 1M case was good
> with NDV count 5000 (91% pruned) but mostly useless with 1000 (5%
> pruned).
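>
> To make that concrete: the spec's table allocates ~16.2 bits per
> distinct value for 0.1% FPP, and ~16.2 / 2.81 ≈ 5.8 bits per distinct
> value for 10% FPP. So if you size a filter for 1'000 NDVs at 0.1% FPP
> (~16'200 bits) but the row group actually contains ~2'810 distinct
> values, you are down to ~5.8 bits per value, i.e. at roughly 10% FPP.
> A 3x underestimate costs you two orders of magnitude in FPP.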
>
> So if you want to use bloom filters and expect them to be effective,
> there is no way around having a good NDV estimate. I would argue that
> all Parquet writer implementations should track this through sketches,
> but I see that the open source implementations do not do this yet. It
> should be fairly straightforward to implement, as open source sketch
> implementations are available [1][2].
>
> As long as your Parquet writer implementation of choice doesn't do
> this, I would advise doing it yourself in client code: buffer roughly a
> row group's worth of rows and compute an NDV estimate through a sketch
> (this should be orders of magnitude cheaper than writing the Parquet
> file, so it shouldn't matter much performance-wise). Then pass this
> estimate as the NDV parameter to the implementation.
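>
> To sketch what that could look like with the Rust parquet writer (a
> minimal, untested sketch; the buffering, the sketch precision, and the
> 0.1% FPP target are my assumptions, but set_bloom_filter_ndv /
> set_bloom_filter_fpp and the hyperloglogplus crate are the ones
> referenced in this thread):
>
> use std::collections::hash_map::RandomState;
> use hyperloglogplus::{HyperLogLog, HyperLogLogPlus};
> use parquet::file::properties::WriterProperties;
>
> // Pass 1 over the buffered rows of (roughly) one row group: feed the
> // column's values into a sketch to estimate its NDV count.
> fn estimate_ndv(values: &[i64]) -> u64 {
>     // Precision 14 => 2^14 registers, under 1% standard error
>     // (assumption: plenty accurate for bloom filter sizing).
>     let mut hll: HyperLogLogPlus<i64, RandomState> =
>         HyperLogLogPlus::new(14, RandomState::new()).unwrap();
>     for v in values {
>         hll.insert(v);
>     }
>     hll.count() as u64
> }
>
> // Pass 2: hand the estimate to the writer, which then sizes the bloom
> // filter from (ndv, fpp) instead of a blind guess.
> fn writer_props(values: &[i64]) -> WriterProperties {
>     WriterProperties::builder()
>         .set_bloom_filter_enabled(true)
>         .set_bloom_filter_ndv(estimate_ndv(values))
>         .set_bloom_filter_fpp(0.001) // target 0.1% FPP
>         .build()
> }
>
> (Note that WriterProperties applies to the whole file in the Rust
> implementation, so with varying per-row-group NDV counts you would have
> to estimate per file rather than per row group.)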
>
> The only disadvantage worth mentioning is that when using sketches to
> size bloom filters, you need two passes over the data per row group:
> one to estimate the NDV count (and thus the bloom filter size), and one
> to fill the bloom filter. Given that you otherwise risk the bloom
> filter being ineffective, this should be worth it.
>
> With this common understanding, I would like to draw a conclusion.
> Something along these lines should probably also be mentioned in the
> spec:
>
>
>    - Bloom filters are amazing if you get the NDV count right; otherwise
>    they quickly deteriorate into wasted space or a very high false
>    positive probability.
>    - For Parquet bloom filters, what matters is the NDV count in the
>    respective row group, not the NDV count / cardinality of your whole
>    data set. The two can be equal if you have no clustering or ordering,
>    but a lot of real-world data is clustered, so don't expect to be able
>    to infer one from the other.
>    - Sketches can help you compute an NDV count that is good enough
>    basically for free (compared to the cost of creating the bloom
>    filter), but the writer will need to perform two passes over each row
>    group.
>
> And finally, one thing the blog didn't measure: bloom filters can get
> *very big* if the NDV count of your row group is close to the number of
> values (i.e., the column is mostly unique). Since the blog only went up
> to an NDV count of ~10'000, it didn't measure the case of a (true) NDV
> count of 1'000'000 for a 1'000'000-row row group.
>
> As the table in the spec mentions, we need 16.2 bits (roughly 2 bytes)
> per distinct value if we want 0.1% FPP. So let's say we have a unique
> 4-byte integer column. Then the values are 4 bytes each, while the
> bloom filter is 2 bytes per value, so the bloom filter makes up 33% of
> the total. But let's say we have a pattern in the data that allows us
> to encode the values more efficiently.
>
> For example, say the values are mostly incrementing with some gaps, so
> DELTA_BINARY_PACKED encoding can store mostly 4-bit deltas. Suddenly,
> each encoded value might take ~0.5 bytes, but the bloom filter cannot
> benefit from this compression, so it is still 2 bytes per value. Boom,
> we have just created a bloom filter that is 4 times larger than the
> data itself, ouch!
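>
> In numbers: 16.2 bits ≈ 2 bytes of filter per distinct value. For raw
> 4-byte values that is 2/4 = 50% on top of the data (the 33% above is
> the filter's share of the combined 6 bytes). With ~0.5-byte encoded
> values it is 2/0.5 = 4x the encoded data.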
>
> So size-wise, the conclusion should be:
>
>    - Bloom filters are small if (true) NDV count (per row group) << row
>    count (per row group). Once the NDV count approaches the row count, a
>    bloom filter's size can become as big as the data itself or even many
>    times larger.
>
>
> Sorry for elaborating at such length here. This topic is quite dear to
> me, as I also had to learn all of this the hard way. Our company also
> wanted to use bloom filters, and we spent a long time discussing why
> they didn't work as expected, so I had to work my way through all that
> theory. And I feel this should just be in the spec, so everyone can
> read up on it without having to derive it themselves.
>
> Cheers,
> Jan
>
> [1] https://docs.rs/hyperloglogplus/latest/hyperloglogplus/
> [2] https://datasketches.apache.org/docs/Community/Downloads.html
>
>
> On Mon, Jun 3, 2024 at 1:45 PM Andrew Lamb <[email protected]> wrote:
>
> > Thanks Jan -- interesting discussion
> >
> > Some points:
> > 1. We used the Rust implementation of parquet (not parquet-java).
> > 2. We were trying to find settings that would permit us to effectively
> > prune row groups without having to know the exact shape of the data up
> > front.
> > 3. The blog is an attempt to help others figure out how to configure
> > parquet writers to get effective bloom filter pruning.
> >
> > > The blog post claims that just using 8KB instead of 2MB works just
> > > fine as well. But that doesn't make sense, theory-wise, unless the
> > > FPP formula is wrong (which it isn't; it's proven).
> >
> > One reason we did an empirical study was to see how to match the
> > theory to practice. All the code used is open source [1] and we would
> > welcome either corrections to the code or help understanding why your
> > theory based argument doesn't seem to match what we measured in
> > practice.
> >
> > > Side comment: First, the blog post talks about "varying the NDV
> > > parameter". However, NDV isn't a parameter that one can vary freely.
> > > It is supposed to be the *true* NDV count in the data set. That is,
> > > for Parquet it should be the true NDV count in the row group for
> > > which the bloom filter is built. Nothing to vary here.
> >
> > The blog is referring to the Rust parquet writer settings [2] which do
> > let you vary the NDV parameter used to create the Bloom filters. There
> > is nothing in the spec to my knowledge that requires the NDV used to
> > match the actual NDV in the data.
> >
> > > I guess your experiment actually measures the best case, as then
> > > your numbers are in perfect sync with theory: With 8kb (6.4% FPP),
> > > over 90% of row groups are filtered, while with 16kb (0.2% FPP) all
> > > row groups are correctly pruned. But this result just confirms that
> > > the sizing formula in the Parquet spec is correct.
> >
> > This seems like an excellent conclusion to me. I don't think we meant to
> > imply there is anything incorrect with the spec.
> >
> > Andrew
> >
> >
> > [1] https://github.com/influxdata/parquet-bloom-filter-analysis
> >
> > [2]
> > https://docs.rs/parquet/latest/parquet/file/properties/struct.WriterPropertiesBuilder.html#method.set_bloom_filter_ndv
> >
> > > Now, if I understand the claims of the blog post correctly, the blog
> > > post basically says that if you underestimate the NDV count heavily,
> > > you still get great pruning results. And this just contradicts the
> > > FPP formula. As just elaborated, by underestimating the NDV count,
> > > you should basically get a higher FPP, which should reduce the
> > > number of row groups you can prune. But the measurement seems to
> > > hint that you can still prune perfectly. So basically, the blog post
> > > empirically finds that the FPP formula for bloom filters is wrong
> > > (or do I misunderstand the blog post here?).
> > >
> > > *Now comes the important point and I think here is where the problem
> > > lies:* The blog post doesn't say by how much the NDV is
> > > underestimated, though. It mentions the true NDV count of the whole
> > > data set (called cardinality in the blog post), but it does not
> > > mention what that means for the true NDV count *per row group*. The
> > > latter depends on the sorting of the data. Is it sorted? Or is it
> > > random? In the former case, you will have way fewer NDVs per row
> > > group than in the latter case (around 100x fewer in the 1M
> > > cardinality case).
> > >
> > > *Let's assume the worst case* (close to random ordering of rows):
> > > For the 1M cardinality example, we would really also get a true 1M
> > > NDV count in each row group, so basically only unique values. In
> > > this case, per the FPP formula, if you only use an 8kb bloom filter,
> > > you should get the following FPP:
> > >
> > > FPP ≈ (1 - e^(-k*n/m))^k
> > >
> > > With k = 8, n = 1'000'000, m = 64'000 (8kb in bits), we get
> > > basically 1.000, so 100% false positive probability. And that also
> > > matches intuition: each NDV sets 8 bits, and since you only have
> > > 64'000 bits to set but you're setting 8 million bits, each bit ends
> > > up being 1 with almost certain probability, so you end up with a
> > > useless bloom filter that is all 1 bits.
> > >
> > > *Let's assume the best case:* If the values are sorted and we have a
> > > 1 million NDV count for 100 million values, that means we have a
> > > 10'000 NDV count in each row group (each value duplicated 100
> > > times). For this case, the expected FPP with an 8kb bloom filter is
> > > 6.4% (k = 8, n = 10'000, m = 64'000).
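> > >
> > > (To double check both numbers, here is a quick sketch using the
> > > classic approximation above, not the exact split-block formula:
> > >
> > > /// Classic bloom filter FPP approximation: (1 - e^(-k*n/m))^k
> > > fn fpp(k: f64, n: f64, m_bits: f64) -> f64 {
> > >     (1.0 - (-k * n / m_bits).exp()).powf(k)
> > > }
> > >
> > > fn main() {
> > >     // Worst case: 1M NDVs in a 64'000-bit (8kb) filter.
> > >     println!("{:.4}", fpp(8.0, 1_000_000.0, 64_000.0)); // ~1.0000
> > >     // Best case: 10'000 NDVs in the same filter.
> > >     println!("{:.4}", fpp(8.0, 10_000.0, 64_000.0)); // ~0.067
> > > }
> > >
> > > The second value lands near the ~6.4% quoted above; the small gap is
> > > the approximation.)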
> > >
> > > For this case, the bloom filter is definitely usable. However, for
> > > this case, the default Parquet sizing formula would have yielded not
> > > 2MB but rather ~12kb, so if this is the case that your benchmark
> > > measures, then your values are pretty much in agreement with the
> > > Parquet sizing formula and there is just no discrepancy.
> > >
> > > I guess your experiment actually measures the best case, as then
> > > your numbers are in perfect sync with theory: With 8kb (6.4% FPP),
> > > over 90% of row groups are filtered, while with 16kb (0.2% FPP) all
> > > row groups are correctly pruned. But this result just confirms that
> > > the sizing formula in the Parquet spec is correct.
> > >
> > > So could it be that the blog post accidentally measures an NDV count
> > > of 10'000, while the intuition given is that we're dealing with an
> > > NDV count of 1'000'000 in the 1'000'000 cardinality case?
> > >
> > > Cheers
> > > Jan
> > >
> > >
> > > On Sat, Jun 1, 2024 at 9:37 AM Micah Kornfield <[email protected]> wrote:
> > >
> > > > I think the table is useful; there are calculators online that do
> > > > this pretty easily, but putting it into the docs might allow at
> > > > least some users to avoid unpleasant surprises. In terms of
> > > > generalizing to smaller NDV counts and their effectiveness, we
> > > > might just want to state the result but provide strong caveats to
> > > > benchmark with their own data?
> > > >
> > > > Cheers,
> > > > Micah
> > > >
> > > >
> > > > On Fri, May 31, 2024 at 1:59 PM Andrew Lamb <[email protected]>
> > > > wrote:
> > > >
> > > > > When we first looked into Parquet bloom filters [1], it was hard
> > > > > to understand how effective they would be for a given amount of
> > > > > space overhead.
> > > > >
> > > > > When we plugged our data's cardinality into the target ndv and
> > > > > fpp parameters, it implied 2MB bloom filters *per column* per row
> > > > > group, which was unacceptable.
> > > > >
> > > > > However, when we did empirical tests (see Blog [2] from Trevor
> > > > > Hilton), we found 2K-8K worked quite well.
> > > > >
> > > > > Is there any interest in porting some of the information from
> > > > > the blog into the spec (specifically the tables of size based on
> > > > > fpp/ndv)? Or is this better as a third-party resource / exercise
> > > > > for the reader?
> > > > >
> > > > > Andrew
> > > > >
> > > > > [1]: https://parquet.apache.org/docs/file-format/bloomfilter/
> > > > > [2]: https://www.influxdata.com/blog/using-parquets-bloom-filters/
> > > > >
> > > >
> > >
> >
>
