So it seems there are at least three ways forward:

1. Leave this email chain in the archives (nothing more)
2. Incorporate some of the content of these emails into the spec as a "Background" and "Recommendation" section
3. Write a separate blog post / other content
I would personally be inclined towards #2 (and if Jan and the maintainers agree, I could take a shot at drafting it).

Thank you,
Andrew

On Mon, Jun 3, 2024 at 9:33 AM Jan Finis <[email protected]> wrote:

> Thanks a lot for the answer, that clarifies it for me.
>
> Let me elaborate on this a bit further. Bloom filters are a topic where a lot of theory is necessary to make them work effectively, and the fact that most implementations today leave the hard part (getting the NDV count right) to the client code makes it hard to use them effectively and easy to build bloom filters that won't work or will waste space.
>
> Given that the spec is also meant to help implementations get this right, I agree that we should put considerably more of the theoretical basics into the spec, so that not everyone has to reinvent the wheel and figure this out for themselves.
>
> > The blog is referring to the Rust parquet writer settings [2] which do let you vary the NDV parameter used to create the Bloom filters. There is nothing in the spec to my knowledge that requires the NDV used to match the actual NDV in the data.
>
> Got it! That's where my disconnect is stemming from, then. I was arguing from the theory side, while your experience is from the Rust implementation.
>
> I would argue that the spec implies that the NDV count is the real NDV count, as the spec only talks about theory, not about implementation parameters. The spec does not say how to obtain such an NDV count, though. The fact that this is a parameter in the implementation is probably an artifact of the implementation not tracking the NDV count itself and therefore expecting the user to track it and pass it in as a parameter. I would still argue that the chosen NDV count should match the real one; otherwise your FPP will be totally off. On the other hand, if the NDV count is correct, you are basically guaranteed to get the FPP you're expecting.
>
> > One reason we did an empirical study was to see how to match the theory to practice. All the code used is open source [1] and we would welcome either corrections to the code or help understanding why your theory-based argument doesn't seem to match what we measured in practice.
>
> As we found out together, the code itself seems fine; we might have just drawn imprecise conclusions from it. It was argued that Parquet would suggest a ~2MB bloom filter in the 1M cardinality case, but as it turned out, the 1M cardinality case was likely really a 10'000-NDV-per-row-group case, for which the Parquet spec would suggest a ~12kB bloom filter instead. As the data in the benchmark showed clustering, the assumption "cardinality ~= row group NDV count" was incorrect, and all further discrepancy followed from that.
>
> > 2. We were trying to find settings that would permit us to effectively prune row groups without having to know the exact shape of the data up front.
>
> Let me put on my database theorist's hat here and be frank: this is impossible in general. NDV counts can be anything from 1 to millions, so trying to find a ballpark estimate that always works is bound to fail. Sizing a bloom filter without knowing the NDV count will almost certainly lead to a bloom filter that is either way too big (wasting space) or way too small (useless due to false positives).
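> To make this concrete, here is a minimal back-of-the-envelope sketch in plain Rust. It uses the classic bloom filter formula fpp = (1 - e^(-k*n/m))^k; the spec's split block filters differ slightly in the constants, but the trend is the same. The filter size is fixed at 8kB (m = 64'000 bits, k = 8, both just illustrative) while the true NDV count grows:
>
> // False positive probability of a classic bloom filter with k hash
> // functions, n inserted distinct values, and m bits.
> fn fpp(k: f64, ndv: f64, bits: f64) -> f64 {
>     (1.0 - (-k * ndv / bits).exp()).powf(k)
> }
>
> fn main() {
>     let (k, bits) = (8.0, 64_000.0); // fixed 8kB filter
>     for ndv in [5_000.0, 10_000.0, 30_000.0, 100_000.0] {
>         println!("true NDV {}: FPP ~{:.1}%", ndv, 100.0 * fpp(k, ndv, bits));
>     }
>     // true NDV 5000: FPP ~0.2%
>     // true NDV 10000: FPP ~6.7%
>     // true NDV 30000: FPP ~82.7%
>     // true NDV 100000: FPP ~100.0%
> }
>
> The same 8kB filter goes from excellent to useless once the true per-row-group NDV count is off by an order of magnitude.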
> A badly sized bloom filter has severe disadvantages, as readers will use it, wasting I/O and potentially doing an expensive check that is useless because a too-small bloom filter just always returns true. There is so much to lose that you're probably better off not writing a bloom filter at all in this case. A bloom filter is a data structure that simply needs knowledge of the (real) NDV count to be effective.
>
> And as you can see in the table in the Parquet spec, bloom filter quality is *very* dependent on good NDV estimates. E.g., the difference in size between 10% FPP and 0.1% FPP is only a factor of 2.81. So if you get your NDV guesstimate wrong by just 3x, you might get a bloom filter with 10% FPP, even though you expected one with 0.1% FPP. Or, with the numbers from the benchmark: the bloom filter for the 1M case was good with NDV count 5000 (91% pruned) but mostly useless with 1000 (5% pruned).
>
> So if you want to use bloom filters and expect them to be effective, there is no way around having a good NDV estimate. I would argue that all Parquet writer implementations should track this through sketches, but I see that the open source implementations do not do this yet. It should be fairly straightforward to implement, as there are open source sketch implementations available [1][2].
>
> As long as your Parquet writer implementation of choice doesn't do this, I would advise doing it yourself in client code: pick a number of rows that is roughly your row group size, compute an NDV estimate through a sketch (this should be orders of magnitude cheaper than writing the Parquet file, so it shouldn't matter much performance-wise), and then pass this estimate as the NDV parameter to the implementation. A rough sketch of how this could look follows after the list below.
>
> The only disadvantage worth mentioning is that when using sketches to estimate bloom filter size, you need two passes over the data per row group: one to estimate the bloom filter size, and one to fill the bloom filter. However, given that you otherwise risk the bloom filter being ineffective, this should be worth it.
>
> With this common understanding, I would like to draw a conclusion. Things along these lines should probably also be mentioned in the spec:
>
> - Bloom filters are amazing if you get the NDV count right; otherwise they quickly deteriorate into wasted space or a very high false positive probability.
> - For Parquet bloom filters, what matters is the NDV count in the respective row group, not the NDV count / cardinality of your whole data set. The two can be equal if you have no clustering or ordering, but a lot of real-world data is clustered, so don't expect to be able to infer one from the other.
> - Sketches can help you compute an NDV count that is good enough basically for free (compared to the cost of creating the bloom filter), but the writer will need to perform two passes over each row group.
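> Here is the rough sketch promised above of how the two passes could look in client code. It assumes the Rust parquet writer discussed in this thread and the hyperloglogplus crate from [1]; the helper names (estimate_ndv, writer_props) are just illustrative:
>
> use std::collections::hash_map::RandomState;
> use hyperloglogplus::{HyperLogLog, HyperLogLogPlus};
> use parquet::file::properties::WriterProperties;
>
> // Pass 1: estimate the NDV count of one row group's worth of values
> // with a HyperLogLog sketch (cheap compared to writing the file).
> fn estimate_ndv(values: &[i64]) -> u64 {
>     let mut hll: HyperLogLogPlus<i64, _> =
>         HyperLogLogPlus::new(14, RandomState::new()).unwrap();
>     for v in values {
>         hll.insert(v);
>     }
>     hll.count().round() as u64
> }
>
> // Pass 2: hand the estimate to the writer, which sizes the filter from it.
> fn writer_props(ndv: u64) -> WriterProperties {
>     WriterProperties::builder()
>         .set_bloom_filter_enabled(true)
>         .set_bloom_filter_ndv(ndv)
>         .build()
> }
>
> fn main() {
>     // A clustered 1M-row row group: each value repeated 100 times,
>     // so the true per-row-group NDV count is ~10'000, not 1'000'000.
>     let row_group: Vec<i64> = (0..1_000_000).map(|i| i / 100).collect();
>     let ndv = estimate_ndv(&row_group);
>     println!("estimated NDV: {}", ndv); // ~10'000
>     let _props = writer_props(ndv);
> }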
> And finally, one thing that the blog didn't measure: bloom filters can get *very big* if the NDV count of your row group is close to the number of values (i.e., the column is mostly unique). Since the blog only went up to an NDV count of ~10'000, it didn't measure the case where we have a (true) NDV count of 1'000'000 for a 1'000'000-row row group.
>
> As the table in the spec mentions, we need 16.2 bits (roughly 2 bytes) per unique value if we want 0.1% FPP. So let's say we have a unique 4-byte integer column. Then the values are 4 bytes each, while the bloom filter is 2 bytes per value, so the bloom filter makes up a third of the total size. But let's say we have a pattern in the data that allows us to encode the values more efficiently.
>
> For example, say the values are mostly incrementing with some gaps, so DELTA_BINARY_PACKED encoding can be used to mostly store 4-bit deltas. Suddenly, each encoded value might be ~0.5 bytes, but the bloom filter cannot benefit from this compression, so it is still 2 bytes per value. Boom, we have just created a bloom filter that is 4 times larger than the data itself, ouch!
>
> So size-wise, the conclusion should be:
>
> - Bloom filters are small if the (true) NDV count per row group << the row count per row group. Once the NDV count approaches the row count, a bloom filter can become as big as the data itself or even many times larger.
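> For a back-of-the-envelope feel for these sizes, here is a rough sketch, again in plain Rust and with the classic formula m = -n * ln(p) / (ln 2)^2 (which gives ~14.4 bits per value at 0.1% FPP; the spec's split block filters need a bit more, the ~16.2 bits from the table):
>
> // Size in bytes of a classic bloom filter for ndv distinct values
> // at target false positive probability fpp.
> fn filter_bytes(ndv: f64, fpp: f64) -> f64 {
>     let bits = -ndv * fpp.ln() / 2f64.ln().powi(2);
>     bits / 8.0
> }
>
> fn main() {
>     println!("{:.0} kB", filter_bytes(10_000.0, 0.001) / 1000.0); // ~18 kB
>     println!("{:.1} MB", filter_bytes(1_000_000.0, 0.001) / 1e6); // ~1.8 MB
> }
>
> The ~1.8MB for 1'000'000 NDVs at 0.1% FPP is also roughly where the ~2MB-per-column-per-row-group figure from earlier in this thread comes from.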
> Sorry for elaborating at such length here. This topic is quite dear to me, as I also had to learn all of this the hard way: our company also wanted to use bloom filters, and we spent a long time discussing why they didn't work as expected, so I had to work my way through all that theory. And I feel that this should simply be in the spec, so everyone can read up on it without having to derive it themselves.
>
> Cheers,
> Jan
>
> [1] https://docs.rs/hyperloglogplus/latest/hyperloglogplus/
> [2] https://datasketches.apache.org/docs/Community/Downloads.html
>
>
> On Mon, Jun 3, 2024 at 1:45 PM Andrew Lamb <[email protected]> wrote:
>
> > Thanks Jan -- interesting discussion
> >
> > Some points:
> > 1. We used the Rust implementation of parquet (not parquet-java).
> > 2. We were trying to find settings that would permit us to effectively prune row groups without having to know the exact shape of the data up front.
> > 3. The blog is an attempt to help others figure out how to configure parquet writers to get effective bloom filter pruning.
> >
> > > The blog post claims that just using 8kB instead of 2MB works just fine as well. But that doesn't make sense, theory-wise, unless the FPP formula is wrong (which it isn't; it's proven).
> >
> > One reason we did an empirical study was to see how to match the theory to practice. All the code used is open source [1] and we would welcome either corrections to the code or help understanding why your theory-based argument doesn't seem to match what we measured in practice.
> >
> > > Side comment: First, the blog post talks about "varying the NDV parameter". However, NDV isn't a parameter that one can vary freely. It is supposed to be the *true* NDV count in the data set. That is, for Parquet it should be the true NDV count in the row group for which the bloom filter is built. Nothing to vary here.
> >
> > The blog is referring to the Rust parquet writer settings [2] which do let you vary the NDV parameter used to create the Bloom filters. There is nothing in the spec to my knowledge that requires the NDV used to match the actual NDV in the data.
> >
> > > I guess your experiment actually measures the best case, as then your numbers are in perfect sync with theory: with 8kB (6.4% FPP), over 90% of row groups are filtered, while with 16kB (0.2% FPP) all row groups are correctly pruned. But this result just confirms that Parquet's sizing formula is correct.
> >
> > This seems like an excellent conclusion to me. I don't think we meant to imply there is anything incorrect with the spec.
> >
> > Andrew
> >
> > [1] https://github.com/influxdata/parquet-bloom-filter-analysis
> >
> > [2] https://docs.rs/parquet/latest/parquet/file/properties/struct.WriterPropertiesBuilder.html#method.set_bloom_filter_ndv
> >
> > > Now, if I understand the claims of the blog post correctly, the blog post basically says that if you underestimate the NDV count heavily, you still get great pruning results. And this just contradicts the FPP formula. As just elaborated, by underestimating the NDV count you should basically get a higher FPP, which should reduce the number of row groups you can prune. But the measurement seems to hint that you can still prune perfectly. So, basically, the blog post empirically finds that the FPP formula for bloom filters is wrong (or do I misunderstand the blog post here?).
> > >
> > > *Now comes the important point, and I think here is where the problem lies:* The blog post doesn't say by how much the NDV is underestimated, though. It mentions the true NDV count of the whole data set (called cardinality in the blog post), but it does not mention what that means for the true NDV count *per row group*. The latter depends on the sorting of the data. Is it sorted? Or is it random? In the former case, you will have far fewer NDVs per row group than in the latter case (around 100x fewer in the 1M cardinality case).
> > >
> > > *Let's assume the worst case* (close to random ordering of rows): for the 1M cardinality example, we would really also get a true 1M NDV count in each row group, so basically only unique values. In this case, per the FPP formula fpp = (1 - e^(-k*n/m))^k, with k = 8, n = 1'000'000, and m = 64'000 (8kB in bits), we get basically 1.000, so 100% false positive probability. And that also matches intuition: each distinct value sets 8 bits, and since you only have 64'000 bits to set but you're setting 8 million bits, each bit ends up being 1 with almost certain probability, so you end up with a useless bloom filter consisting of only 1-bits.
> > >
> > > *Let's assume the best case:* if the values are sorted and we have a 1 million NDV count for 100 million values, that means we have a 10'000 NDV count in each row group (each value duplicated 100 times). For this case, the expected FPP with an 8kB bloom filter is 6.4% (k = 8, n = 10'000, m = 64'000).
> > >
> > > For this case, the bloom filter is definitely usable. However, for this case, the default Parquet sizing formula would also not have yielded 2MB, but rather 12kB, so if this is the case that your benchmark measures, then your values are pretty much in agreement with the Parquet sizing formula and there is just no discrepancy.
> > >
> > > I guess your experiment actually measures the best case, as then your numbers are in perfect sync with theory: with 8kB (6.4% FPP), over 90% of row groups are filtered, while with 16kB (0.2% FPP) all row groups are correctly pruned. But this result just confirms that Parquet's sizing formula is correct.
> > > So could it be that the blog post accidentally measures an NDV count of 10'000, while the intuition given is that we're dealing with an NDV count of 1'000'000 in the 1'000'000 cardinality case?
> > >
> > > Cheers
> > > Jan
> > >
> > >
> > > On Sat, Jun 1, 2024 at 9:37 AM Micah Kornfield <[email protected]> wrote:
> > >
> > > > I think the table is useful. I think there are calculators online that do this pretty easily, but putting it into the docs might allow at least some users to avoid unpleasant surprises. In terms of generalizing to smaller NDV counts and their effectiveness, we might just want to state the result but provide strong caveats to benchmark with their own data?
> > > >
> > > > Cheers,
> > > > Micah
> > > >
> > > >
> > > > On Fri, May 31, 2024 at 1:59 PM Andrew Lamb <[email protected]> wrote:
> > > >
> > > > > When we first looked into Parquet bloom filters [1] it was hard to understand how effective they would be for a given amount of space overhead.
> > > > >
> > > > > When we plugged our data's cardinality into the target ndv and fpp parameters, it implied 2MB bloom filters *per column* per row group, which was unacceptable.
> > > > >
> > > > > However, when we did empirical tests (see the blog [2] from Trevor Hilton), we found 2kB-8kB worked quite well.
> > > > >
> > > > > Is there any interest in porting some of the information from the blog into the spec (specifically the tables of size based on fpp/ndv)? Or is this better as a third-party resource / exercise for the reader?
> > > > >
> > > > > Andrew
> > > > >
> > > > > [1]: https://parquet.apache.org/docs/file-format/bloomfilter/
> > > > > [2]: https://www.influxdata.com/blog/using-parquets-bloom-filters/
