lyang24 opened a new issue, #9761: URL: https://github.com/apache/arrow-rs/issues/9761
Hey, I wanted to float an idea: adding the [Ribbon filter](https://arxiv.org/pdf/2103.02515) (shipping in RocksDB since 6.15) as a second option next to the existing SBBF in `bloom_filter/`. The writer picks one, and the reader dispatches on the Thrift algorithm tag.

### Why I think it's worth talking about

SBBF sits around 10 bits/key for 1% FPR. Ribbon gets close to the information-theoretic floor: ~6.7 bits/key for the same FPR. That's roughly a third off the bloom footprint. For a Parquet file with bloom filters on a handful of columns across a bunch of row groups, that adds up to real bytes per file.

Where I think it actually shows up:

- Cold opens from S3 / GCS: fewer bytes per `GET`
- Lakehouse setups with tons of small-ish files: the metadata cache holds more files
- Anywhere DataFusion's `prune_by_bloom_filters` is doing real work today

What I don't want to oversell:

- Query throughput is ~3× slower per probe in the paper's benchmark. But the paper's own limitations section calls out that this is a throughput measurement; for uncorrelated single probes (which is what Parquet actually does), latency is memory-bound and basically a wash.
- Construction is 6–25× slower per key. That's a real cost on the writer side. Probably fine for write-once lake data, probably annoying for high-QPS streaming ETL.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
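To put rough numbers on the size comparison in the issue body, here is a minimal back-of-the-envelope sketch in Rust. It is not arrow-rs code: the function names are hypothetical, the 10 and 6.7 bits/key figures are the ones quoted in the issue, and the estimate ignores block granularity and any size rounding a real filter implementation applies.

```rust
/// Information-theoretic lower bound on bits per key for an approximate
/// membership filter with false-positive rate `fpr`: log2(1 / fpr).
pub fn lower_bound_bits_per_key(fpr: f64) -> f64 {
    (1.0 / fpr).log2()
}

/// Approximate filter size in bytes for `num_keys` distinct values at a
/// given bits-per-key budget, rounded up to whole bytes.
/// (Hypothetical helper; real filters also round to their block size.)
pub fn filter_bytes(num_keys: u64, bits_per_key: f64) -> u64 {
    ((num_keys as f64) * bits_per_key / 8.0).ceil() as u64
}

/// Fraction of space saved when moving from `baseline` to `candidate`
/// bits per key.
pub fn relative_saving(baseline: f64, candidate: f64) -> f64 {
    1.0 - candidate / baseline
}
```

For example, `lower_bound_bits_per_key(0.01)` is log2(100), a little over 6.6 bits/key, and `relative_saving(10.0, 6.7)` is 0.33, which matches the "roughly a third" estimate. For a column chunk with one million distinct values (an assumed figure), `filter_bytes` gives about 1.25 MB at 10 bits/key versus roughly 0.84 MB at 6.7 bits/key.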
