Jimexist opened a new issue, #3138:
URL: https://github.com/apache/arrow-rs/issues/3138
Thank you @Jimexist -- this is very cool. I went through the code
fairly thoroughly. I had some minor suggestions / comments for documentation
and code structure but nothing that would block merging.
I think the biggest thing I would like to discuss is "what parameters to
expose for the writer API". For example, will users of this feature be able
to set "fpp" (false positive probability) and "ndv" (number of distinct
values) reasonably? I suppose knowing the number of distinct values before
writing a Parquet file is reasonable, but maybe not the expected number of
distinct values for each row group.
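To make the trade-off between these two parameters concrete, here is a minimal sketch of how "fpp" and "ndv" translate into a filter size. It uses the textbook formulas for a classic Bloom filter, which is an assumption: the split-block filter in the Parquet format may be sized differently. The function names `classic_bloom_num_bits` and `classic_bloom_num_hashes` are hypothetical and not part of any arrow-rs API.

```rust
use std::f64::consts::LN_2;

/// Textbook sizing for a classic Bloom filter (an assumption for illustration,
/// not necessarily the sizing used by the Parquet split-block filter).
/// Returns m = ceil(-n * ln(p) / (ln 2)^2) bits, or None for invalid input.
pub fn classic_bloom_num_bits(ndv: u64, fpp: f64) -> Option<u64> {
    // ndv must be positive and fpp must lie strictly between 0 and 1.
    if ndv == 0 || !(fpp > 0.0 && fpp < 1.0) {
        return None;
    }
    let bits = -(ndv as f64) * fpp.ln() / (LN_2 * LN_2);
    Some(bits.ceil() as u64)
}

/// Textbook number of hash functions, k = (m / n) * ln 2, rounded and at least 1.
pub fn classic_bloom_num_hashes(ndv: u64, num_bits: u64) -> u64 {
    if ndv == 0 {
        return 1;
    }
    let k = (num_bits as f64 / ndv as f64) * LN_2;
    (k.round() as u64).max(1)
}
```

Under these formulas the filter size grows linearly with "ndv" and only logarithmically with 1/"fpp", so a poor "ndv" estimate for a row group costs more than a loosely chosen "fpp".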
I did some research on other implementations. Here are the Spark settings:
https://spark.apache.org/docs/latest/configuration.html
Property Name | Default | Meaning | Since Version
-- | -- | -- | --
spark.sql.optimizer.runtime.bloomFilter.creationSideThreshold | 10MB | Size threshold of the bloom filter creation side plan. Estimated size needs to be under this value to try to inject bloom filter. | 3.3.0
spark.sql.optimizer.runtime.bloomFilter.enabled | false | When true and if one side of a shuffle join has a selective predicate, we attempt to insert a bloom filter in the other side to reduce the amount of shuffle data. | 3.3.0
spark.sql.optimizer.runtime.bloomFilter.expectedNumItems | 1000000 | The default number of expected items for the runtime bloomfilter | 3.3.0
spark.sql.optimizer.runtime.bloomFilter.maxNumBits | 67108864 | The max number of bits to use for the runtime bloom filter | 3.3.0
spark.sql.optimizer.runtime.bloomFilter.maxNumItems | 4000000 | The max allowed number of expected items for the runtime bloom filter | 3.3.0
spark.sql.optimizer.runtime.bloomFilter.numBits | 8388608 | The default number of bits to use for the runtime bloom filter | 3.3.0
The Arrow C++ writer seems to allow setting the fpp (note that the linked
option belongs to the ORC adapter's `WriteOptions`, not the Parquet writer):
https://arrow.apache.org/docs/cpp/api/formats.html#_CPPv4N5arrow8adapters3orc12WriteOptions16bloom_filter_fppE
```
double bloom_filter_fpp = 0.05
The upper limit of the false-positive rate of the bloom filter, default 0.05.
```
Databricks seems to expose the fpp, max_fpp, and num distinct values:
https://docs.databricks.com/sql/language-manual/delta-create-bloomfilter-index.html
_Originally posted by @alamb in
https://github.com/apache/arrow-rs/pull/3119#pullrequestreview-1186585988_