krylov-krylov commented on code in PR #37: URL: https://github.com/apache/parquet-site/pull/37#discussion_r1451890553
########## content/en/docs/File Format/bloomfilter.md: ########## @@ -0,0 +1,335 @@ +--- +title: "Bloom Filter" +linkTitle: "Bloom Filter" +weight: 7 +--- +### Problem statement +In their current format, column statistics and dictionaries can be used for predicate +pushdown. Statistics include minimum and maximum value, which can be used to filter out +values not in the range. Dictionaries are more specific, and readers can filter out values +that are between min and max but not in the dictionary. However, when there are too many +distinct values, writers sometimes choose not to add dictionaries because of the extra +space they occupy. This leaves columns with large cardinalities and widely separated min +and max without support for predicate pushdown. + +A [Bloom filter](https://en.wikipedia.org/wiki/Bloom_filter) is a compact data structure that +overapproximates a set. It can respond to membership queries with either "definitely no" or +"probably yes", where the probability of false positives is configured when the filter is +initialized. Bloom filters do not have false negatives. + +Because Bloom filters are small compared to dictionaries, they can be used for predicate +pushdown even in columns with high cardinality and when space is at a premium. + +### Goal +* Enable predicate pushdown for high-cardinality columns while using less space than + dictionaries. + +* Induce no additional I/O overhead when executing queries on columns without Bloom + filters attached or when executing non-selective queries. + +### Technical Approach + +The section describes split block Bloom filters, which is the first +(and, at time of writing, only) Bloom filter representation supported +in Parquet. + +First we will describe a "block". This is the main component split +block Bloom filters are composed of. + +Each block is 256 bits, broken up into eight contiguous "words", each +consisting of 32 bits. Each word is thought of as an array of bits; +each bit is either "set" or "not set". + +When initialized, a block is "empty", which means each of the eight +component words has no bits set. In addition to initialization, a +block supports two other operations: `block_insert` and +`block_check`. Both take a single unsigned 32-bit integer as input; +`block_insert` returns no value, but modifies the block, while +`block_check` returns a boolean. The semantics of `block_check` are +that it must return `true` if `block_insert` was previously called on +the block with the same argument, and otherwise it returns `false` +with high probability. For more details of the probability, see below. + +The operations `block_insert` and `block_check` depend on some +auxiliary artifacts. First, there is a sequence of eight odd unsigned +32-bit integer constants called the `salt`. Second, there is a method +called `mask` that takes as its argument a single unsigned 32-bit +integer and returns a block in which each word has exactly one bit +set. + +``` +unsigned int32 salt[8] = {0x47b6137bU, 0x44974d91U, 0x8824ad5bU, + 0xa2b7289dU, 0x705495c7U, 0x2df1424bU, + 0x9efc4947U, 0x5c6bfb31U} + Review Comment: 0x678fcf657e1e90658d540bca155bca7df8f512bc -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
