dclim commented on a change in pull request #6397: Adds bloom filter aggregator to 'druid-bloom-filters' extension
URL: https://github.com/apache/incubator-druid/pull/6397#discussion_r221792170
##########
File path: docs/content/development/extensions-core/bloom-filter.md
##########
@@ -42,4 +50,53 @@ Internally, this implementation of bloom filter uses Murmur3 fast non-cryptograp
- 1 big endian int(That is how OutputStream works) for the number of longs in the bitset
- big endian longs in the BloomKFilter bitset
-Note: `org.apache.hive.common.util.BloomKFilter` provides a serialize method which can be used to serialize bloom filters to outputStream.
\ No newline at end of file
+Note: `org.apache.hive.common.util.BloomKFilter` provides a serialize method which can be used to serialize bloom filters to outputStream.
+
+## Bloom Filter Query Aggregator
+Input for a `bloomKFilter` can also be created from a druid query with the `bloom` aggregator.
+
+### JSON Specification of Bloom Filter Aggregator
+```json
+{
+ "type": "bloomFilter",
+ "name": <output_field_name>,
+ "maxNumEntries": <maximum_number_of_elements_for_BloomKFilter>
+ "field": <dimension_spec>
+ }
+```
+
+|Property |Description |required? |
+|-------------------------|------------------------------|----------------------------------|
+|`type` |Aggregator Type. Should always be `bloom`|yes|
+|`name` |Output field name |yes|
+|`field` |[DimensionSpec](./../dimensionspecs.html) to add to `org.apache.hive.common.util.BloomKFilter` | yes |
+|`maxNumEntries` |Maximum number of distinct values supported by `org.apache.hive.common.util.BloomKFilter`, default `1500`| no |
Review comment:
I think it'd be worthwhile under `maxNumEntries` to discuss the implications
of adding more elements than the value provided here. Some guidance on how to
choose an appropriate value to get a given false-positive rate would also be
helpful.
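
As a rough illustration of the trade-off raised here, the sketch below uses the textbook bloom filter sizing formulas. It is an assumption that `BloomKFilter` sizes itself this way. The 0.05 target rate and all function names are hypothetical and chosen only for the example.

```python
import math


def optimal_bits(max_entries, target_fpp):
    """Textbook bit count m for max_entries distinct values at target_fpp."""
    return math.ceil(-max_entries * math.log(target_fpp) / (math.log(2) ** 2))


def optimal_hashes(max_entries, num_bits):
    """Textbook hash-function count k for num_bits bits (at least 1)."""
    return max(1, round(num_bits / max_entries * math.log(2)))


def expected_fpp(inserted, num_bits, num_hashes):
    """Approximate false-positive rate after `inserted` distinct values."""
    return (1.0 - math.exp(-num_hashes * inserted / num_bits)) ** num_hashes


# Assumed inputs: the documented default of 1500 entries and a
# hypothetical 5% target false-positive rate.
max_entries = 1500
num_bits = optimal_bits(max_entries, 0.05)
num_hashes = optimal_hashes(max_entries, num_bits)

# Compare the estimated rate at capacity and at 2x, 4x the configured size.
for inserted in (max_entries, 2 * max_entries, 4 * max_entries):
    rate = expected_fpp(inserted, num_bits, num_hashes)
    print(f"{inserted} distinct values -> estimated false-positive rate {rate:.3f}")
```

Under these assumptions the estimated rate climbs steeply once the number of distinct values exceeds the configured size, which is the kind of implication the docs could spell out for `maxNumEntries`.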