[CARBONDATA-2790][BloomDataMap]Optimize default parameter for bloomfilter datamap

To improve the default query performance of the bloomfilter datamap, this
change raises the default bloom_size from 32000 to 640000 and lowers the
default bloom_fpp from 0.01 to 0.00001.

This closes #2567

Project: http://git-wip-us.apache.org/repos/asf/carbondata/repo
Commit: http://git-wip-us.apache.org/repos/asf/carbondata/commit/9dfa8a43
Tree: http://git-wip-us.apache.org/repos/asf/carbondata/tree/9dfa8a43
Diff: http://git-wip-us.apache.org/repos/asf/carbondata/diff/9dfa8a43

Branch: refs/heads/branch-1.4
Commit: 9dfa8a43872c4db3390dbaf28a86311f1723c6b0
Parents: 259868c
Author: xuchuanyin <xuchuan...@hust.edu.cn>
Authored: Fri Jul 27 11:54:21 2018 +0800
Committer: ravipesala <ravi.pes...@gmail.com>
Committed: Thu Aug 9 23:38:51 2018 +0530

----------------------------------------------------------------------
 .../datamap/bloom/BloomCoarseGrainDataMapFactory.java | 6 +++---
 docs/datamap/bloomfilter-datamap-guide.md             | 4 ++--
 2 files changed, 5 insertions(+), 5 deletions(-)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/carbondata/blob/9dfa8a43/datamap/bloom/src/main/java/org/apache/carbondata/datamap/bloom/BloomCoarseGrainDataMapFactory.java
----------------------------------------------------------------------
diff --git a/datamap/bloom/src/main/java/org/apache/carbondata/datamap/bloom/BloomCoarseGrainDataMapFactory.java b/datamap/bloom/src/main/java/org/apache/carbondata/datamap/bloom/BloomCoarseGrainDataMapFactory.java
index 652e1fc..80a86cc 100644
--- a/datamap/bloom/src/main/java/org/apache/carbondata/datamap/bloom/BloomCoarseGrainDataMapFactory.java
+++ b/datamap/bloom/src/main/java/org/apache/carbondata/datamap/bloom/BloomCoarseGrainDataMapFactory.java
@@ -69,15 +69,15 @@ public class BloomCoarseGrainDataMapFactory extends DataMapFactory<CoarseGrainDa
    * default size for bloom filter, cardinality of the column.
    */
   private static final int DEFAULT_BLOOM_FILTER_SIZE =
-      CarbonV3DataFormatConstants.NUMBER_OF_ROWS_PER_BLOCKLET_COLUMN_PAGE_DEFAULT;
+      CarbonV3DataFormatConstants.NUMBER_OF_ROWS_PER_BLOCKLET_COLUMN_PAGE_DEFAULT * 20;
   /**
    * property for fpp(false-positive-probability) of bloom filter
    */
   private static final String BLOOM_FPP = "bloom_fpp";
   /**
-   * default value for fpp of bloom filter is 1%
+   * default value for fpp of bloom filter is 0.001%
    */
-  private static final double DEFAULT_BLOOM_FILTER_FPP = 0.01d;
+  private static final double DEFAULT_BLOOM_FILTER_FPP = 0.00001d;
   /**
    * property for compressing bloom while saving to disk.

http://git-wip-us.apache.org/repos/asf/carbondata/blob/9dfa8a43/docs/datamap/bloomfilter-datamap-guide.md
----------------------------------------------------------------------
diff --git a/docs/datamap/bloomfilter-datamap-guide.md b/docs/datamap/bloomfilter-datamap-guide.md
index 325a508..2dba3dc 100644
--- a/docs/datamap/bloomfilter-datamap-guide.md
+++ b/docs/datamap/bloomfilter-datamap-guide.md
@@ -83,8 +83,8 @@ User can create BloomFilter datamap using the Create DataMap DDL:
 | Property | Is Required | Default Value | Description |
 |-------------|----------|--------|---------|
 | INDEX_COLUMNS | YES | | Carbondata will generate BloomFilter index on these columns. Queries on there columns are usually like 'COL = VAL'. |
-| BLOOM_SIZE | NO | 32000 | This value is internally used by BloomFilter as the number of expected insertions, it will affects the size of BloomFilter index. Since each blocklet has a BloomFilter here, so the value is the approximate records in a blocklet. In another word, the value 32000 * #noOfPagesInBlocklet. The value should be an integer. |
-| BLOOM_FPP | NO | 0.01 | This value is internally used by BloomFilter as the False-Positive Probability, it will affects the size of bloomfilter index as well as the number of hash functions for the BloomFilter. The value should be in range (0, 1). |
+| BLOOM_SIZE | NO | 640000 | This value is internally used by BloomFilter as the number of expected insertions, it will affects the size of BloomFilter index. Since each blocklet has a BloomFilter here, so the default value is the approximate distinct index values in a blocklet assuming that each blocklet contains 20 pages and each page contains 32000 records. The value should be an integer. |
+| BLOOM_FPP | NO | 0.00001 | This value is internally used by BloomFilter as the False-Positive Probability, it will affects the size of bloomfilter index as well as the number of hash functions for the BloomFilter. The value should be in range (0, 1). In one test scenario, a 96GB TPCH customer table with bloom_size=320000 and bloom_fpp=0.00001 will result in 18 false positive samples. |
 | BLOOM_COMPRESS | NO | true | Whether to compress the BloomFilter index files. |
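
The guide text above says that BLOOM_SIZE and BLOOM_FPP both affect the index size and the number of hash functions. As a rough illustration of that trade-off, the sketch below applies the textbook Bloom filter sizing formulas to the old and new defaults. The class and method names (`BloomSizing`, `optimalNumOfBits`, `optimalNumOfHashFunctions`) are hypothetical and not part of CarbonData, and the sizing that CarbonData actually performs internally may differ from these formulas.

```java
// Hypothetical helper, not CarbonData code: textbook Bloom filter sizing.
public class BloomSizing {

  // Bits needed for n expected insertions at false-positive probability p:
  // m = -n * ln(p) / (ln 2)^2
  static long optimalNumOfBits(long n, double p) {
    double ln2 = Math.log(2);
    return (long) Math.ceil(-n * Math.log(p) / (ln2 * ln2));
  }

  // Number of hash functions that minimises the false-positive rate:
  // k = (m / n) * ln 2, and never fewer than 1
  static int optimalNumOfHashFunctions(long n, long m) {
    return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
  }

  static void report(String label, long n, double p) {
    long bits = optimalNumOfBits(n, p);
    int hashes = optimalNumOfHashFunctions(n, bits);
    System.out.printf("%s: n=%d, fpp=%s -> %d bits (~%.1f KiB), %d hash functions%n",
        label, n, p, bits, bits / 8.0 / 1024.0, hashes);
  }

  public static void main(String[] args) {
    report("old defaults", 32000L, 0.01);
    report("new defaults", 640000L, 0.00001);
  }
}
```

Under these formulas the new defaults need roughly 24 bits per expected insertion instead of roughly 10, and 20 times as many expected insertions, so each uncompressed filter is considerably larger in exchange for far fewer false positives.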