[04/47] carbondata git commit: [CARBONDATA-2790][BloomDataMap]Optimize default parameter for bloomfilter datamap

ravipesala Thu, 09 Aug 2018 11:25:51 -0700

[CARBONDATA-2790][BloomDataMap]Optimize default parameter for bloomfilter datamap


To provide better query performance for the bloomfilter datamap by default,
this commit raises the default bloom_size from 32000 to 640000 and lowers
the default bloom_fpp from 0.01 to 0.00001.
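
The effect of these two defaults on index size can be estimated with the textbook Bloom filter sizing formulas, m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions, where n is the expected number of insertions (bloom_size) and p is the false-positive probability (bloom_fpp). The sketch below is illustrative only: the class and method names are hypothetical, and it assumes the datamap's underlying BloomFilter sizes itself close to these formulas, which this commit does not confirm.

```java
// Hypothetical helper, not part of CarbonData: estimates Bloom filter size
// from the textbook formulas for n expected insertions and fpp p.
class BloomSizing {

    /** Approximate number of bits: m = -n * ln(p) / (ln 2)^2. */
    static long optimalNumOfBits(long n, double p) {
        double ln2 = Math.log(2);
        return (long) Math.ceil(-n * Math.log(p) / (ln2 * ln2));
    }

    /** Approximate number of hash functions: k = (m / n) * ln 2, at least 1. */
    static int optimalNumOfHashFunctions(long n, long m) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    /** Prints the estimated bits, bytes and hash count for one setting. */
    static void report(String label, long n, double p) {
        long bits = optimalNumOfBits(n, p);
        int hashes = optimalNumOfHashFunctions(n, bits);
        System.out.println(label + ": n=" + n + ", fpp=" + p
            + ", bits=" + bits + ", bytes=" + (bits / 8) + ", hashes=" + hashes);
    }

    public static void main(String[] args) {
        // Old defaults: one page of 32000 rows, 1% false positives.
        report("old defaults", 32000L, 0.01d);
        // New defaults: 20 pages of 32000 rows each, 0.001% false positives.
        report("new defaults", 32000L * 20, 0.00001d);
    }
}
```

Under these assumptions the estimate grows from a few tens of kilobytes per filter to a little under 2 MB per filter before compression, so the better pruning comes at the cost of a noticeably larger index.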

This closes #2567


Project: http://git-wip-us.apache.org/repos/asf/carbondata/repo
Commit: http://git-wip-us.apache.org/repos/asf/carbondata/commit/9dfa8a43
Tree: http://git-wip-us.apache.org/repos/asf/carbondata/tree/9dfa8a43
Diff: http://git-wip-us.apache.org/repos/asf/carbondata/diff/9dfa8a43

Branch: refs/heads/branch-1.4
Commit: 9dfa8a43872c4db3390dbaf28a86311f1723c6b0
Parents: 259868c
Author: xuchuanyin <xuchuan...@hust.edu.cn>
Authored: Fri Jul 27 11:54:21 2018 +0800
Committer: ravipesala <ravi.pes...@gmail.com>
Committed: Thu Aug 9 23:38:51 2018 +0530

----------------------------------------------------------------------
 .../datamap/bloom/BloomCoarseGrainDataMapFactory.java          | 6 +++---
 docs/datamap/bloomfilter-datamap-guide.md                      | 4 ++--
 2 files changed, 5 insertions(+), 5 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/carbondata/blob/9dfa8a43/datamap/bloom/src/main/java/org/apache/carbondata/datamap/bloom/BloomCoarseGrainDataMapFactory.java
----------------------------------------------------------------------
diff --git a/datamap/bloom/src/main/java/org/apache/carbondata/datamap/bloom/BloomCoarseGrainDataMapFactory.java b/datamap/bloom/src/main/java/org/apache/carbondata/datamap/bloom/BloomCoarseGrainDataMapFactory.java
index 652e1fc..80a86cc 100644
--- a/datamap/bloom/src/main/java/org/apache/carbondata/datamap/bloom/BloomCoarseGrainDataMapFactory.java
+++ b/datamap/bloom/src/main/java/org/apache/carbondata/datamap/bloom/BloomCoarseGrainDataMapFactory.java
@@ -69,15 +69,15 @@ public class BloomCoarseGrainDataMapFactory extends DataMapFactory<CoarseGrainDa
    * default size for bloom filter, cardinality of the column.
    */
   private static final int DEFAULT_BLOOM_FILTER_SIZE =
-      CarbonV3DataFormatConstants.NUMBER_OF_ROWS_PER_BLOCKLET_COLUMN_PAGE_DEFAULT;
+      CarbonV3DataFormatConstants.NUMBER_OF_ROWS_PER_BLOCKLET_COLUMN_PAGE_DEFAULT * 20;
   /**
    * property for fpp(false-positive-probability) of bloom filter
    */
   private static final String BLOOM_FPP = "bloom_fpp";
   /**
-   * default value for fpp of bloom filter is 1%
+   * default value for fpp of bloom filter is 0.001%
    */
-  private static final double DEFAULT_BLOOM_FILTER_FPP = 0.01d;
+  private static final double DEFAULT_BLOOM_FILTER_FPP = 0.00001d;
 
   /**
    * property for compressing bloom while saving to disk.

http://git-wip-us.apache.org/repos/asf/carbondata/blob/9dfa8a43/docs/datamap/bloomfilter-datamap-guide.md
----------------------------------------------------------------------
diff --git a/docs/datamap/bloomfilter-datamap-guide.md b/docs/datamap/bloomfilter-datamap-guide.md
index 325a508..2dba3dc 100644
--- a/docs/datamap/bloomfilter-datamap-guide.md
+++ b/docs/datamap/bloomfilter-datamap-guide.md
@@ -83,8 +83,8 @@ User can create BloomFilter datamap using the Create DataMap DDL:
 | Property | Is Required | Default Value | Description |
 |-------------|----------|--------|---------|
 | INDEX_COLUMNS | YES |  | Carbondata will generate BloomFilter index on these columns. Queries on there columns are usually like 'COL = VAL'. |
-| BLOOM_SIZE | NO | 32000 | This value is internally used by BloomFilter as the number of expected insertions, it will affects the size of BloomFilter index. Since each blocklet has a BloomFilter here, so the value is the approximate records in a blocklet. In another word, the value 32000 * #noOfPagesInBlocklet. The value should be an integer. |
-| BLOOM_FPP | NO | 0.01 | This value is internally used by BloomFilter as the False-Positive Probability, it will affects the size of bloomfilter index as well as the number of hash functions for the BloomFilter. The value should be in range (0, 1). |
+| BLOOM_SIZE | NO | 640000 | This value is internally used by BloomFilter as the number of expected insertions, it will affects the size of BloomFilter index. Since each blocklet has a BloomFilter here, so the default value is the approximate distinct index values in a blocklet assuming that each blocklet contains 20 pages and each page contains 32000 records. The value should be an integer. |
+| BLOOM_FPP | NO | 0.00001 | This value is internally used by BloomFilter as the False-Positive Probability, it will affects the size of bloomfilter index as well as the number of hash functions for the BloomFilter. The value should be in range (0, 1). In one test scenario, a 96GB TPCH customer table with bloom_size=320000 and bloom_fpp=0.00001 will result in 18 false positive samples. |
 | BLOOM_COMPRESS | NO | true | Whether to compress the BloomFilter index files. |
