[incubator-hudi] branch master updated: HUDI-101: added exclusion filters for signature files.

2019-05-07 Thread vbalaji
This is an automated email from the ASF dual-hosted git repository.

vbalaji pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git


The following commit(s) were added to refs/heads/master by this push:
 new e2dcef8  HUDI-101: added exclusion filters for signature files.
e2dcef8 is described below

commit e2dcef860672c140e4b66091292ff3bee3d20110
Author: Abhishek Sharma 
AuthorDate: Tue May 7 15:17:36 2019 -0400

HUDI-101: added exclusion filters for signature files.
---
 hoodie-utilities/pom.xml  | 9 +
 packaging/hoodie-hadoop-mr-bundle/pom.xml | 9 +
 packaging/hoodie-hive-bundle/pom.xml  | 9 +
 packaging/hoodie-presto-bundle/pom.xml| 9 +
 packaging/hoodie-spark-bundle/pom.xml | 9 +
 5 files changed, 45 insertions(+)

diff --git a/hoodie-utilities/pom.xml b/hoodie-utilities/pom.xml
index 00b853d..15edede 100644
--- a/hoodie-utilities/pom.xml
+++ b/hoodie-utilities/pom.xml
@@ -136,6 +136,15 @@
   
com.uber.hoodie.com.esotericsoftware.minlog.
 
   
+  
+
+  
+META-INF/*.SF
+META-INF/*.DSA
+META-INF/*.RSA
+  
+
+  
 
   
 
diff --git a/packaging/hoodie-hadoop-mr-bundle/pom.xml b/packaging/hoodie-hadoop-mr-bundle/pom.xml
index f25d0f7..fa6c799 100644
--- a/packaging/hoodie-hadoop-mr-bundle/pom.xml
+++ b/packaging/hoodie-hadoop-mr-bundle/pom.xml
@@ -228,6 +228,15 @@
   com.esotericsoftware:minlog
 
   
+  
+
+  
+META-INF/*.SF
+META-INF/*.DSA
+META-INF/*.RSA
+  
+
+  
   ${project.artifactId}-${project.version}
 
   
diff --git a/packaging/hoodie-hive-bundle/pom.xml b/packaging/hoodie-hive-bundle/pom.xml
index 6146236..9c66eff 100644
--- a/packaging/hoodie-hive-bundle/pom.xml
+++ b/packaging/hoodie-hive-bundle/pom.xml
@@ -209,6 +209,15 @@
   org.apache.derby:derby
 
   
+  
+
+  
+META-INF/*.SF
+META-INF/*.DSA
+META-INF/*.RSA
+  
+
+  
   ${project.artifactId}-${project.version}
 
   
diff --git a/packaging/hoodie-presto-bundle/pom.xml b/packaging/hoodie-presto-bundle/pom.xml
index 8405b90..945b9b9 100644
--- a/packaging/hoodie-presto-bundle/pom.xml
+++ b/packaging/hoodie-presto-bundle/pom.xml
@@ -187,6 +187,15 @@
   org.apache.httpcomponents:*
 
   
+  
+
+  
+META-INF/*.SF
+META-INF/*.DSA
+META-INF/*.RSA
+  
+
+  
   ${project.artifactId}-${project.version}
 
   
diff --git a/packaging/hoodie-spark-bundle/pom.xml b/packaging/hoodie-spark-bundle/pom.xml
index f1e3cee..b95554c 100644
--- a/packaging/hoodie-spark-bundle/pom.xml
+++ b/packaging/hoodie-spark-bundle/pom.xml
@@ -177,6 +177,15 @@
   org.apache.spark:*
 
   
+  
+
+  
+META-INF/*.SF
+META-INF/*.DSA
+META-INF/*.RSA
+  
+
+  
   ${project.artifactId}-${project.version}
 
   



[GitHub] [incubator-hudi] bvaradar merged pull request #669: HUDI-101: added exclusion filters for signature files.

2019-05-07 Thread GitBox
bvaradar merged pull request #669: HUDI-101: added exclusion filters for 
signature files.
URL: https://github.com/apache/incubator-hudi/pull/669
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-hudi] pseudomuto opened a new pull request #670: SparkUtil#initLauncher shouldn't raise when spark-defaults.conf doesn't exist

2019-05-07 Thread GitBox
pseudomuto opened a new pull request #670: SparkUtil#initLauncher shouldn't 
raise when spark-defaults.conf doesn't exist
URL: https://github.com/apache/incubator-hudi/pull/670
 
 
   @vinothchandar @n3nash 
   
   # The Problem
   
   When attempting to run the `hdfsparquetimport` command from the hoodie-cli, I 
ran into an issue where it throws when the file _spark-defaults.conf_ is not 
present.
   
   The `SparkLauncher` has a 
[check](https://github.com/apache/spark/blob/5e79ae3b40b76e3473288830ab958fc4834dcb33/launcher/src/main/java/org/apache/spark/launcher/AbstractLauncher.java#L45)
 that throws when this value is `null`, which is the case when using a default 
Spark download, for example.
   
   # The Solution
   
   Since I couldn't find any reason to require this file to be present, I simply 
added an `isEmpty` check so that the properties file is only set when the value 
isn't `null` or `""`.
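
   A minimal sketch of that guard, not the actual patch. `FakeLauncher` and 
`LauncherGuard` are hypothetical stand-ins so the example runs without Spark on 
the classpath; the stand-in setter only mimics the null check linked above.

```java
// Hypothetical stand-in for the launcher; only the null-rejecting setter matters here.
final class FakeLauncher {
    String propertiesFile; // stays null unless a usable path is supplied

    FakeLauncher setPropertiesFile(String path) {
        if (path == null) {
            throw new IllegalArgumentException("path must not be null");
        }
        this.propertiesFile = path;
        return this;
    }
}

// Hypothetical helper illustrating the guard described in this PR.
final class LauncherGuard {
    // True when the value is null or the empty string.
    static boolean isNullOrEmpty(String value) {
        return value == null || value.isEmpty();
    }

    // Only forward the properties file when one was actually found.
    static FakeLauncher init(String propertiesFile) {
        FakeLauncher launcher = new FakeLauncher();
        if (!isNullOrEmpty(propertiesFile)) {
            launcher.setPropertiesFile(propertiesFile);
        }
        return launcher;
    }
}
```

   With this shape, `LauncherGuard.init(null)` and `LauncherGuard.init("")` both 
return a launcher without a properties file instead of throwing.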




[GitHub] [incubator-hudi] abhioncbr opened a new pull request #669: HUDI-101: added exclusion filters for signature files.

2019-05-07 Thread GitBox
abhioncbr opened a new pull request #669: HUDI-101: added exclusion filters for 
signature files.
URL: https://github.com/apache/incubator-hudi/pull/669
 
 
   @bvaradar Updated the individual pom files with the exclusion filter. Tested: 
there are no signature files, and the Demo steps still work fine. 
Please review.
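
   One way to repeat the "no signature files" check on a built jar is sketched 
below. This is an illustration only: the class name is hypothetical, and it 
assumes the three patterns from the pom filters (`META-INF/*.SF`, 
`META-INF/*.DSA`, `META-INF/*.RSA`) match entries directly under `META-INF/`, 
compared case-sensitively.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

// Hypothetical helper: lists signature-related entries left in a jar.
final class SignatureFileCheck {
    private static final String PREFIX = "META-INF/";

    // True for entries such as META-INF/FOO.SF, META-INF/FOO.DSA, META-INF/FOO.RSA.
    static boolean isSignatureFile(String entryName) {
        if (!entryName.startsWith(PREFIX)) {
            return false;
        }
        String rest = entryName.substring(PREFIX.length());
        if (rest.contains("/")) {
            return false; // assumption: only files directly under META-INF count
        }
        return rest.endsWith(".SF") || rest.endsWith(".DSA") || rest.endsWith(".RSA");
    }

    // Returns the names of all signature entries found in the given jar.
    static List<String> findSignatureFiles(String jarPath) throws IOException {
        List<String> found = new ArrayList<>();
        try (JarFile jar = new JarFile(jarPath)) {
            Enumeration<JarEntry> entries = jar.entries();
            while (entries.hasMoreElements()) {
                String name = entries.nextElement().getName();
                if (isSignatureFile(name)) {
                    found.add(name);
                }
            }
        }
        return found;
    }
}
```

   An empty list from `findSignatureFiles` on a bundle jar would correspond to 
the result reported above.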




[GitHub] [incubator-hudi] vinothchandar commented on issue #143: Tracking ticket for folks to be added to slack group

2019-05-07 Thread GitBox
vinothchandar commented on issue #143: Tracking ticket for folks to be added to 
slack group
URL: https://github.com/apache/incubator-hudi/issues/143#issuecomment-490118753
 
 
   Done. Also, please join our mailing list: 
https://hudi.apache.org/community.html




[GitHub] [incubator-hudi] vinothchandar commented on issue #666: Add support for dynamic bloom filter to increase efficiency of bloom filter for static sizing

2019-05-07 Thread GitBox
vinothchandar commented on issue #666: Add support for dynamic bloom filter to 
increase efficiency of bloom filter for static sizing
URL: https://github.com/apache/incubator-hudi/pull/666#issuecomment-490112102
 
 
   >there is one on size and both of them are close to 2MB, I actually rounded 
them off to the near megabyte, there may be differences in kilobytes.
   
   Can we test with N=`50` fp=`0.1` and 10x/100x that? I think that 
will produce larger sizes and more false positives. I would be surprised if 
dynamic gives far fewer false positives with the same number of bits. All it 
must be doing is using more bits as more entries come in. You can use something 
like https://krisives.github.io/bloom-calculator/ to design a case around this.
   
   I think we have to do option 1, right? In option 2 we'd also be reading old 
and new files with different filter formats, right? Do we handle an exception 
and detect dynamic vs. normal bf?
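
   To design such a case without the web calculator, the textbook sizing 
formulas can be sketched as below. This is not Hudi's or Hadoop's code: the 
class and method names are hypothetical, only the `LOG2_SQUARED` constant 
mirrors the one in `BloomFilter.java`, and `n = 1000` is an arbitrary 
illustration value. It estimates how the false-positive rate of a statically 
sized filter degrades when 10x/100x the planned entries are inserted.

```java
// Hypothetical helper using the standard Bloom filter formulas:
//   bits   m = ceil(-n * ln(p) / (ln 2)^2)
//   hashes k = round(m / n * ln 2)
//   fpp after i inserts = (1 - e^(-k * i / m))^k
final class BloomSizing {
    static final double LOG2_SQUARED = Math.log(2) * Math.log(2);

    // Bits needed to hold numEntries at the target false-positive probability.
    static long optimalBits(long numEntries, double fpp) {
        return (long) Math.ceil(-numEntries * Math.log(fpp) / LOG2_SQUARED);
    }

    // Number of hash functions that minimizes the false-positive rate.
    static int optimalHashes(long numEntries, long bits) {
        return Math.max(1, (int) Math.round((double) bits / numEntries * Math.log(2)));
    }

    // Expected false-positive rate once `inserted` keys are in a fixed-size filter.
    static double expectedFpp(long inserted, long bits, int hashes) {
        return Math.pow(1 - Math.exp(-(double) hashes * inserted / bits), hashes);
    }

    public static void main(String[] args) {
        long n = 1000;   // arbitrary planned entry count for illustration
        double p = 0.1;  // target false-positive rate
        long bits = optimalBits(n, p);
        int hashes = optimalHashes(n, bits);
        // Overfill the static filter by 1x, 10x and 100x the planned entries.
        for (long inserted : new long[] {n, 10 * n, 100 * n}) {
            System.out.printf("inserted=%d bits=%d hashes=%d fpp=%.4f%n",
                inserted, bits, hashes, expectedFpp(inserted, bits, hashes));
        }
    }
}
```

   A dynamic filter avoids this degradation only by allocating more bits as 
entries arrive, so a fair comparison has to hold the total number of bits equal.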




[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #666: Add support for dynamic bloom filter to increase efficiency of bloom filter for static sizing

2019-05-07 Thread GitBox
vinothchandar commented on a change in pull request #666: Add support for 
dynamic bloom filter to increase efficiency of bloom filter for static sizing
URL: https://github.com/apache/incubator-hudi/pull/666#discussion_r281669970
 
 

 ##
 File path: hoodie-common/src/main/java/com/uber/hoodie/common/BloomFilter.java
 ##
 @@ -37,33 +38,49 @@
*/
   public static final double LOG2_SQUARED = Math.log(2) * Math.log(2);
 
-  private org.apache.hadoop.util.bloom.BloomFilter filter = null;
+  private org.apache.hadoop.util.bloom.BloomFilter bloomFilter = null;
 
-  public BloomFilter(int numEntries, double errorRate) {
-this(numEntries, errorRate, Hash.MURMUR_HASH);
+  private org.apache.hadoop.util.bloom.DynamicBloomFilter dynamicBloomFilter = null;
 
 Review comment:
   
https://github.com/apache/accumulo/blob/master/core/src/main/java/org/apache/accumulo/core/bloomfilter/DynamicBloomFilter.java
 also has one. It'd be good to know the tradeoffs each implementation made, especially HBase and Accumulo.




[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #666: Add support for dynamic bloom filter to increase efficiency of bloom filter for static sizing

2019-05-07 Thread GitBox
vinothchandar commented on a change in pull request #666: Add support for 
dynamic bloom filter to increase efficiency of bloom filter for static sizing
URL: https://github.com/apache/incubator-hudi/pull/666#discussion_r281670956
 
 

 ##
 File path: hoodie-client/src/main/java/com/uber/hoodie/io/storage/HoodieStorageWriterFactory.java
 ##
 @@ -41,7 +41,7 @@
   R extends IndexedRecord> HoodieStorageWriter newParquetStorageWriter(String commitTime, Path path,
   HoodieWriteConfig config, Schema schema, HoodieTable hoodieTable) throws IOException {
 BloomFilter filter = new BloomFilter(config.getBloomFilterNumEntries(),
-config.getBloomFilterFPP());
+config.getBloomFilterFPP(), false);
 
 Review comment:
   @n3nash Without the config here, we would not actually have written dynamic 
filters at all during the tests, right?




[GitHub] [incubator-hudi] vinothchandar edited a comment on issue #666: Add support for dynamic bloom filter to increase efficiency of bloom filter for static sizing

2019-05-07 Thread GitBox
vinothchandar edited a comment on issue #666: Add support for dynamic bloom 
filter to increase efficiency of bloom filter for static sizing
URL: https://github.com/apache/incubator-hudi/pull/666#issuecomment-490112102
 
 
   >there is one on size and both of them are close to 2MB, I actually rounded 
them off to the near megabyte, there may be differences in kilobytes.
   
   Can we test with N=`50` fp=`0.1` and 10x/100x that? I think that 
will produce larger sizes and more false positives. I would be surprised if 
dynamic gives far fewer false positives with the same number of bits. All it 
must be doing is using more bits as more entries come in. You can use something 
like https://krisives.github.io/bloom-calculator/ to design a case around this.
   
   If proven to work, yes, we should enable DynamicBloom by default. I think we 
have to do option 1, right? In option 2 we'd also be reading old and new files 
with different filter formats, right? Do we handle an exception and detect 
dynamic vs. normal bf?




[GitHub] [incubator-hudi] vinothchandar commented on issue #666: Add support for dynamic bloom filter to increase efficiency of bloom filter for static sizing

2019-05-07 Thread GitBox
vinothchandar commented on issue #666: Add support for dynamic bloom filter to 
increase efficiency of bloom filter for static sizing
URL: https://github.com/apache/incubator-hudi/pull/666#issuecomment-490112605
 
 
   Have you serialized both filters to see if you can read a serialized 
DynamicBloomFilter as a BloomFilter? (Wishful thinking, but it could still 
simplify things a lot if true.)




[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #666: Add support for dynamic bloom filter to increase efficiency of bloom filter for static sizing

2019-05-07 Thread GitBox
vinothchandar commented on a change in pull request #666: Add support for 
dynamic bloom filter to increase efficiency of bloom filter for static sizing
URL: https://github.com/apache/incubator-hudi/pull/666#discussion_r281664118
 
 

 ##
 File path: hoodie-common/src/main/java/com/uber/hoodie/common/BloomFilter.java
 ##
 @@ -37,33 +38,49 @@
*/
   public static final double LOG2_SQUARED = Math.log(2) * Math.log(2);
 
-  private org.apache.hadoop.util.bloom.BloomFilter filter = null;
+  private org.apache.hadoop.util.bloom.BloomFilter bloomFilter = null;
 
-  public BloomFilter(int numEntries, double errorRate) {
-this(numEntries, errorRate, Hash.MURMUR_HASH);
+  private org.apache.hadoop.util.bloom.DynamicBloomFilter dynamicBloomFilter = null;
 
 Review comment:
   Accumulo seems to be implementing something as well: 
https://github.com/apache/accumulo/blob/master/core/src/main/java/org/apache/accumulo/core/bloomfilter/DynamicBloomFilter.java




[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #666: Add support for dynamic bloom filter to increase efficiency of bloom filter for static sizing

2019-05-07 Thread GitBox
vinothchandar commented on a change in pull request #666: Add support for 
dynamic bloom filter to increase efficiency of bloom filter for static sizing
URL: https://github.com/apache/incubator-hudi/pull/666#discussion_r281663186
 
 

 ##
 File path: hoodie-client/src/main/java/com/uber/hoodie/config/HoodieIndexConfig.java
 ##
 @@ -50,6 +50,9 @@
   public static final String BLOOM_INDEX_INPUT_STORAGE_LEVEL =
   "hoodie.bloom.index.input.storage" + ".level";
   public static final String DEFAULT_BLOOM_INDEX_INPUT_STORAGE_LEVEL = "MEMORY_AND_DISK_SER";
+  public static final String BLOOM_INDEX_ENABLE_DYNAMIC_PROP =
 
 Review comment:
   Depends on how you look at it. At the code level, it's weird to suddenly 
access an index config in storage. We can leave it here for now, but let's 
rename?

