[incubator-hudi] branch master updated: HUDI-101: added exclusion filters for signature files.
This is an automated email from the ASF dual-hosted git repository.

vbalaji pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-hudi.git

The following commit(s) were added to refs/heads/master by this push:
     new e2dcef8  HUDI-101: added exclusion filters for signature files.
e2dcef8 is described below

commit e2dcef860672c140e4b66091292ff3bee3d20110
Author: Abhishek Sharma
AuthorDate: Tue May 7 15:17:36 2019 -0400

    HUDI-101: added exclusion filters for signature files.
---
 hoodie-utilities/pom.xml                  | 9 +
 packaging/hoodie-hadoop-mr-bundle/pom.xml | 9 +
 packaging/hoodie-hive-bundle/pom.xml      | 9 +
 packaging/hoodie-presto-bundle/pom.xml    | 9 +
 packaging/hoodie-spark-bundle/pom.xml     | 9 +
 5 files changed, 45 insertions(+)

diff --git a/hoodie-utilities/pom.xml b/hoodie-utilities/pom.xml
index 00b853d..15edede 100644
--- a/hoodie-utilities/pom.xml
+++ b/hoodie-utilities/pom.xml
@@ -136,6 +136,15 @@
 com.uber.hoodie.com.esotericsoftware.minlog.
+
+
+
+META-INF/*.SF
+META-INF/*.DSA
+META-INF/*.RSA
+
+
+
diff --git a/packaging/hoodie-hadoop-mr-bundle/pom.xml b/packaging/hoodie-hadoop-mr-bundle/pom.xml
index f25d0f7..fa6c799 100644
--- a/packaging/hoodie-hadoop-mr-bundle/pom.xml
+++ b/packaging/hoodie-hadoop-mr-bundle/pom.xml
@@ -228,6 +228,15 @@
 com.esotericsoftware:minlog
+
+
+
+META-INF/*.SF
+META-INF/*.DSA
+META-INF/*.RSA
+
+
+
 ${project.artifactId}-${project.version}
diff --git a/packaging/hoodie-hive-bundle/pom.xml b/packaging/hoodie-hive-bundle/pom.xml
index 6146236..9c66eff 100644
--- a/packaging/hoodie-hive-bundle/pom.xml
+++ b/packaging/hoodie-hive-bundle/pom.xml
@@ -209,6 +209,15 @@
 org.apache.derby:derby
+
+
+
+META-INF/*.SF
+META-INF/*.DSA
+META-INF/*.RSA
+
+
+
 ${project.artifactId}-${project.version}
diff --git a/packaging/hoodie-presto-bundle/pom.xml b/packaging/hoodie-presto-bundle/pom.xml
index 8405b90..945b9b9 100644
--- a/packaging/hoodie-presto-bundle/pom.xml
+++ b/packaging/hoodie-presto-bundle/pom.xml
@@ -187,6 +187,15 @@
 org.apache.httpcomponents:*
+
+
+
+META-INF/*.SF
+META-INF/*.DSA
+META-INF/*.RSA
+
+
+
 ${project.artifactId}-${project.version}
diff --git a/packaging/hoodie-spark-bundle/pom.xml b/packaging/hoodie-spark-bundle/pom.xml
index f1e3cee..b95554c 100644
--- a/packaging/hoodie-spark-bundle/pom.xml
+++ b/packaging/hoodie-spark-bundle/pom.xml
@@ -177,6 +177,15 @@
 org.apache.spark:*
+
+
+
+META-INF/*.SF
+META-INF/*.DSA
+META-INF/*.RSA
+
+
+
 ${project.artifactId}-${project.version}
[GitHub] [incubator-hudi] bvaradar merged pull request #669: HUDI-101: added exclusion filters for signature files.
bvaradar merged pull request #669: HUDI-101: added exclusion filters for signature files.
URL: https://github.com/apache/incubator-hudi/pull/669

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at: us...@infra.apache.org

With regards,
Apache Git Services
[GitHub] [incubator-hudi] pseudomuto opened a new pull request #670: SparkUtil#initLauncher shouldn't raise when spark-defaults.conf doesn't exist
pseudomuto opened a new pull request #670: SparkUtil#initLauncher shouldn't raise when spark-defaults.conf doesn't exist
URL: https://github.com/apache/incubator-hudi/pull/670

@vinothchandar @n3nash

# The Problem

When attempting to run the `hdfsparquetimport` command from the hoodie-cli, I ran into an issue where it raises when the file _spark-defaults.conf_ is not present.

The `SparkLauncher` has a [check](https://github.com/apache/spark/blob/5e79ae3b40b76e3473288830ab958fc4834dcb33/launcher/src/main/java/org/apache/spark/launcher/AbstractLauncher.java#L45) that raises when this value is `null`, which is the case when using a default Spark download, for example.

# The Solution

Since I couldn't find any reason to require this file to be present, I simply added an `isEmpty` check so that the properties file is only set when the value isn't `null` or `""`.
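The guard described in this pull request can be sketched roughly as follows. This is a minimal illustration, not the actual patch: `FakeLauncher` is a hypothetical stand-in for a launcher whose setter rejects `null`, so the example runs without Spark on the classpath, and `LauncherConfigSketch`, its `isEmpty` helper, and the example path are assumptions made for the sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a launcher whose setter rejects null values.
// It records accepted paths so the effect of the guard can be observed.
final class FakeLauncher {
    final List<String> propertiesFiles = new ArrayList<>();

    FakeLauncher setPropertiesFile(String path) {
        if (path == null) {
            throw new IllegalArgumentException("path must not be null");
        }
        propertiesFiles.add(path);
        return this;
    }
}

final class LauncherConfigSketch {
    // True when the value is null or the empty string.
    static boolean isEmpty(String value) {
        return value == null || value.isEmpty();
    }

    // Forward the properties file only when one was actually resolved.
    static FakeLauncher initLauncher(String propertiesFile) {
        FakeLauncher launcher = new FakeLauncher();
        if (!isEmpty(propertiesFile)) {
            launcher.setPropertiesFile(propertiesFile);
        }
        return launcher;
    }

    public static void main(String[] args) {
        // Neither call below throws, because the setter is skipped.
        System.out.println(initLauncher(null).propertiesFiles);
        System.out.println(initLauncher("").propertiesFiles);
        // A resolved path is passed through unchanged.
        System.out.println(initLauncher("/opt/spark/conf/spark-defaults.conf").propertiesFiles);
    }
}
```

With the guard in place, a missing _spark-defaults.conf_ simply leaves the launcher without an explicit properties file instead of failing at setup.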
[GitHub] [incubator-hudi] abhioncbr opened a new pull request #669: HUDI-101: added exclusion filters for signature files.
abhioncbr opened a new pull request #669: HUDI-101: added exclusion filters for signature files.
URL: https://github.com/apache/incubator-hudi/pull/669

@bvaradar Updated the individual pom files with the exclusion filter. Tested: there are no signature files, and the Demo steps also work fine. Please review.
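One way to repeat the "no signature files" check on a built bundle is to scan the jar's entries for the three excluded patterns. The sketch below is an assumption about how such a check could look, not the test the author ran; `SignatureFileCheck` is a hypothetical name, and the sketch treats `*` as matching a single file name directly under `META-INF/`.

```java
import java.io.IOException;
import java.util.List;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;
import java.util.stream.Collectors;

final class SignatureFileCheck {
    private static final String META_INF = "META-INF/";

    // Mirrors the patterns META-INF/*.SF, META-INF/*.DSA and META-INF/*.RSA.
    // Assumption of this sketch: only entries directly under META-INF/ match.
    static boolean isSignatureEntry(String entryName) {
        if (!entryName.startsWith(META_INF)) {
            return false;
        }
        String fileName = entryName.substring(META_INF.length());
        if (fileName.contains("/")) {
            return false;
        }
        return fileName.endsWith(".SF")
            || fileName.endsWith(".DSA")
            || fileName.endsWith(".RSA");
    }

    // Returns the names of all signature entries found in the given jar.
    static List<String> findSignatureEntries(String jarPath) throws IOException {
        try (JarFile jar = new JarFile(jarPath)) {
            return jar.stream()
                .map(JarEntry::getName)
                .filter(SignatureFileCheck::isSignatureEntry)
                .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: SignatureFileCheck <path-to-jar>");
            return;
        }
        List<String> found = findSignatureEntries(args[0]);
        System.out.println(found.isEmpty()
            ? "no signature files found"
            : "signature files still present: " + found);
    }
}
```

Running it against each bundle jar after packaging should report no signature files if the exclusion filters took effect.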
[GitHub] [incubator-hudi] vinothchandar commented on issue #143: Tracking ticket for folks to be added to slack group
vinothchandar commented on issue #143: Tracking ticket for folks to be added to slack group
URL: https://github.com/apache/incubator-hudi/issues/143#issuecomment-490118753

Done. Also, please join our mailing list: https://hudi.apache.org/community.html
[GitHub] [incubator-hudi] vinothchandar commented on issue #666: Add support for dynamic bloom filter to increase efficiency of bloom filter for static sizing
vinothchandar commented on issue #666: Add support for dynamic bloom filter to increase efficiency of bloom filter for static sizing
URL: https://github.com/apache/incubator-hudi/pull/666#issuecomment-490112102

> there is one on size and both of them are close to 2MB, I actually rounded them off to the near megabyte, there may be differences in kilobytes.

Can we test with N=`50` fp=`0.1` and 10x/100x that? I think that will produce larger sizes and more false positives. I would be surprised if dynamic gives far fewer false positives with the same number of bits; all it must be doing is using more bits as more entries come in. You can use something like https://krisives.github.io/bloom-calculator/ to design a case around this.

I think we have to do option 1, right? In option 2 we'd also be reading old and new files with different filter formats, right? Do we handle an exception to detect dynamic vs. normal bf?
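To design the suggested test case without an online calculator, the textbook Bloom filter sizing formulas can be evaluated directly. The sketch below assumes the standard relations m = ceil(-n ln p / (ln 2)^2) and k = ceil(ln 2 * m / n); `BloomSizingSketch` and its method names are hypothetical, and it says nothing about how the dynamic filter in this pull request sizes itself.

```java
final class BloomSizingSketch {
    static final double LOG2_SQUARED = Math.log(2) * Math.log(2);

    // Bits needed for numEntries keys at the target false positive rate.
    static int optimalBits(int numEntries, double fpp) {
        return (int) Math.ceil(numEntries * (-Math.log(fpp) / LOG2_SQUARED));
    }

    // Number of hash functions for a filter of the given size.
    static int optimalHashes(int numEntries, int bits) {
        return (int) Math.ceil(Math.log(2) * bits / numEntries);
    }

    // Expected false positive rate after `inserted` keys were added to a
    // fixed filter with `bits` bits and `hashes` hash functions.
    static double expectedFpp(int bits, int hashes, long inserted) {
        return Math.pow(1.0 - Math.exp(-(double) hashes * inserted / bits), hashes);
    }

    public static void main(String[] args) {
        double fpp = 0.1;
        int baseEntries = 50;
        int baseBits = optimalBits(baseEntries, fpp);
        int baseHashes = optimalHashes(baseEntries, baseBits);
        // Compare 1x, 10x and 100x the base entry count.
        for (int scale : new int[] {1, 10, 100}) {
            int n = baseEntries * scale;
            int bits = optimalBits(n, fpp);
            System.out.printf(
                "n=%d: correctly sized bits=%d hashes=%d; filter fixed at %d bits has expected fpp=%.4f%n",
                n, bits, optimalHashes(n, bits), baseBits,
                expectedFpp(baseBits, baseHashes, n));
        }
    }
}
```

The last column shows how quickly a filter sized for 50 entries degrades when it receives 10x or 100x as many keys, which is the situation a dynamic filter has to spend extra bits on.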
[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #666: Add support for dynamic bloom filter to increase efficiency of bloom filter for static sizing
vinothchandar commented on a change in pull request #666: Add support for dynamic bloom filter to increase efficiency of bloom filter for static sizing
URL: https://github.com/apache/incubator-hudi/pull/666#discussion_r281669970

## File path: hoodie-common/src/main/java/com/uber/hoodie/common/BloomFilter.java ##
@@ -37,33 +38,49 @@
    */
   public static final double LOG2_SQUARED = Math.log(2) * Math.log(2);
-  private org.apache.hadoop.util.bloom.BloomFilter filter = null;
+  private org.apache.hadoop.util.bloom.BloomFilter bloomFilter = null;
-  public BloomFilter(int numEntries, double errorRate) {
-    this(numEntries, errorRate, Hash.MURMUR_HASH);
+  private org.apache.hadoop.util.bloom.DynamicBloomFilter dynamicBloomFilter = null;

Review comment:
https://github.com/apache/accumulo/blob/master/core/src/main/java/org/apache/accumulo/core/bloomfilter/DynamicBloomFilter.java also has one. It'd be good to know the tradeoffs each made, especially HBase and Accumulo.
[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #666: Add support for dynamic bloom filter to increase efficiency of bloom filter for static sizing
vinothchandar commented on a change in pull request #666: Add support for dynamic bloom filter to increase efficiency of bloom filter for static sizing
URL: https://github.com/apache/incubator-hudi/pull/666#discussion_r281670956

## File path: hoodie-client/src/main/java/com/uber/hoodie/io/storage/HoodieStorageWriterFactory.java ##
@@ -41,7 +41,7 @@
 R extends IndexedRecord> HoodieStorageWriter newParquetStorageWriter(String commitTime, Path path, HoodieWriteConfig config, Schema schema, HoodieTable hoodieTable) throws IOException {
     BloomFilter filter = new BloomFilter(config.getBloomFilterNumEntries(),
-        config.getBloomFilterFPP());
+        config.getBloomFilterFPP(), false);

Review comment:
@n3nash without the config here, we would not actually have written dynamic filters at all during the tests?
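The concern in this review comment is that a hard-coded `false` means the dynamic code path is never exercised on write. A minimal sketch of threading the flag through from configuration is shown below. Everything in it is hypothetical: `FilterConfigSketch`, `SketchFilter`, and the property key `example.bloom.filter.dynamic.enabled` are placeholders for illustration and are not the names used in the pull request.

```java
import java.util.Properties;

final class FilterConfigSketch {
    // Hypothetical property key and default, chosen for this sketch only.
    static final String ENABLE_DYNAMIC_PROP = "example.bloom.filter.dynamic.enabled";
    static final String DEFAULT_ENABLE_DYNAMIC = "false";

    // Stand-in for the filter wrapper under review; it only records its inputs.
    static final class SketchFilter {
        final int numEntries;
        final double fpp;
        final boolean dynamic;

        SketchFilter(int numEntries, double fpp, boolean dynamic) {
            this.numEntries = numEntries;
            this.fpp = fpp;
            this.dynamic = dynamic;
        }
    }

    // Reads the flag from configuration, falling back to the default.
    static boolean isDynamicEnabled(Properties props) {
        return Boolean.parseBoolean(
            props.getProperty(ENABLE_DYNAMIC_PROP, DEFAULT_ENABLE_DYNAMIC));
    }

    // Passes the configured flag instead of a hard-coded literal.
    static SketchFilter newFilter(Properties props, int numEntries, double fpp) {
        return new SketchFilter(numEntries, fpp, isDynamicEnabled(props));
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        System.out.println("default dynamic=" + newFilter(props, 50, 0.1).dynamic);
        props.setProperty(ENABLE_DYNAMIC_PROP, "true");
        System.out.println("configured dynamic=" + newFilter(props, 50, 0.1).dynamic);
    }
}
```

With the flag read from configuration, a test can enable it and confirm that a dynamic filter was actually written.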
[GitHub] [incubator-hudi] vinothchandar edited a comment on issue #666: Add support for dynamic bloom filter to increase efficiency of bloom filter for static sizing
vinothchandar edited a comment on issue #666: Add support for dynamic bloom filter to increase efficiency of bloom filter for static sizing
URL: https://github.com/apache/incubator-hudi/pull/666#issuecomment-490112102

> there is one on size and both of them are close to 2MB, I actually rounded them off to the near megabyte, there may be differences in kilobytes.

Can we test with N=`50` fp=`0.1` and 10x/100x that? I think that will produce larger sizes and more false positives. I would be surprised if dynamic gives far fewer false positives with the same number of bits; all it must be doing is using more bits as more entries come in. You can use something like https://krisives.github.io/bloom-calculator/ to design a case around this. If proven to work, yes, we should enable DynamicBloom by default.

I think we have to do option 1, right? In option 2 we'd also be reading old and new files with different filter formats, right? Do we handle an exception to detect dynamic vs. normal bf?
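On the question of telling filter formats apart when reading, one alternative to catching an exception is to write an explicit marker ahead of the serialized filter. The sketch below is purely illustrative and assumes newly written files could carry such a marker; the one-byte tags, `FilterFormatSketch`, and its methods are hypothetical and not part of any existing on-disk format. Files already written without a marker would still need a separate fallback, which this sketch does not cover.

```java
import java.nio.ByteBuffer;

final class FilterFormatSketch {
    enum Kind { SIMPLE, DYNAMIC }

    // Hypothetical one-byte markers, invented for this sketch.
    static final byte SIMPLE_TAG = 0;
    static final byte DYNAMIC_TAG = 1;

    // Layout: [tag: 1 byte][payload length: 4 bytes][payload bytes].
    static byte[] encode(Kind kind, byte[] payload) {
        ByteBuffer buffer = ByteBuffer.allocate(1 + 4 + payload.length);
        buffer.put(kind == Kind.DYNAMIC ? DYNAMIC_TAG : SIMPLE_TAG);
        buffer.putInt(payload.length);
        buffer.put(payload);
        return buffer.array();
    }

    // Reads only the leading tag to decide which decoder to use.
    static Kind detect(byte[] serialized) {
        if (serialized.length == 0) {
            throw new IllegalArgumentException("empty filter bytes");
        }
        switch (serialized[0]) {
            case SIMPLE_TAG:
                return Kind.SIMPLE;
            case DYNAMIC_TAG:
                return Kind.DYNAMIC;
            default:
                throw new IllegalArgumentException("unknown filter tag: " + serialized[0]);
        }
    }

    // Extracts the payload that follows the tag and length prefix.
    static byte[] payload(byte[] serialized) {
        ByteBuffer buffer = ByteBuffer.wrap(serialized);
        buffer.get(); // skip the tag
        byte[] out = new byte[buffer.getInt()];
        buffer.get(out);
        return out;
    }

    public static void main(String[] args) {
        byte[] bytes = encode(Kind.DYNAMIC, new byte[] {1, 2, 3});
        System.out.println(detect(bytes) + " with " + payload(bytes).length + " payload bytes");
    }
}
```

The design choice here is that the reader decides up front which decoder to use, rather than attempting one decoder and recovering from a failure.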
[GitHub] [incubator-hudi] vinothchandar commented on issue #666: Add support for dynamic bloom filter to increase efficiency of bloom filter for static sizing
vinothchandar commented on issue #666: Add support for dynamic bloom filter to increase efficiency of bloom filter for static sizing
URL: https://github.com/apache/incubator-hudi/pull/666#issuecomment-490112605

Have you serialized both filters to see if you can read a serialized DynamicBloomFilter as a BloomFilter? (Wishful thinking, but it could still simplify things a lot if true.)
[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #666: Add support for dynamic bloom filter to increase efficiency of bloom filter for static sizing
vinothchandar commented on a change in pull request #666: Add support for dynamic bloom filter to increase efficiency of bloom filter for static sizing
URL: https://github.com/apache/incubator-hudi/pull/666#discussion_r281664118

## File path: hoodie-common/src/main/java/com/uber/hoodie/common/BloomFilter.java ##
@@ -37,33 +38,49 @@
    */
   public static final double LOG2_SQUARED = Math.log(2) * Math.log(2);
-  private org.apache.hadoop.util.bloom.BloomFilter filter = null;
+  private org.apache.hadoop.util.bloom.BloomFilter bloomFilter = null;
-  public BloomFilter(int numEntries, double errorRate) {
-    this(numEntries, errorRate, Hash.MURMUR_HASH);
+  private org.apache.hadoop.util.bloom.DynamicBloomFilter dynamicBloomFilter = null;

Review comment:
Accumulo seems to be implementing something as well: https://github.com/apache/accumulo/blob/master/core/src/main/java/org/apache/accumulo/core/bloomfilter/DynamicBloomFilter.java
[GitHub] [incubator-hudi] vinothchandar commented on a change in pull request #666: Add support for dynamic bloom filter to increase efficiency of bloom filter for static sizing
vinothchandar commented on a change in pull request #666: Add support for dynamic bloom filter to increase efficiency of bloom filter for static sizing
URL: https://github.com/apache/incubator-hudi/pull/666#discussion_r281663186

## File path: hoodie-client/src/main/java/com/uber/hoodie/config/HoodieIndexConfig.java ##
@@ -50,6 +50,9 @@
   public static final String BLOOM_INDEX_INPUT_STORAGE_LEVEL = "hoodie.bloom.index.input.storage" + ".level";
   public static final String DEFAULT_BLOOM_INDEX_INPUT_STORAGE_LEVEL = "MEMORY_AND_DISK_SER";
+  public static final String BLOOM_INDEX_ENABLE_DYNAMIC_PROP =

Review comment:
Depends on how you look at it. At the code level, it's weird to suddenly access an index config in storage. We can leave it here for now, but let's rename?