[ https://issues.apache.org/jira/browse/APEXMALHAR-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16040534#comment-16040534 ]
ASF GitHub Bot commented on APEXMALHAR-2366:
--------------------------------------------

GitHub user PramodSSImmaneni reopened a pull request:

    https://github.com/apache/apex-malhar/pull/631

    APEXMALHAR-2366 #resolve #comment Apply BloomFilter to Bucket, use internal BloomFilter

    @bhupeshchawda please see, this is to finish up the work started in https://github.com/apache/apex-malhar/pull/521

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/PramodSSImmaneni/apex-malhar APEXMALHAR-2366

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/apex-malhar/pull/631.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #631

----
commit 3c3a01777329252aaa46a39e52ab9a190dbfb74f
Author: brightchen <bri...@datatorrent.com>
Date:   2016-12-05T19:34:48Z

    APEXMALHAR-2366 #resolve #comment Apply BloomFilter to Bucket, use internal BloomFilter

commit e08ccd091ff23eac38ecf7997230262c772cbdc1
Author: Pramod Immaneni <pra...@datatorrent.com>
Date:   2017-06-01T21:02:26Z

    Added license references, this closes #521

----

> Apply BloomFilter to Bucket
> ---------------------------
>
>                 Key: APEXMALHAR-2366
>                 URL: https://issues.apache.org/jira/browse/APEXMALHAR-2366
>             Project: Apache Apex Malhar
>          Issue Type: Improvement
>            Reporter: bright chen
>            Assignee: bright chen
>   Original Estimate: 192h
>  Remaining Estimate: 192h
>
> The bucket's get() checks the cache first and, if the entry is not in the
> cache, checks the stored files. Checking the files is a heavy operation
> because of the file seeks involved.
> The chance of falling through to a file check is very high when the key
> range is large.
> I suggest applying a BloomFilter to each bucket to reduce the chance of
> reading from a file.
> If the buckets are managed by ManagedStateImpl, the number of entries per
> bucket can grow very large and the BloomFilter may stop being useful after
> a while. But if the buckets are managed by ManagedTimeUnifiedStateImpl,
> each bucket keeps a bounded number of entries and the BloomFilter would be
> very useful.
> For implementation:
> Guava already has a BloomFilter, and its interface is simple and fits our
> case. But Guava 11 is not compatible with Guava 14 (Guava 11 uses Sink
> while Guava 14 uses PrimitiveSink).

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
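
For readers following the issue description, the lookup path it proposes (cache first, then a BloomFilter check, then the file read only if the filter says the key may be present) can be sketched roughly as below. This is only an illustrative sketch, not the code in the pull request: `SimpleBloomFilter`, `SketchBucket`, the `fileStore` map standing in for on-disk files, and the sizing constants are all hypothetical names and values chosen for the example.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical minimal Bloom filter; not the Malhar or Guava implementation. */
final class SimpleBloomFilter {
  private final BitSet bits;
  private final int numBits;
  private final int numHashes;

  SimpleBloomFilter(int numBits, int numHashes) {
    this.bits = new BitSet(numBits);
    this.numBits = numBits;
    this.numHashes = numHashes;
  }

  // Double hashing: index_i = h1 + i * h2, reduced modulo numBits.
  private int index(String key, int i) {
    int h1 = key.hashCode();
    int h2 = (h1 >>> 16) | 1; // odd step derived from the high bits
    return Math.floorMod(h1 + i * h2, numBits);
  }

  void put(String key) {
    for (int i = 0; i < numHashes; i++) {
      bits.set(index(key, i));
    }
  }

  /** False means definitely absent; true means possibly present. */
  boolean mightContain(String key) {
    for (int i = 0; i < numHashes; i++) {
      if (!bits.get(index(key, i))) {
        return false;
      }
    }
    return true;
  }
}

/** Hypothetical bucket: an in-memory cache in front of a slow "file" store. */
final class SketchBucket {
  private final Map<String, String> cache = new HashMap<>();
  // Stands in for the on-disk bucket files (assumption for this sketch).
  private final Map<String, String> fileStore = new HashMap<>();
  private final SimpleBloomFilter filter = new SimpleBloomFilter(1 << 16, 3);
  int fileReads = 0; // counts simulated file seeks

  void put(String key, String value) {
    filter.put(key);
    fileStore.put(key, value);
  }

  String get(String key) {
    String cached = cache.get(key);
    if (cached != null) {
      return cached;
    }
    if (!filter.mightContain(key)) {
      return null; // definitely not stored: skip the expensive file read
    }
    fileReads++;
    String value = fileStore.get(key);
    if (value != null) {
      cache.put(key, value);
    }
    return value;
  }
}
```

A filter with a fixed number of bits fills up as entries accumulate, so its false-positive rate rises. That is consistent with the description's point that the filter helps most when each bucket holds a bounded number of entries.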