[ https://issues.apache.org/jira/browse/HUDI-686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17062761#comment-17062761 ]
lamber-ken commented on HUDI-686:
---------------------------------

[~vinoth] thanks for bringing up this new idea. Here are some concerns to consider:

1. +candidates+ may cause OOM. We can increase the number of partitions to work around it, but that may hurt the user experience, because users would then need to think about it.
{quote}List<Pair<HoodieRecord<T>, String>> candidates = new ArrayList<>();
{quote}
2. +fileIDToBloomFilter+ is an external map that spills content to disk, so we need to think about the serialization/deserialization performance.
{quote}this.fileIDToBloomFilter = new ExternalSpillableMap<>(1000000000L ...)
BloomFilter filter = fileIDToBloomFilter.get(partitionFileIdPair.getRight());
{quote}

[https://github.com/vinothchandar/incubator-hudi/blob/hudi-686-bloomindex-v2/hudi-client/src/main/java/org/apache/hudi/index/bloom/HoodieBloomIndexV2.java]
{code:java}
@Override
protected List<Pair<HoodieRecord<T>, String>> computeNext() {
  List<Pair<HoodieRecord<T>, String>> candidates = new ArrayList<>();
  if (inputItr.hasNext()) {
    HoodieRecord<T> record = inputItr.next();
    try {
      initIfNeeded(record.getPartitionPath());
    } catch (IOException e) {
      throw new HoodieIOException(
          "Error reading index metadata for " + record.getPartitionPath(), e);
    }

    indexFileFilter
        .getMatchingFilesAndPartition(record.getPartitionPath(), record.getRecordKey())
        .forEach(partitionFileIdPair -> {
          BloomFilter filter = fileIDToBloomFilter.get(partitionFileIdPair.getRight());
          if (filter.mightContain(record.getRecordKey())) {
            candidates.add(Pair.of(record, partitionFileIdPair.getRight()));
          }
        });

    if (candidates.size() == 0) {
      candidates.add(Pair.of(record, ""));
    }
  }

  return candidates;
}
{code}

> Implement BloomIndexV2 that does not depend on memory caching
> -------------------------------------------------------------
>
>                 Key: HUDI-686
>                 URL: https://issues.apache.org/jira/browse/HUDI-686
>             Project: Apache Hudi (incubating)
>          Issue Type: Improvement
>          Components: Index, Performance
>            Reporter: Vinoth Chandar
>            Assignee: Vinoth Chandar
>            Priority: Major
>             Fix For: 0.6.0
>
>
> Main goal here is to provide a much simpler index, without advanced
> optimizations like auto tuned parallelism/skew handling but a better
> out-of-box experience for small workloads.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
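
To make the two concerns in the comment easier to reason about, here is a minimal, self-contained sketch of the per-record lookup that the quoted computeNext() performs. It is not the Hudi implementation: ToyBloomFilter, CandidateLookupSketch and candidatesFor are hypothetical names invented for illustration, record keys are plain strings, and an in-memory LinkedHashMap stands in for ExternalSpillableMap, so no spilling or serialization happens here.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CandidateLookupSketch {

  /** Toy Bloom filter (hypothetical stand-in, NOT Hudi's BloomFilter). */
  static final class ToyBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    ToyBloomFilter(int numBits, int numHashes) {
      this.bits = new BitSet(numBits);
      this.numBits = numBits;
      this.numHashes = numHashes;
    }

    // Double hashing: the i-th bit position is derived from two base hashes.
    private int position(String key, int i) {
      int h1 = key.hashCode();
      int h2 = (h1 >>> 16) | 1; // force an odd step
      return Math.floorMod(h1 + i * h2, numBits);
    }

    void add(String key) {
      for (int i = 0; i < numHashes; i++) {
        bits.set(position(key, i));
      }
    }

    // May return false positives, never false negatives.
    boolean mightContain(String key) {
      for (int i = 0; i < numHashes; i++) {
        if (!bits.get(position(key, i))) {
          return false;
        }
      }
      return true;
    }
  }

  /**
   * Mirrors the shape of the quoted lookup: one (recordKey, fileId) pair per
   * file whose filter may contain the key, or (recordKey, "") if none match.
   */
  static List<Map.Entry<String, String>> candidatesFor(
      String recordKey, Map<String, ToyBloomFilter> fileIdToFilter) {
    List<Map.Entry<String, String>> candidates = new ArrayList<>();
    for (Map.Entry<String, ToyBloomFilter> e : fileIdToFilter.entrySet()) {
      if (e.getValue().mightContain(recordKey)) {
        candidates.add(Map.entry(recordKey, e.getKey()));
      }
    }
    if (candidates.isEmpty()) {
      candidates.add(Map.entry(recordKey, ""));
    }
    return candidates;
  }

  public static void main(String[] args) {
    ToyBloomFilter fileA = new ToyBloomFilter(1024, 3);
    fileA.add("key-1");
    ToyBloomFilter fileB = new ToyBloomFilter(1024, 3); // nothing added

    Map<String, ToyBloomFilter> filters = new LinkedHashMap<>();
    filters.put("file-a", fileA);
    filters.put("file-b", fileB);

    System.out.println(candidatesFor("key-1", filters));
  }
}
```

In this sketch the candidate list grows with the number of files whose filter reports a possible match for the key, which is the quantity concern 1 is about. Concern 2 does not show up here at all: with a spillable map, each get() for an entry that has been spilled to disk would additionally pay a deserialization cost, so that part would have to be measured against the real ExternalSpillableMap.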