huaxingao commented on a change in pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923#discussion_r714379682



##########
File path: 
parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/ColumnIndexBuilder.java
##########
@@ -293,8 +290,48 @@ boolean isNullPage(int pageIndex) {
 
     @Override
     public <T extends Comparable<T>> PrimitiveIterator.OfInt visit(In<T> in) {
-      IntSet indexes = getMatchingIndexes(in);
-      return IndexIterator.filter(getPageCount(), indexes::contains);
+      Set<T> values = in.getValues();
+      TreeSet<T> myTreeSet = new TreeSet<>();
+      IntSet matchingIndexes1 = new IntOpenHashSet();  // for null
+      Iterator<T> it = values.iterator();
+      while(it.hasNext()) {
+        T value = it.next();
+        if (value != null) {
+          myTreeSet.add(value);
+        } else {
+          if (nullCounts == null) {
+            // Searching for nulls so if we don't have null related statistics 
we have to return all pages
+            return IndexIterator.all(getPageCount());
+          } else {
+            for (int i = 0; i < nullCounts.length; i++) {
+              if (nullCounts[i] > 0) {
+                matchingIndexes1.add(i);
+              }
+            }
+          }
+        }
+      }
+
+      IntSet matchingIndexes2 = new IntOpenHashSet();

Review comment:
       @gszadovszky Sorry for the delay. I did the following changes:
   1. Removed `TreeSet` to avoid sort the whole set, and added a method to get 
max and min.
   2. Updated `StatisticsFilter` to have the similar range comparison.
   3. I actually didn't change `notIn` because I don't think range checking 
works with `notIn`: if not in the range, `notIn` is true, but if in the range, 
it doesn't mean `notIn` is false.  For example, if we have `1, 2, 3, 6, 7 8, 9, 
10` and the `notIn` predicate has set values `4, 5`, `4, 5` is in the range of 
1 to 10, but `1, 2, 3, 6, 7 8, 9, 10` doesn't contain `4, 5`. In 
`StatisticsFilter` I simply return `BLOCK_MIGHT_MATCH;` for `notIn`. I probably 
should return `IndexIterator.all(getPageCount());` in `ColumnIndexBuilder` to 
be consistent with `StatisticsFilter`.

##########
File path: 
parquet-generator/src/main/java/org/apache/parquet/filter2/IncrementallyUpdatedFilterPredicateGenerator.java
##########
@@ -1,14 +1,14 @@
-/* 

Review comment:
       @gszadovszky I added a test in `TestRecordLevelFilters.java` to test the 
new methods in the generated class 
`IncrementallyUpdatedFilterPredicateBuilder`. I have added tests in 
`TestColumnIndexFiltering` and `TestBloomFiltering` in my original changes. Do 
I need more tests in these two files?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to