[ 
https://issues.apache.org/jira/browse/PARQUET-1968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17404591#comment-17404591
 ] 

ASF GitHub Bot commented on PARQUET-1968:
-----------------------------------------

gszadovszky commented on a change in pull request #923:
URL: https://github.com/apache/parquet-mr/pull/923#discussion_r695929108



##########
File path: pom.xml
##########
@@ -478,6 +478,7 @@
                 change to fix a integer overflow issue.
                 TODO: remove this after Parquet 1.13 release -->
               
<exclude>org.apache.parquet.column.values.dictionary.DictionaryValuesWriter#dictionaryByteSize</exclude>
+              
<exclude>org.apache.parquet.filter2.predicate.FilterPredicate</exclude>

Review comment:
       However, this is not what the filter2 API is designed for; technically it is 
possible for users to implement this interface, and by adding new methods to it 
we really do break our API.
   What do you think about adding default implementations that throw 
`UnsupportedOperationException`? That way we would not need to add this class here.
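
   The suggestion above can be sketched roughly as follows. This is a minimal, hypothetical example (the interface name `FilterPredicateLike` and its methods are placeholders, not the real `FilterPredicate` API): a new method added to a public interface gets a default body that throws, so existing user implementations stay source- and binary-compatible.

```java
// Hypothetical sketch: evolving a public interface without breaking
// existing user implementations, by giving new methods default bodies
// that throw UnsupportedOperationException.
interface FilterPredicateLike {
    // Existing method: user implementations already provide this.
    boolean acceptExisting();

    // Newly added method: the default body means old implementations
    // still compile and load; only callers of the new feature see the
    // exception.
    default boolean acceptNew() {
        throw new UnsupportedOperationException(
            "acceptNew() is not implemented by " + getClass().getName());
    }
}

public class DefaultMethodSketch {
    public static void main(String[] args) {
        // An "old" user implementation that predates acceptNew().
        FilterPredicateLike legacy = () -> true;
        System.out.println(legacy.acceptExisting()); // prints "true"
        try {
            legacy.acceptNew();
        } catch (UnsupportedOperationException e) {
            System.out.println("unsupported"); // prints "unsupported"
        }
    }
}
```

   With defaults in place, the interface no longer needs an entry in the compatibility-check exclude list in `pom.xml`.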

##########
File path: 
parquet-column/src/main/java/org/apache/parquet/filter2/recordlevel/IncrementallyUpdatedFilterPredicate.java
##########
@@ -123,6 +124,46 @@ public boolean accept(Visitor visitor) {
     }
   }
 
+  abstract class SetInspector implements IncrementallyUpdatedFilterPredicate {

Review comment:
       I know it is quite a mess to add new stuff into 
`IncrementallyUpdatedFilterPredicateBuilder` (via 
`IncrementallyUpdatedFilterPredicateGenerator`), but currently you are 
checking the values one by one even though you have a hash set. It could be 
faster if the related visit methods were generated in 
`IncrementallyUpdatedFilterPredicateBuilder` just like for the other predicates, 
and a hash-search algorithm were used. What do you think?
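
   The contrast being drawn is roughly the following. This is an illustrative sketch only (the class name `SetInspectorSketch` and its methods are placeholders, not the real generated code): a generated visit/update method would do a single O(1) `HashSet` membership test per value, instead of a linear scan over the set's members.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of an "in"-predicate inspector: the predicate's
// values live in a HashSet, so each incoming value is decided with one
// hash lookup rather than a one-by-one comparison against every member.
public class SetInspectorSketch {
    private final Set<Integer> values;
    private boolean result;

    public SetInspectorSketch(Set<Integer> values) {
        this.values = values;
    }

    // Analogous to a generated visit/update method: O(1) membership
    // test instead of a linear scan.
    public void update(int value) {
        result = values.contains(value);
    }

    public boolean getResult() {
        return result;
    }

    public static void main(String[] args) {
        Set<Integer> in = new HashSet<>();
        in.add(7);
        in.add(42);
        SetInspectorSketch inspector = new SetInspectorSketch(in);
        inspector.update(42);
        System.out.println(inspector.getResult()); // prints "true"
        inspector.update(5);
        System.out.println(inspector.getResult()); // prints "false"
    }
}
```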

##########
File path: 
parquet-column/src/main/java/org/apache/parquet/internal/column/columnindex/ColumnIndexBuilder.java
##########
@@ -287,6 +291,27 @@ boolean isNullPage(int pageIndex) {
           pageIndex -> nullCounts[pageIndex] > 0 || 
matchingIndexes.contains(pageIndex));
     }
 
+    @Override

Review comment:
       I am not sure how much it affects performance in real life (e.g. how many 
values are in the set of the in/notIn predicate and how many pages we have), 
but it can be done in a smarter way. Column indexes are min/max values for the 
pages. If we sort the values in the set, we can do two logarithmic searches (one 
for the min, then one for the max) to decide whether a page might contain a 
value. If the column indexes themselves have "sorted" min/max values, we can be 
even faster. (See `BoundaryOrder` for more details.)
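
   As a rough illustration of the idea (the class and method names here are hypothetical, not the actual `ColumnIndexBuilder` code): once the predicate's values are sorted, a binary search for the page's min gives the first candidate value, and comparing that candidate against the page's max decides whether the page can match at all. A single lower-bound search is enough to answer the question; this sketch simplifies accordingly.

```java
import java.util.Arrays;

// Hypothetical sketch: sort the in-predicate's values once, then for
// each page decide in O(log n) whether any value can fall inside the
// page's [min, max] range taken from the column index.
public class PageRangeCheckSketch {
    // Returns true if some value in sortedValues lies in [min, max],
    // i.e. the page might contain a matching value.
    static boolean pageMightMatch(long[] sortedValues, long min, long max) {
        // Find the first value >= min via binary search.
        int lo = Arrays.binarySearch(sortedValues, min);
        if (lo < 0) {
            lo = -lo - 1; // not found: convert to insertion point
        }
        // The page might match iff that candidate also fits under max.
        return lo < sortedValues.length && sortedValues[lo] <= max;
    }

    public static void main(String[] args) {
        long[] values = {3, 10, 25};
        Arrays.sort(values); // sort once, reuse for every page
        System.out.println(pageMightMatch(values, 8, 12));  // prints "true"  (10 is in range)
        System.out.println(pageMightMatch(values, 11, 20)); // prints "false" (nothing in range)
    }
}
```

   When the column index's own min/max values are ordered (see `BoundaryOrder`), the set of candidate pages itself can be narrowed with binary search instead of scanning every page.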




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@parquet.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> FilterApi support In predicate
> ------------------------------
>
>                 Key: PARQUET-1968
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1968
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>    Affects Versions: 1.12.0
>            Reporter: Yuming Wang
>            Priority: Major
>
> FilterApi should support native In predicate.
> Spark:
> https://github.com/apache/spark/blob/d6a68e0b67ff7de58073c176dd097070e88ac831/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L600-L605
> Impala:
> https://issues.apache.org/jira/browse/IMPALA-3654



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
