[ https://issues.apache.org/jira/browse/SPARK-3711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152878#comment-14152878 ]
Yash Datta commented on SPARK-3711:
-----------------------------------

On a 2-node setup, each machine having 24 cores and 96 GB of memory, invoking spark-sql with:

    ./bin/spark-sql --executor-memory 16G --driver-memory 8G --master <url>

and executing a filter query on a Parquet table with 47750544 rows and ~1000 filter values (the selected column is unique for each row):

    select * from <table> where <column> in (A1,A2....A1000);

Time taken (after multiple runs):

    on spark-1.1:     ~90 seconds
    after the patch:  ~7 seconds

> Optimize where in clause filter queries
> ---------------------------------------
>
>                 Key: SPARK-3711
>                 URL: https://issues.apache.org/jira/browse/SPARK-3711
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.1.0
>            Reporter: Yash Datta
>            Priority: Minor
>             Fix For: 1.1.1
>
>
> The In case class is replaced by an InSet class when all the filters are
> literals. InSet uses a hash set instead of a Sequence, thereby giving a
> significant performance improvement. The largest improvement should be
> visible when a small percentage of a large data set matches the filter list.
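
The description attributes the speedup to checking each row against a hash set rather than scanning a sequence of literals. The following is a minimal Python sketch of that idea only, not Spark's Scala implementation of In or InSet; the function names and the sample data are hypothetical and chosen purely for illustration.

```python
from typing import Iterable, List


def filter_with_sequence(rows: Iterable[int], values: Iterable[int]) -> List[int]:
    """Keep rows present in `values` by scanning a list for every row.

    Each membership test is a linear scan, so the cost per row grows
    with the number of literals in the IN list.
    """
    literals = list(values)
    return [row for row in rows if row in literals]


def filter_with_hash_set(rows: Iterable[int], values: Iterable[int]) -> List[int]:
    """Keep rows present in `values` using a hash set built once up front.

    Each membership test is a hash lookup, constant time on average,
    regardless of how many literals the IN list holds.
    """
    literals = frozenset(values)
    return [row for row in rows if row in literals]


if __name__ == "__main__":
    # Hypothetical data: a unique column and a 1000-literal IN list
    # of which only a small fraction matches any row.
    rows = range(10_000)
    in_list = range(0, 100_000, 100)

    by_scan = filter_with_sequence(rows, in_list)
    by_hash = filter_with_hash_set(rows, in_list)

    # Both strategies must select exactly the same rows.
    assert by_scan == by_hash
    print(len(by_hash))
```

Both functions return the same rows; only the cost of each membership test differs. With few matching rows, almost every sequence scan runs to the end of the list before failing, which is consistent with the description's remark that the gain is largest when a small percentage of the data matches.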