Github user wangyum commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21603#discussion_r197011649
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
    @@ -270,6 +270,11 @@ private[parquet] class ParquetFilters(pushDownDate: Boolean) {
           case sources.Not(pred) =>
             createFilter(schema, pred).map(FilterApi.not)
     
    +      case sources.In(name, values) if canMakeFilterOn(name) && values.length < 20 =>
    --- End diff --
    
    The threshold is **20**. Too many `values` can cause an OOM, for example:
    ```scala
    spark.range(10000000).coalesce(1).write.option("parquet.block.size", 1048576).parquet("/tmp/spark/parquet/SPARK-17091")
    val df = spark.read.parquet("/tmp/spark/parquet/SPARK-17091/")
    df.where(s"id in(${Range(1, 10000).mkString(",")})").count
    ```
    ```
    Exception in thread "SIGINT handler" 18/06/21 13:00:54 WARN TaskSetManager: Lost task 7.0 in stage 1.0 (TID 8, localhost, executor driver): java.lang.OutOfMemoryError: Java heap space
            at java.util.Arrays.copyOfRange(Arrays.java:3664)
            at java.lang.String.<init>(String.java:207)
            at java.lang.StringBuilder.toString(StringBuilder.java:407)
            at org.apache.parquet.filter2.predicate.Operators$BinaryLogicalFilterPredicate.<init>(Operators.java:263)
            at org.apache.parquet.filter2.predicate.Operators$Or.<init>(Operators.java:316)
            at org.apache.parquet.filter2.predicate.FilterApi.or(FilterApi.java:261)
            at org.apache.spark.sql.execution.datasources.parquet.ParquetFilters$$anonfun$createFilter$15.apply(ParquetFilters.scala:276)
            at org.apache.spark.sql.execution.datasources.parquet.ParquetFilters$$anonfun$createFilter$15.apply(ParquetFilters.scala:276)
    ...
    ```
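    The stack trace points at the mechanism: `FilterApi.or` folds the value list into a left-deep chain of `Or` predicates, and Parquet's binary logical operators build their `toString` eagerly in the constructor (`Operators.java:263`), so each `Or` node copies the string of everything beneath it. A hypothetical sketch (not Spark's actual code; `predicateString` is an illustrative name) of how that accumulation grows quadratically with the number of values:
    ```scala
    // Simulate the left-deep Or chain FilterApi.or would build from an In filter.
    // Each step re-copies the accumulated string, so for n values the total
    // characters copied grow as O(n^2) -- with ~10000 values this dominates
    // the heap, which is why the push-down is gated behind a small threshold.
    def predicateString(values: Seq[Int]): String =
      values.map(v => s"eq(id, $v)").reduceLeft((acc, p) => s"or($acc, $p)")

    // Three values already show the nesting shape:
    // or(or(eq(id, 1), eq(id, 2)), eq(id, 3))
    val sample = predicateString(Seq(1, 2, 3))
    ```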


---
