Re: [Spark SQL] Improving query Performance by increasing spark.sql.parquet.pushdown.inFilterThreshold config

Ángel Álvarez Pascua Wed, 01 Oct 2025 12:34:45 -0700

The IN clause tends to have a limit on the number of values it accepts (the
limit depends on the data source). I'm not sure whether the same applies to
concatenated ORs.


On Wed, 1 Oct 2025, 20:48, Asif Shahid <[email protected]> wrote:

> My take:
> OR will result in inlining of the OR conditions, which means no Map
> lookup. So I suppose it would save the memory associated with Map creation
> (and that, I suppose, is per partition) as well as the lookup costs incurred
> when the filter is implemented using IN.
> Maybe there are other reasons which I do not know...
>
> Regards
> Asif
>
> On Tue, Sep 30, 2025 at 1:37 PM Yian Liou <[email protected]>
> wrote:
>
>> Hi everyone,
>>
>> I am looking to increase the value of the config
>> spark.sql.parquet.pushdown.inFilterThreshold to boost performance for some
>> queries I am working on. I looked at the implementation in the Spark
>> repo at
>> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L798
>> which contains the following code snippet:
>>
>> case sources.In(name, values) if pushDownInFilterThreshold > 0 && values.nonEmpty &&
>>     canMakeFilterOn(name, values.head) =>
>>   val fieldType = nameToParquetField(name).fieldType
>>   val fieldNames = nameToParquetField(name).fieldNames
>>   if (values.length <= pushDownInFilterThreshold) {
>>     values.distinct.flatMap { v =>
>>       makeEq.lift(fieldType).map(_(fieldNames, v))
>>     }.reduceLeftOption(FilterApi.or)
>>   } else if (canPartialPushDownConjuncts) {
>>     if (values.contains(null)) {
>>       Seq(makeEq.lift(fieldType).map(_(fieldNames, null)),
>>         makeInPredicate.lift(fieldType).map(_(fieldNames, values.filter(_ != null)))
>>       ).flatten.reduceLeftOption(FilterApi.or)
>>     } else {
>>       makeInPredicate.lift(fieldType).map(_(fieldNames, values))
>>     }
>>   } else {
>>     None
>>   }
>>
>> I see that when the number of items is less than or equal to
>> spark.sql.parquet.pushdown.inFilterThreshold, ParquetFilters.scala
>> pushes down a chain of ORed equality filters rather than an IN predicate.
>> What are the advantages of doing so?
>>
>> Best Regards,
>> Yian
>>
>
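
To make the branching in the quoted snippet easier to follow, here is a minimal, self-contained sketch of the same decision. It is only a model: `Pushed`, `Eq`, `Or`, `InSet` and `pushDownIn` are hypothetical stand-ins invented for illustration, not Spark or Parquet classes, and the per-type `makeEq` / `makeInPredicate` lookups from the real code are left out. It shows only which shape of predicate comes out for a given threshold, and says nothing about how Parquet evaluates either shape.

```scala
// Hypothetical stand-ins for the pushed-down predicate shapes (not Spark/Parquet types).
sealed trait Pushed
final case class Eq(column: String, value: Any) extends Pushed
final case class Or(left: Pushed, right: Pushed) extends Pushed
final case class InSet(column: String, values: Set[Any]) extends Pushed

object InPushdownModel {
  // Mirrors the branch structure of the quoted snippet, minus the type-specific lookups.
  def pushDownIn(
      column: String,
      values: Seq[Any],
      threshold: Int,
      canPartialPushDown: Boolean = true): Option[Pushed] = {
    if (threshold <= 0 || values.isEmpty) {
      None
    } else if (values.length <= threshold) {
      // At or below the threshold: one equality per distinct value, chained with OR.
      values.distinct
        .map(v => Eq(column, v): Pushed)
        .reduceLeftOption[Pushed]((l, r) => Or(l, r))
    } else if (canPartialPushDown) {
      if (values.contains(null)) {
        // Null is split out as its own equality; the rest goes into the set predicate.
        Some(Or(Eq(column, null), InSet(column, values.filter(_ != null).toSet)))
      } else {
        Some(InSet(column, values.toSet))
      }
    } else {
      None
    }
  }
}
```

With a threshold of 10, `InPushdownModel.pushDownIn("id", Seq(1, 2, 2), 10)` produces an OR chain over the two distinct values. With a threshold of 2, `InPushdownModel.pushDownIn("id", Seq(1, 2, 3), 2)` produces a single set-based predicate. Note that, as in the quoted code, the length check happens before duplicates are removed.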
