Re: [Spark SQL] Improving query Performance by increasing spark.sql.parquet.pushdown.inFilterThreshold config

Yuming Wang Thu, 02 Oct 2025 09:50:52 -0700

Please read this comment:
https://github.com/apache/spark/pull/36696#pullrequestreview-987216872


On Thu, Oct 2, 2025 at 3:56 AM Ángel Álvarez Pascua <
[email protected]> wrote:

> The IN clause tends to have a limit (which depends on the data source). I'm
> not so sure about concatenated ORs.
>
> On Wed, Oct 1, 2025, 20:48, Asif Shahid <[email protected]> wrote:
>
>> My take:
>> OR will result in inlining of the OR conditions, which means no Map
>> lookup. So I suppose it would save on the memory associated with Map
>> creation (and that, too, per partition I suppose) and on the lookup costs
>> incurred when implemented using IN.
>> Maybe there are other reasons which I do not know...
>>
>> Regards
>> Asif
>>
>> On Tue, Sep 30, 2025 at 1:37 PM Yian Liou <[email protected]>
>> wrote:
>>
>>> Hi everyone,
>>>
>>> I am looking to increase the value of the config
>>> spark.sql.parquet.pushdown.inFilterThreshold to boost performance for some
>>> queries I am working on. I was reading the implementation in the Spark
>>> repo at
>>> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L798
>>> which contains the following code snippet:
>>>
>>>       case sources.In(name, values) if pushDownInFilterThreshold > 0 && values.nonEmpty &&
>>>           canMakeFilterOn(name, values.head) =>
>>>         val fieldType = nameToParquetField(name).fieldType
>>>         val fieldNames = nameToParquetField(name).fieldNames
>>>         if (values.length <= pushDownInFilterThreshold) {
>>>           values.distinct.flatMap { v =>
>>>             makeEq.lift(fieldType).map(_(fieldNames, v))
>>>           }.reduceLeftOption(FilterApi.or)
>>>         } else if (canPartialPushDownConjuncts) {
>>>           if (values.contains(null)) {
>>>             Seq(makeEq.lift(fieldType).map(_(fieldNames, null)),
>>>               makeInPredicate.lift(fieldType).map(_(fieldNames, values.filter(_ != null)))
>>>             ).flatten.reduceLeftOption(FilterApi.or)
>>>           } else {
>>>             makeInPredicate.lift(fieldType).map(_(fieldNames, values))
>>>           }
>>>         } else {
>>>           None
>>>         }
>>>
>>> I see that when the number of items is less than or equal to
>>> spark.sql.parquet.pushdown.inFilterThreshold, the code in
>>> ParquetFilters.scala pushes down a chain of ORed equality filters rather
>>> than a single IN predicate. What are the advantages of doing so?
>>>
>>> Best Regards,
>>> Yian
>>>
>>
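
For readers following the thread, the threshold decision in the quoted snippet can be sketched with a small toy model. This is only an illustration under stated assumptions: `EqFilter`, `OrFilter`, `InFilter`, `pushDownIn` and `matches` are hypothetical names invented here, not Spark or Parquet classes, and null handling and the `canPartialPushDownConjuncts` check from the real code are left out.

```scala
// Toy model only: these types are hypothetical stand-ins, not Spark/Parquet APIs.
object InPushdownSketch {
  sealed trait Filter
  final case class EqFilter(column: String, value: Any) extends Filter
  final case class OrFilter(left: Filter, right: Filter) extends Filter
  final case class InFilter(column: String, values: Set[Any]) extends Filter

  // Follows the shape of the quoted branch: at or below the threshold, build
  // one equality filter per distinct value and fold them into an OR chain;
  // above the threshold, build a single IN filter backed by a set.
  // Assumption: nulls and canPartialPushDownConjuncts are ignored here.
  def pushDownIn(column: String, values: Seq[Any], threshold: Int): Option[Filter] =
    if (threshold <= 0 || values.isEmpty) None
    else if (values.length <= threshold)
      values.distinct
        .map(v => EqFilter(column, v): Filter)
        .reduceLeftOption[Filter](OrFilter(_, _))
    else Some(InFilter(column, values.toSet))

  // Evaluates a filter against one column value: the OR chain compares the
  // value against each literal in turn, the IN filter does a set lookup.
  def matches(filter: Filter, value: Any): Boolean = filter match {
    case EqFilter(_, v)  => v == value
    case OrFilter(l, r)  => matches(l, value) || matches(r, value)
    case InFilter(_, vs) => vs.contains(value)
  }
}
```

In this sketch, `pushDownIn("id", Seq(1, 2, 2, 3), threshold = 10)` yields a left-nested OR of three equality filters, while the same call with `threshold = 2` yields a single IN filter over the set of values. Which form is cheaper in practice depends on the reader and the data; the linked review comment is the place to look for the actual rationale.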
