Please read this comment: https://github.com/apache/spark/pull/36696#pullrequestreview-987216872
On Thu, Oct 2, 2025 at 3:56 AM Ángel Álvarez Pascua <[email protected]> wrote:

> The IN clause tends to have a limit (it depends on the datasource). I'm not
> so sure about concatenating ORs.
>
> On Wed, Oct 1, 2025, 20:48, Asif Shahid <[email protected]> wrote:
>
>> My take:
>> OR will result in inlining of the OR conditions, which means no Map
>> lookup. So I suppose it would save on the memory associated with Map
>> creation (and that, I suppose, per partition) and on the lookup costs
>> incurred when implemented using IN.
>> Maybe there are other reasons which I do not know...
>>
>> Regards
>> Asif
>>
>> On Tue, Sep 30, 2025 at 1:37 PM Yian Liou <[email protected]>
>> wrote:
>>
>>> Hi everyone,
>>>
>>> I am looking to increase the value of the config
>>> spark.sql.parquet.pushdown.inFilterThreshold to boost performance for some
>>> queries I am looking at. While looking at the implementation in the Spark
>>> repo at
>>> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L798
>>> I found the following code snippet:
>>>
>>>     case sources.In(name, values) if pushDownInFilterThreshold > 0 &&
>>>         values.nonEmpty &&
>>>         canMakeFilterOn(name, values.head) =>
>>>       val fieldType = nameToParquetField(name).fieldType
>>>       val fieldNames = nameToParquetField(name).fieldNames
>>>       if (values.length <= pushDownInFilterThreshold) {
>>>         values.distinct.flatMap { v =>
>>>           makeEq.lift(fieldType).map(_(fieldNames, v))
>>>         }.reduceLeftOption(FilterApi.or)
>>>       } else if (canPartialPushDownConjuncts) {
>>>         if (values.contains(null)) {
>>>           Seq(makeEq.lift(fieldType).map(_(fieldNames, null)),
>>>             makeInPredicate.lift(fieldType).map(_(fieldNames,
>>>               values.filter(_ != null)))
>>>           ).flatten.reduceLeftOption(FilterApi.or)
>>>         } else {
>>>           makeInPredicate.lift(fieldType).map(_(fieldNames, values))
>>>         }
>>>       } else {
>>>         None
>>>       }
>>>
>>> I see that when the number of items is less than or equal to
>>> spark.sql.parquet.pushdown.inFilterThreshold in ParquetFilters.scala,
>>> Spark pushes down ORs rather than an IN predicate. What are the
>>> advantages of doing so?
>>>
>>> Best Regards,
>>> Yian
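
For readers following the thread, the branching in the quoted snippet can be sketched without any Spark or Parquet dependency. This is only an illustration under stated assumptions: Pred, Eq, Or, In, InPushdownSketch, build and eval are hypothetical names made up here, not Parquet's FilterApi or Spark's ParquetFilters, and the type checks and the canPartialPushDownConjuncts guard from the original are left out.

```scala
// Hypothetical stand-ins for pushed-down predicates; NOT the real Parquet FilterApi types.
sealed trait Pred
final case class Eq(column: String, value: Any) extends Pred
final case class Or(left: Pred, right: Pred) extends Pred
final case class In(column: String, values: Set[Any]) extends Pred

object InPushdownSketch {
  // Follows the shape of the quoted branching: at or below the threshold the
  // values become a chain of ORed equality checks; above it they become a
  // single IN predicate, with null split out into its own equality check.
  def build(column: String, values: Seq[Any], threshold: Int): Option[Pred] =
    if (threshold <= 0 || values.isEmpty) {
      None
    } else if (values.length <= threshold) {
      values.distinct.map(v => Eq(column, v): Pred).reduceLeftOption[Pred](Or(_, _))
    } else if (values.contains(null)) {
      Some(Or(Eq(column, null), In(column, values.filter(_ != null).toSet)))
    } else {
      Some(In(column, values.toSet))
    }

  // Evaluates a predicate against one value: an OR chain walks the inlined
  // comparisons, while IN performs a set lookup.
  def eval(p: Pred, v: Any): Boolean = p match {
    case Eq(_, x)   => x == v
    case Or(l, r)   => eval(l, v) || eval(r, v)
    case In(_, set) => set.contains(v)
  }

  def main(args: Array[String]): Unit = {
    // Four values (three distinct) with threshold 10: an OR chain of equality checks.
    println(build("id", Seq(1, 2, 2, 3), threshold = 10))
    // Three values with threshold 2: a single IN predicate.
    println(build("id", Seq(1, 2, 3), threshold = 2))
  }
}
```

The sketch only makes the two predicate shapes visible: inlined comparisons in the OR case, a set that must be built and probed in the IN case. It says nothing about which one is faster in a real scan.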
