The IN clause usually has a limit on the number of values it accepts (the limit depends on the data source). I'm not sure whether the same applies to a chain of concatenated ORs.

On Wed, 1 Oct 2025, 20:48, Asif Shahid <[email protected]> wrote:

> My take:
> OR will result in inlining of the OR conditions, which means no Map
> lookup. So I suppose it would save on the memory associated with Map
> creation (and that too, I suppose, per partition) and on the lookup costs
> incurred when implemented using IN.
> Maybe there are other reasons which I do not know...
>
> Regards
> Asif
>
> On Tue, Sep 30, 2025 at 1:37 PM Yian Liou <[email protected]>
> wrote:
>
>> Hi everyone,
>>
>> I am looking to increase the value of the config
>> spark.sql.parquet.pushdown.inFilterThreshold to boost performance for some
>> queries I am looking at. While looking at the implementation in the Spark
>> repo at
>> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L798
>> with the following code snippet
>>
>>   case sources.In(name, values) if pushDownInFilterThreshold > 0 &&
>>       values.nonEmpty &&
>>       canMakeFilterOn(name, values.head) =>
>>     val fieldType = nameToParquetField(name).fieldType
>>     val fieldNames = nameToParquetField(name).fieldNames
>>     if (values.length <= pushDownInFilterThreshold) {
>>       values.distinct.flatMap { v =>
>>         makeEq.lift(fieldType).map(_(fieldNames, v))
>>       }.reduceLeftOption(FilterApi.or)
>>     } else if (canPartialPushDownConjuncts) {
>>       if (values.contains(null)) {
>>         Seq(makeEq.lift(fieldType).map(_(fieldNames, null)),
>>           makeInPredicate.lift(fieldType).map(_(fieldNames,
>>             values.filter(_ != null)))
>>         ).flatten.reduceLeftOption(FilterApi.or)
>>       } else {
>>         makeInPredicate.lift(fieldType).map(_(fieldNames, values))
>>       }
>>     } else {
>>       None
>>     }
>>
>> I see that when the number of items is less than or equal to
>> spark.sql.parquet.pushdown.inFilterThreshold in ParquetFilters.scala,
>> ORs are pushed down to Parquet rather than an IN predicate. What are the
>> advantages of doing so?
>>
>> Best Regards,
>> Yian
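
For anyone who wants to experiment with the threshold behaviour discussed in this thread, here is a minimal, self-contained Scala sketch. It is only a toy model: Pred, Eq, Or, In and InThresholdSketch are hypothetical names made up for illustration, not Spark or Parquet classes. It also leaves out the null handling and the canPartialPushDownConjuncts check from the quoted snippet. What it shows is the shape of the decision: at or below the threshold the values become a chain of ORed equality checks, and above it they become a single set-backed IN predicate.

```scala
// Toy model of the threshold decision. These types are illustrative
// assumptions, not the real Spark or Parquet filter classes.
sealed trait Pred
final case class Eq(column: String, value: Any) extends Pred
final case class Or(left: Pred, right: Pred) extends Pred
final case class In(column: String, values: Set[Any]) extends Pred

object InThresholdSketch {
  // Small value lists become an OR chain of equality predicates;
  // larger lists become one IN predicate backed by a Set.
  def pushDown(column: String, values: Seq[Any], threshold: Int): Option[Pred] =
    if (threshold <= 0 || values.isEmpty) None
    else if (values.length <= threshold)
      values.distinct
        .map(v => Eq(column, v): Pred)
        .reduceLeftOption[Pred]((l, r) => Or(l, r))
    else Some(In(column, values.toSet))

  // Evaluate a predicate against one value of the column.
  def eval(p: Pred, v: Any): Boolean = p match {
    case Eq(_, x)  => x == v                      // direct comparison, no lookup structure
    case Or(l, r)  => eval(l, v) || eval(r, v)    // walks the chain linearly
    case In(_, vs) => vs.contains(v)              // hash lookup, but the Set must be built
  }
}
```

With a threshold of 10, pushDown("id", Seq(1, 2, 2), 10) yields an OR of two equality predicates (duplicates are removed first), while the same call with a threshold of 1 yields a single IN over the set of values. Both forms accept exactly the same rows, so the difference lies in evaluation cost rather than in the result.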
