[ https://issues.apache.org/jira/browse/SPARK-36733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17413882#comment-17413882 ]
Kohki Nishio commented on SPARK-36733:
--------------------------------------

[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SchemaPruning.scala#L69]

Often (as far as I have observed) the left struct and the right struct are the same one, and every call to {{StructType.fieldNames}} re-runs {{fields.map(_.name)}}. For 10K fields this computation is quite expensive.

{code:scala}
val filteredRightFieldNames = rightStruct.fieldNames
  .filter(name => leftStruct.fieldNames.exists(resolver(_, name)))
{code}

Hoisting the {{leftStruct.fieldNames}} call out of the filter would avoid the repeated work; a minimal sketch is appended after the quoted issue below.

> Perf issue in SchemaPruning when a struct has million fields
> -------------------------------------------------------------
>
>                 Key: SPARK-36733
>                 URL: https://issues.apache.org/jira/browse/SPARK-36733
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.1.2
>            Reporter: Kohki Nishio
>            Priority: Major
>
> Seeing a significant performance degradation in query processing when a table
> contains a very large number of fields (>10K).
> Here is a stack trace captured while processing such a query:
> {code:java}
> java.lang.Thread.State: RUNNABLE
>   at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:285)
>   at scala.collection.TraversableLike$$Lambda$296/874023329.apply(Unknown Source)
>   at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
>   at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:285)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:278)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
>   at org.apache.spark.sql.types.StructType.fieldNames(StructType.scala:108)
>   at org.apache.spark.sql.catalyst.expressions.SchemaPruning$.$anonfun$sortLeftFieldsByRight$1(SchemaPruning.scala:70)
>   at org.apache.spark.sql.catalyst.expressions.SchemaPruning$.$anonfun$sortLeftFieldsByRight$1$adapted(SchemaPruning.scala:70)
>   at org.apache.spark.sql.catalyst.expressions.SchemaPruning$$$Lambda$3963/249742655.apply(Unknown Source)
>   at scala.collection.TraversableLike.$anonfun$filterImpl$1(TraversableLike.scala:303)
>   at scala.collection.TraversableLike$$Lambda$403/465534593.apply(Unknown Source)
>   at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
>   at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
>   at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:302)
>   at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:296)
>   at scala.collection.mutable.ArrayOps$ofRef.filterImpl(ArrayOps.scala:198)
>   at scala.collection.TraversableLike.filter(TraversableLike.scala:394)
>   at scala.collection.TraversableLike.filter$(TraversableLike.scala:394)
>   at scala.collection.mutable.ArrayOps$ofRef.filter(ArrayOps.scala:198)
>   at org.apache.spark.sql.catalyst.expressions.SchemaPruning$.sortLeftFieldsByRight(SchemaPruning.scala:70)
>   at org.apache.spark.sql.catalyst.expressions.SchemaPruning$.$anonfun$sortLeftFieldsByRight$3(SchemaPruning.scala:75)
>   at org.apache.spark.sql.catalyst.expressions.SchemaPruning$$$Lambda$3965/461314749.apply(Unknown Source)
> {code}
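For illustration, below is a minimal, self-contained sketch of the hoisting described in the comment above. It is not the actual Spark patch; the object name ({{SchemaPruningPerfSketch}}), the method names, and the simplified case-insensitive resolver are assumptions made for the example.

{code:scala}
// Minimal sketch only: not the actual Spark patch. It contrasts the current shape of
// SchemaPruning.sortLeftFieldsByRight (fieldNames recomputed inside the filter) with a
// variant that hoists the call. Requires spark-sql / spark-catalyst on the classpath.
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object SchemaPruningPerfSketch {

  // Stand-in for the configured resolver; assumed case-insensitive here.
  private val resolver: (String, String) => Boolean = _.equalsIgnoreCase(_)

  // Current shape: leftStruct.fieldNames (i.e. fields.map(_.name)) is re-evaluated for
  // every right-hand field, rebuilding the whole name array each time.
  def filterRecomputing(leftStruct: StructType, rightStruct: StructType): Array[String] =
    rightStruct.fieldNames
      .filter(name => leftStruct.fieldNames.exists(resolver(_, name)))

  // Hoisted variant: the name array is built once and reused for every lookup.
  def filterHoisted(leftStruct: StructType, rightStruct: StructType): Array[String] = {
    val leftFieldNames = leftStruct.fieldNames
    rightStruct.fieldNames
      .filter(name => leftFieldNames.exists(resolver(_, name)))
  }

  def main(args: Array[String]): Unit = {
    // Both sides are the same 10K-field struct, matching the observation above.
    val struct = StructType((0 until 10000).map(i => StructField(s"col_$i", StringType)))

    def time[T](label: String)(body: => T): T = {
      val start = System.nanoTime()
      val result = body
      println(f"$label: ${(System.nanoTime() - start) / 1e6}%.1f ms")
      result
    }

    time("fieldNames recomputed per field")(filterRecomputing(struct, struct))
    time("fieldNames hoisted")(filterHoisted(struct, struct))
  }
}
{code}

Note that hoisting only removes the repeated {{fields.map(_.name)}} work; the inner {{exists}} scan is still quadratic, so a precomputed lookup structure keyed the same way the configured resolver compares names would be a further, separate change.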