[ https://issues.apache.org/jira/browse/SPARK-25209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Herman van Hovell resolved SPARK-25209.
---------------------------------------
    Resolution: Fixed
      Assignee: Bogdan Raducanu
 Fix Version/s: 2.4.0

> Optimization in Dataset.apply for DataFrames
> --------------------------------------------
>
>                 Key: SPARK-25209
>                 URL: https://issues.apache.org/jira/browse/SPARK-25209
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.1
>            Reporter: Bogdan Raducanu
>            Assignee: Bogdan Raducanu
>            Priority: Major
>             Fix For: 2.4.0
>
>
> {{Dataset.apply}} calls {{dataset.deserializer}} (to provide an early error), which ends up running the full {{Analyzer}} on the deserializer. This can take tens of milliseconds, depending on how big the plan is.
> Since {{Dataset.apply}} is called by many {{Dataset}} operations, such as {{Dataset.where}}, it can be a significant overhead for short queries.
> In the following code, {{durationMs}} is *17 ms* in current Spark vs. *1 ms* if I remove the {{dataset.deserializer}} line.
> It seems the resulting {{deserializer}} is particularly big for a nested schema, but the same overhead can be observed with a very wide flat schema.
> According to a comment in the PR that introduced this check, we can at least remove the check for {{DataFrames}}:
> [https://github.com/apache/spark/pull/20402#discussion_r164338267]
> {code}
> import org.apache.spark.sql.functions.lit
>
> // Build a 100-field named_struct so the schema (and hence the deserializer) is wide.
> val col = "named_struct(" +
>   (0 until 100).map { i => s"'col$i', id" }.mkString(",") + ")"
> val df = spark.range(10).selectExpr(col)
> val TRUE = lit(true)
> val numIter = 1000
> val startTime = System.nanoTime()
> // Each where() call goes through Dataset.apply and re-resolves the deserializer.
> for (i <- 0 until numIter) {
>   df.where(TRUE)
> }
> // Average time per iteration, in milliseconds.
> val durationMs = (System.nanoTime() - startTime) / numIter / 1000000
> println(s"duration $durationMs")
> {code}
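For reference, the idea behind the fix is to skip the eager deserializer check when the encoder is the generic {{Row}} encoder, i.e. when the {{Dataset}} is a {{DataFrame}}. The sketch below illustrates that shape only: it is written as if it lived inside Spark's own {{org.apache.spark.sql.Dataset}} companion object, and the {{exprEnc}} / {{deserializer}} members mirror Spark internals, so treat the exact signature as an assumption rather than the merged patch.

{code}
// Sketch only: assumes this sits in Spark's Dataset.scala
// (package org.apache.spark.sql), where encoder internals are visible.
object Dataset {
  def apply[T: Encoder](sparkSession: SparkSession, logicalPlan: LogicalPlan): Dataset[T] = {
    val dataset = new Dataset(sparkSession, logicalPlan, implicitly[Encoder[T]])
    // A DataFrame is a Dataset[Row], and any InternalRow can be converted to a
    // generic Row, so the analyzer-backed deserializer check cannot fail there.
    // Only run the eager check for typed Datasets.
    if (dataset.exprEnc.clsTag.runtimeClass != classOf[Row]) {
      dataset.deserializer // fails fast if the encoder does not match the schema
    }
    dataset
  }
}
{code}

With the check gated on the element type like this, {{DataFrame}} operations such as the {{df.where(TRUE)}} in the benchmark above avoid the {{Analyzer}} round-trip entirely, which is where the reported 17 ms vs. 1 ms gap comes from.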