Hiroshi Inoue created SPARK-16331:
-------------------------------------

             Summary: [SQL] Reduce code generation time
                 Key: SPARK-16331
                 URL: https://issues.apache.org/jira/browse/SPARK-16331
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.0.0, 2.1.0
            Reporter: Hiroshi Inoue
During code generation, a {{LocalRelation}} often holds a huge {{Vector}} object as its {{data}}. In the simple example below, the {{LocalRelation}} has a {{Vector}} of 1,000,000 {{UnsafeRow}} elements.

{quote}
val numRows = 1000000
val ds = (1 to numRows).toDS().persist()
benchmark.addCase("filter+reduce") { iter =>
  ds.filter(a => (a & 1) == 0).reduce(_ + _)
}
{quote}

In {{TreeNode.transformChildren}}, all elements of the vector are unnecessarily iterated to check whether any children exist among them, since {{Vector}} is {{Traversable}}. This part significantly increases code generation time. This patch avoids the overhead by checking the number of children before iterating over the elements; a {{LocalRelation}} has no children since it extends {{LeafNode}}.

Performance of the above example:

{quote}
without this patch
Java HotSpot(TM) 64-Bit Server VM 1.8.0_91-b14 on Mac OS X 10.11.5
Intel(R) Core(TM) i5-5257U CPU @ 2.70GHz
compilationTime:                    Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
filter+reduce                             4426 / 4533          0.2        4426.0       1.0X

with this patch
compilationTime:                    Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
filter+reduce                             3117 / 3391          0.3        3116.6       1.0X
{quote}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
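The guard can be illustrated with a minimal, self-contained Scala sketch. The {{Node}}/{{Leaf}}/{{Branch}} types and the counter below are hypothetical stand-ins, not Spark's actual {{TreeNode}} internals: the point is only that a node reporting zero children is returned as-is, so its large {{Traversable}} field is never walked.

{quote}
// Hypothetical sketch of the children-count guard; names are illustrative.
object TransformSketch {
  sealed trait Node extends Product { def children: Seq[Node] }
  // Stands in for LocalRelation: a leaf whose only field is a large Vector.
  case class Leaf(data: Vector[Int]) extends Node { val children: Seq[Node] = Nil }
  case class Branch(left: Node, right: Node) extends Node {
    val children: Seq[Node] = Seq(left, right)
  }

  // Counts elements inspected while scanning fields for child nodes.
  var elementsScanned: Long = 0L

  def transformChildren(n: Node, rule: Node => Node): Node = {
    if (n.children.isEmpty) {
      n // the guard: a LeafNode has no children, so skip the field scan
    } else {
      // Only non-leaf nodes pay the cost of walking their Traversable fields.
      n.productIterator.foreach {
        case s: Traversable[_] => s.foreach(_ => elementsScanned += 1)
        case _ =>
      }
      rule(n) // a real transform would rebuild the node; rule suffices here
    }
  }
}
{quote}

Calling {{transformChildren}} on a {{Leaf}} holding a million-element {{Vector}} leaves {{elementsScanned}} at zero, while the unguarded path would touch every element.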