[ https://issues.apache.org/jira/browse/SPARK-35500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17350458#comment-17350458 ]
Takeshi Yamamuro commented on SPARK-35500:
------------------------------------------

Which version did you use? v3.1.0 does not exist, so did you mean v3.1.1? I ran the queries in v3.1.1 to try to reproduce the issue, but it did not occur:
{code:java}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.1
      /_/

Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_181)

scala> sql("create table test_code_gen(a array<int>)")
scala> sql("insert into test_code_gen values (array(1, 1))")
scala> sc.setLogLevel("debug")

// The first run
scala> sql("select * from test_code_gen").collect()
...
21/05/24 23:14:00 DEBUG GenerateSafeProjection: code for createexternalrow(staticinvoke(class scala.collection.mutable.WrappedArray$, ObjectType(interface scala.collection.Seq), make, mapobjects(lambdavariable(MapObject, IntegerType, true, -1), lambdavariable(MapObject, IntegerType, true, -1), input[0, array<int>, true], None).array, true, false), StructField(a,ArrayType(IntegerType,true),true)):
/* 001 */ public java.lang.Object generate(Object[] references) {
/* 002 */   return new SpecificSafeProjection(references);
/* 003 */ }
/* 004 */
/* 005 */ class SpecificSafeProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection {
/* 006 */
/* 007 */   private Object[] references;
/* 008 */   private InternalRow mutableRow;
/* 009 */   private boolean resultIsNull_0;
/* 010 */   private int value_MapObject_lambda_variable_1;
/* 011 */   private boolean isNull_MapObject_lambda_variable_1;
/* 012 */   private boolean globalIsNull_0;
/* 013 */   private java.lang.Object[] mutableStateArray_0 = new java.lang.Object[1];
/* 014 */
...

// The second run
scala> sql("select * from test_code_gen").collect()
...
21/05/24 23:14:28 DEBUG GenerateSafeProjection: code for createexternalrow(staticinvoke(class scala.collection.mutable.WrappedArray$, ObjectType(interface scala.collection.Seq), make, mapobjects(lambdavariable(MapObject, IntegerType, true, -1), lambdavariable(MapObject, IntegerType, true, -1), input[0, array<int>, true], None).array, true, false), StructField(a,ArrayType(IntegerType,true),true)):
/* 001 */ public java.lang.Object generate(Object[] references) {
/* 002 */   return new SpecificSafeProjection(references);
/* 003 */ }
/* 004 */
/* 005 */ class SpecificSafeProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection {
/* 006 */
/* 007 */   private Object[] references;
/* 008 */   private InternalRow mutableRow;
/* 009 */   private boolean resultIsNull_0;
/* 010 */   private int value_MapObject_lambda_variable_1;
/* 011 */   private boolean isNull_MapObject_lambda_variable_1;
/* 012 */   private boolean globalIsNull_0;
/* 013 */   private java.lang.Object[] mutableStateArray_0 = new java.lang.Object[1];
...
{code}
Actually, this issue should already be fixed by SPARK-27871 ([https://github.com/apache/spark/pull/24735]). Or am I missing something?

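The comparison of the two debug dumps above can also be done mechanically rather than by eye. The following is only a minimal sketch: the class name GeneratedCodeDiff and the method firstDifference are hypothetical helpers, not part of Spark, and the sample lines are copied from the field declarations in this thread.

```java
import java.util.Arrays;
import java.util.List;

public class GeneratedCodeDiff {

    /**
     * Returns the 0-based index of the first line that differs between two
     * code dumps, or -1 if both dumps are identical.
     */
    public static int firstDifference(List<String> first, List<String> second) {
        int common = Math.min(first.size(), second.size());
        for (int i = 0; i < common; i++) {
            if (!first.get(i).equals(second.get(i))) {
                return i;
            }
        }
        // All shared lines match; the dumps differ only if one is longer.
        return first.size() == second.size() ? -1 : common;
    }

    public static void main(String[] args) {
        // Field declarations as logged on v3.1.1 (identical in both runs).
        List<String> run1 = Arrays.asList(
            "private boolean resultIsNull_0;",
            "private int value_MapObject_lambda_variable_1;",
            "private boolean isNull_MapObject_lambda_variable_1;");
        List<String> run2 = Arrays.asList(
            "private boolean resultIsNull_0;",
            "private int value_MapObject_lambda_variable_1;",
            "private boolean isNull_MapObject_lambda_variable_1;");
        System.out.println(firstDifference(run1, run2)); // prints -1

        // Field declarations as given in the issue description (ids change per run).
        List<String> reported1 = Arrays.asList(
            "private int MapObjects_loopValue1;",
            "private boolean MapObjects_loopIsNull1;");
        List<String> reported2 = Arrays.asList(
            "private int MapObjects_loopValue3;",
            "private boolean MapObjects_loopIsNull3;");
        System.out.println(firstDifference(reported1, reported2)); // prints 0
    }
}
```

A result of -1 for two consecutive runs would suggest that the generated source is stable; any other value points at the first line where the dumps diverge.
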
> GenerateSafeProjection.generate will generate SpecificSafeProjection class,
> but if column is array type or map type, the code cannot be reused which
> impact the query performance
> ------------------------------------------------------------------------
>
>                 Key: SPARK-35500
>                 URL: https://issues.apache.org/jira/browse/SPARK-35500
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Yahui Liu
>            Priority: Minor
>              Labels: codegen
>
> Reproduce steps:
> # Create a new table with an array-typed column: create table test_code_gen(a
> array<int>);
> # Add
> log4j.logger.org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator
> = DEBUG to log4j.properties;
> # Enter spark-shell and run a query: spark.sql("select * from
> test_code_gen").collect
> # Every time Dataset.collect is called, the SpecificSafeProjection class is
> generated, but the code for the class cannot be reused because the id of two
> variables in the generated class changes each time: MapObjects_loopValue
> and MapObjects_loopIsNull. So even though the previously generated class has
> been cached, the new code does not match the cache key and has to be
> compiled again, which costs some time. The compile time increases with the
> number of columns; for a wide table, it can exceed 2s.
> {code:java}
> object MapObjects {
>   private val curId = new java.util.concurrent.atomic.AtomicInteger()
>   val id = curId.getAndIncrement()
>   val loopValue = s"MapObjects_loopValue$id"
>   val loopIsNull = if (elementNullable) {
>     s"MapObjects_loopIsNull$id"
>   } else {
>     "false"
>   }
> {code}
> First time run:
> {code:java}
> class SpecificSafeProjection extends
> org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection {
>   private int MapObjects_loopValue1;
>   private boolean MapObjects_loopIsNull1;
>   private UTF8String MapObjects_loopValue2;
>   private boolean MapObjects_loopIsNull2;
> }
> {code}
> Second time run:
> {code:java}
> class SpecificSafeProjection extends
> org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection {
>   private int MapObjects_loopValue3;
>   private boolean MapObjects_loopIsNull3;
>   private UTF8String MapObjects_loopValue4;
>   private boolean MapObjects_loopIsNull4;
> }
> {code}
> Expectation:
> The code generated by GenerateSafeProjection can be reused if the query is
> the same.
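
The cache-miss mechanism described in the issue can be illustrated with a small stand-alone sketch. It assumes a cache keyed by the generated source text, which is what the description implies. Everything here (CodegenCacheSketch, compile, generateWithGlobalId, generateWithLocalId) is hypothetical and simplified: it is not Spark's CodeGenerator, and the per-generation counter is just one illustrative way to obtain stable names, not a claim about how SPARK-27871 addresses the problem.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

public class CodegenCacheSketch {
    // Hypothetical stand-in for a compiled-class cache keyed by source text.
    private static final Map<String, Object> CACHE = new HashMap<>();

    // Counts how many times the simulated compiler actually ran.
    public static int compileCount = 0;

    // Process-wide counter, mirroring curId in the quoted MapObjects snippet.
    private static final AtomicInteger GLOBAL_ID = new AtomicInteger();

    /** Returns the cached "class" for this source, compiling on a miss. */
    public static Object compile(String source) {
        Object cached = CACHE.get(source);
        if (cached == null) {
            compileCount++;        // cache miss: pay the compile cost
            cached = new Object(); // placeholder for the compiled class
            CACHE.put(source, cached);
        }
        return cached;
    }

    /** Variable name depends on a global counter, so it changes on every call. */
    public static String generateWithGlobalId() {
        int id = GLOBAL_ID.getAndIncrement();
        return "private int MapObjects_loopValue" + id + ";";
    }

    /** Variable name depends on a counter that restarts for each generation. */
    public static String generateWithLocalId() {
        AtomicInteger localId = new AtomicInteger(); // fresh per generation
        int id = localId.getAndIncrement();
        return "private int value_MapObject_lambda_variable_" + id + ";";
    }

    public static void main(String[] args) {
        // Same "query" twice, but the source text differs: two compilations.
        compile(generateWithGlobalId());
        compile(generateWithGlobalId());
        System.out.println("compiles with global ids: " + compileCount); // 2

        // Same "query" twice with stable names: only one more compilation.
        compile(generateWithLocalId());
        compile(generateWithLocalId());
        System.out.println("compiles after stable ids: " + compileCount); // 3
    }
}
```

With a global counter, two generations for the same query produce different source text and both miss the cache. With names that are stable across generations, the second lookup is a hit, which matches the expectation stated in the issue.
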