[jira] [Updated] (SPARK-35500) GenerateSafeProjection.generate will generate SpecificSafeProjection class, but if column is array type or map type, the code cannot be reused which impact the query performance

Yahui Liu (Jira) Mon, 24 May 2021 01:31:04 -0700


     [ 
https://issues.apache.org/jira/browse/SPARK-35500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Yahui Liu updated SPARK-35500:
------------------------------
    Description: 
Steps to reproduce:
 # Create a new table with an array-type column: create table test_code_gen(a array<int>);
 # Add 
log4j.logger.org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator = 
DEBUG to log4j.properties.
 # Start spark-shell and run a query: spark.sql("select * from 
test_code_gen").collect
 # Every time Dataset.collect is called, a SpecificSafeProjection class is 
generated, but its code cannot be reused because the id embedded in two 
variable names of the generated class changes on every run: 
MapObjects_loopValue and MapObjects_loopIsNull. So even though the previously 
generated class has been cached, the new code does not match the cache key and 
has to be compiled again, which costs time.
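{code:scala}
// Excerpt as posted in the report; the surrounding code is omitted.
object MapObjects {
  private val curId = new java.util.concurrent.atomic.AtomicInteger()
  // A new id is taken from the shared counter each time this code runs,
  // so the generated variable names differ between runs.
  val id = curId.getAndIncrement()
  val loopValue = s"MapObjects_loopValue$id"
  val loopIsNull = if (elementNullable) {
    s"MapObjects_loopIsNull$id"
  } else {
    "false"
  }
{code}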
 # Compilation time grows with the number of columns; for a wide table it can 
exceed 2s.
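The cache miss described above can be illustrated with a minimal, self-contained sketch. It is not Spark code: CodegenCacheSketch, generate, compile, and the HashMap keyed by source text are hypothetical stand-ins for the code generator and its compile cache, and the counter starts at 1 only to mirror the ids shown in this report. It assumes the cache key is the generated source text, as the report describes.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

class CodegenCacheSketch {
    // Hypothetical stand-in for a process-wide id counter such as MapObjects.curId.
    static final AtomicInteger globalId = new AtomicInteger(1);

    // Hypothetical stand-in for a compile cache keyed by generated source text.
    static final Map<String, Object> cache = new HashMap<>();
    static int compileCount = 0;

    // Emits a tiny class body whose field names embed ids taken from the given counter.
    static String generate(AtomicInteger ids, int arrayColumns) {
        StringBuilder sb = new StringBuilder("class SpecificSafeProjection {\n");
        for (int i = 0; i < arrayColumns; i++) {
            int id = ids.getAndIncrement();
            sb.append("  private int MapObjects_loopValue").append(id).append(";\n");
            sb.append("  private boolean MapObjects_loopIsNull").append(id).append(";\n");
        }
        return sb.append("}\n").toString();
    }

    // Looks the source up in the cache and "compiles" (counts) only on a miss.
    static Object compile(String source) {
        return cache.computeIfAbsent(source, s -> {
            compileCount++;
            return new Object(); // placeholder for a compiled class
        });
    }

    public static void main(String[] args) {
        // Shared counter: the second run gets ids 3 and 4, so the source text differs.
        compile(generate(globalId, 2));
        compile(generate(globalId, 2));
        System.out.println("compiles with global ids: " + compileCount);  // prints 2

        // Fresh counter per generation: identical text, so the second run is a cache hit.
        cache.clear();
        compileCount = 0;
        compile(generate(new AtomicInteger(1), 2));
        compile(generate(new AtomicInteger(1), 2));
        System.out.println("compiles with per-run ids: " + compileCount); // prints 1
    }
}
```

The per-run counter in the second half is only one way to make the text stable in this toy model; whether it is suitable for Spark itself is outside what this report establishes.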

 

         First time run:

          class SpecificSafeProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection {

                    private int MapObjects_loopValue{color:#FF0000}1{color};
                    private boolean MapObjects_loopIsNull{color:#FF0000}1{color};
                    private UTF8String MapObjects_loopValue{color:#FF0000}2{color};
                    private boolean MapObjects_loopIsNull{color:#FF0000}2{color};

          }

         Second time run:

          class SpecificSafeProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseProjection {

                    private int MapObjects_loopValue{color:#FF0000}3{color};
                    private boolean MapObjects_loopIsNull{color:#FF0000}3{color};
                    private UTF8String MapObjects_loopValue{color:#FF0000}4{color};
                    private boolean MapObjects_loopIsNull{color:#FF0000}4{color};

          }
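
To check that the two generated classes differ only in the numeric id suffix, the field declarations from both runs can be compared after renumbering the ids in order of first appearance. The sketch below is a comparison aid only, not a proposed Spark change; LoopVarDiffSketch and normalize are hypothetical names, and the two strings are the field declarations shown in this report with the color markup removed.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LoopVarDiffSketch {
    // Matches the two variable-name prefixes named in the report, followed by the numeric id.
    static final Pattern LOOP_VAR =
        Pattern.compile("(MapObjects_loop(?:Value|IsNull))(\\d+)");

    static final String FIRST_RUN =
        "private int MapObjects_loopValue1;\n" +
        "private boolean MapObjects_loopIsNull1;\n" +
        "private UTF8String MapObjects_loopValue2;\n" +
        "private boolean MapObjects_loopIsNull2;\n";

    static final String SECOND_RUN =
        "private int MapObjects_loopValue3;\n" +
        "private boolean MapObjects_loopIsNull3;\n" +
        "private UTF8String MapObjects_loopValue4;\n" +
        "private boolean MapObjects_loopIsNull4;\n";

    // Renumbers ids as 1, 2, ... in order of first appearance (for comparison only).
    static String normalize(String source) {
        Map<String, Integer> renumbered = new HashMap<>();
        Matcher m = LOOP_VAR.matcher(source);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            Integer id = renumbered.get(m.group(2));
            if (id == null) {
                id = renumbered.size() + 1;
                renumbered.put(m.group(2), id);
            }
            m.appendReplacement(out, Matcher.quoteReplacement(m.group(1) + id));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // The raw texts differ, so a cache keyed by source text would miss.
        System.out.println(FIRST_RUN.equals(SECOND_RUN));                       // prints false
        // After renumbering, the texts are identical: only the ids differ.
        System.out.println(normalize(FIRST_RUN).equals(normalize(SECOND_RUN))); // prints true
    }
}
```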

  was:
Steps to reproduce:
 # Create a new table with an array-type column: create table test_code_gen(a array<int>);
 # Add 
log4j.logger.org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator = 
DEBUG to log4j.properties.
 # Start spark-shell and run a query: spark.sql("select * from 
test_code_gen").collect
 # Every time Dataset.collect is called, a SpecificSafeProjection class is 
generated, but its code cannot be reused because the id embedded in two 
variable names of the generated class changes on every run: 
MapObjects_loopValue and MapObjects_loopIsNull. So even though the previously 
generated class has been cached, the new code does not match the cache key and 
has to be compiled again, which costs time.

object MapObjects {
 private val curId = new java.util.concurrent.atomic.AtomicInteger()
 # Compilation time grows with the number of columns; for a wide table it can 
exceed 2s.


> GenerateSafeProjection.generate will generate SpecificSafeProjection class, 
> but if column is array type or map type, the code cannot be reused which 
> impact the query performance
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-35500
>                 URL: https://issues.apache.org/jira/browse/SPARK-35500
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Yahui Liu
>            Priority: Minor
>              Labels: codegen



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
