[ https://issues.apache.org/jira/browse/SPARK-23986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16610330#comment-16610330 ]

Dmitry Zanozin commented on SPARK-23986:
----------------------------------------

Well, it's a bit hard to provide the exact code we use when this failure 
occurs: it involves many aggregated attributes and several custom UDFs. The 
same code works fine with a different dataframe, which differs from the 
failing one only by having 1 string attribute instead of 4 int ones (among 
22 other attributes). Here is the schema of a DF which fails:
{noformat}
scala> df.printSchema()
root
 |-- attr1: timestamp (nullable = true)
 |-- attr2: string (nullable = true)
 |-- attr3: integer (nullable = true)
 |-- attr4: integer (nullable = true)
 |-- attr5: string (nullable = true)
 |-- attr6: integer (nullable = true)
 |-- attr7: byte (nullable = true)
 |-- attr8: integer (nullable = true)
 |-- attr9: timestamp (nullable = true)
 |-- attr10: string (nullable = true)
 |-- attr11: byte (nullable = true)
 |-- attr12: short (nullable = true)
 |-- attr13: short (nullable = true)
 |-- attr14: integer (nullable = true)
 |-- attr15: integer (nullable = true)
 |-- attr16: short (nullable = true)
 |-- attr17: integer (nullable = true)
 |-- attr18: string (nullable = true)
 |-- attr19: byte (nullable = true)
 |-- attr20: byte (nullable = true)
 |-- attr21: integer (nullable = true)
 |-- attr22: date (nullable = true)
{noformat}
If we replace attr12-attr15 (which together form the unique record key and 
are part of a group by statement) with a single string attribute (the 
primary key of another object type), the code works fine.

Information about the execution environment:
{noformat}
spark-2.3.1-bin-without-hadoop
hadoop-2.8.4
Scala 2.11.8
Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_162
{noformat}
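As background for the quoted issue below: the naming collision in {{CodegenContext.freshName}} can be reproduced without Spark at all. Here is a minimal standalone Scala sketch of the same bookkeeping (the object and method layout here are hypothetical; only the suffixing logic is taken from the quoted code):

```scala
import scala.collection.mutable

// Standalone sketch of the freshName bookkeeping quoted in the issue below,
// showing how two doConsume passes produce the duplicate name "agg_expr_11".
object FreshNameDemo {
  private val freshNameIds = mutable.HashMap.empty[String, Int]

  def freshName(fullName: String): String = synchronized {
    if (freshNameIds.contains(fullName)) {
      val id = freshNameIds(fullName)
      freshNameIds(fullName) = id + 1
      s"$fullName$id"          // "agg_expr_1" + id 1 => "agg_expr_11"
    } else {
      freshNameIds += fullName -> 1
      fullName
    }
  }

  def demo(): Seq[String] = {
    // 1st pass registers agg_expr_1 .. agg_expr_6
    (1 to 6).foreach(i => freshName(s"agg_expr_$i"))
    // 2nd pass asks for agg_expr_1 .. agg_expr_12; the suffixed result of
    // agg_expr_1 ("agg_expr_11") collides with the genuinely new agg_expr_11.
    (1 to 12).map(i => freshName(s"agg_expr_$i"))
  }
}
```

Calling {{demo()}} returns "agg_expr_11" twice: once as agg_expr_1 plus suffix 1, and once as the fresh name agg_expr_11, which matches the redefinition-of-parameter error reported in this ticket.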

> CompileException when using too many avg aggregation after joining
> ------------------------------------------------------------------
>
>                 Key: SPARK-23986
>                 URL: https://issues.apache.org/jira/browse/SPARK-23986
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.0
>            Reporter: Michel Davit
>            Assignee: Marco Gaido
>            Priority: Major
>             Fix For: 2.3.1, 2.4.0
>
>         Attachments: spark-generated.java
>
>
> Considering the following code:
> {code:java}
>     val df1: DataFrame = sparkSession.sparkContext
>       .makeRDD(Seq((0, 1, 2, 3, 4, 5, 6)))
>       .toDF("key", "col1", "col2", "col3", "col4", "col5", "col6")
>     val df2: DataFrame = sparkSession.sparkContext
>       .makeRDD(Seq((0, "val1", "val2")))
>       .toDF("key", "dummy1", "dummy2")
>     val agg = df1
>       .join(df2, df1("key") === df2("key"), "leftouter")
>       .groupBy(df1("key"))
>       .agg(
>         avg("col2").as("avg2"),
>         avg("col3").as("avg3"),
>         avg("col4").as("avg4"),
>         avg("col1").as("avg1"),
>         avg("col5").as("avg5"),
>         avg("col6").as("avg6")
>       )
>     val head = agg.take(1)
> {code}
> This logs the following exception:
> {code:java}
> ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 467, Column 28: Redefinition of parameter "agg_expr_11"
> {code}
> I am not a Spark expert, but after some investigation I realized that the 
> generated {{doConsume}} method is responsible for the exception.
> Indeed, {{avg}} calls 
> {{org.apache.spark.sql.execution.CodegenSupport.constructDoConsumeFunction}} 
> several times: a 1st time with the 'avg' Expr, and a second time for the 
> base aggregation Exprs (count and sum).
> The problem comes from the generation of parameters in CodeGenerator:
> {code:java}
>   /**
>    * Returns a term name that is unique within this instance of a 
> `CodegenContext`.
>    */
>   def freshName(name: String): String = synchronized {
>     val fullName = if (freshNamePrefix == "") {
>       name
>     } else {
>       s"${freshNamePrefix}_$name"
>     }
>     if (freshNameIds.contains(fullName)) {
>       val id = freshNameIds(fullName)
>       freshNameIds(fullName) = id + 1
>       s"$fullName$id"
>     } else {
>       freshNameIds += fullName -> 1
>       fullName
>     }
>   }
> {code}
> The {{freshNameIds}} map already contains {{agg_expr_[1..6]}} from the 1st call.
>  The second call is made with {{agg_expr_[1..12]}} and generates the 
> following names:
>  {{agg_expr_[11|21|31|41|51|61|11|12]}}. We then have a parameter name 
> conflict in the generated code: {{agg_expr_11}}.
> Appending the 'id' in {{s"$fullName$id"}} to generate a unique term name is 
> the source of the conflict. Maybe simply using an underscore separator can 
> solve this issue: {{s"${fullName}_$id"}}
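The underscore suggestion above can be sanity-checked with the same kind of standalone sketch (again hypothetical code, not the actual patch that shipped in 2.3.1/2.4.0, which may differ):

```scala
import scala.collection.mutable

// Sketch of freshName with an underscore between the base name and the
// numeric suffix, so suffixed names can no longer collide with literal
// parameter names like "agg_expr_11".
object FreshNameFixed {
  private val freshNameIds = mutable.HashMap.empty[String, Int]

  def freshName(fullName: String): String = synchronized {
    if (freshNameIds.contains(fullName)) {
      val id = freshNameIds(fullName)
      freshNameIds(fullName) = id + 1
      s"${fullName}_$id"     // "agg_expr_1" => "agg_expr_1_1", not "agg_expr_11"
    } else {
      freshNameIds += fullName -> 1
      fullName
    }
  }

  def demo(): Seq[String] = {
    (1 to 6).foreach(i => freshName(s"agg_expr_$i"))   // 1st doConsume pass
    (1 to 12).map(i => freshName(s"agg_expr_$i"))      // 2nd pass: all distinct
  }
}
```

With the separator, the second pass yields agg_expr_1_1 .. agg_expr_6_1 plus the fresh agg_expr_7 .. agg_expr_12, with no duplicates.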



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
