[jira] [Commented] (SPARK-23986) CompileException when using too many avg aggregation after joining

2019-05-24 Thread Siddharth Dangi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16847906#comment-16847906
 ] 

Siddharth Dangi commented on SPARK-23986:
-

[~pedromorfeu] I had 25 operations in the aggregation – once I split it up into 
5 groups of 5, it worked!  Thank you very much for your workaround and 
suggestion.

As a reference for anyone else who wants to try this – here is the workaround 
code that worked for me:
{code:java}
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, sum}

val sumColNames = List[String](...)  // list of 25 column names
val sumCols: List[Column] = sumColNames.map(name => sum(col(name)))

val grouped = input.groupBy(groupByColNames map col: _*)

/** Only do 5 agg operations at a time, collect the results into a list,
  * and then use reduce to join them all together into one DataFrame.
  */
val step = 5
val output = List.range(0, sumCols.length, step)
  .map { idx =>
    val cols = sumCols.slice(idx, idx + step)
    grouped.agg(cols.head, cols.tail: _*)
  }
  .reduce((df1, df2) => df1.join(df2, groupByColNames))
{code}
 

 

> CompileException when using too many avg aggregation after joining
> --
>
> Key: SPARK-23986
> URL: https://issues.apache.org/jira/browse/SPARK-23986
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.0
> Reporter: Michel Davit
> Assignee: Marco Gaido
> Priority: Major
> Fix For: 2.3.1, 2.4.0
>
> Attachments: spark-generated.java
>
>
> Considering the following code:
> {code:java}
> import org.apache.spark.sql.DataFrame
> import org.apache.spark.sql.functions.avg
> import sparkSession.implicits._  // needed for toDF
>
> val df1: DataFrame = sparkSession.sparkContext
>   .makeRDD(Seq((0, 1, 2, 3, 4, 5, 6)))
>   .toDF("key", "col1", "col2", "col3", "col4", "col5", "col6")
> val df2: DataFrame = sparkSession.sparkContext
>   .makeRDD(Seq((0, "val1", "val2")))
>   .toDF("key", "dummy1", "dummy2")
> val agg = df1
>   .join(df2, df1("key") === df2("key"), "leftouter")
>   .groupBy(df1("key"))
>   .agg(
>     avg("col2").as("avg2"),
>     avg("col3").as("avg3"),
>     avg("col4").as("avg4"),
>     avg("col1").as("avg1"),
>     avg("col5").as("avg5"),
>     avg("col6").as("avg6")
>   )
> val head = agg.take(1)
> {code}
> This logs the following exception:
> {code:java}
> ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 467, Column 28: Redefinition of parameter "agg_expr_11"
> {code}
> I am not a Spark expert, but after investigating I realized that the 
> generated {{doConsume}} method is responsible for the exception.
> Indeed, {{avg}} calls 
> {{org.apache.spark.sql.execution.CodegenSupport.constructDoConsumeFunction}} 
> several times: the first time with the 'avg' Expr, and a second time for the 
> base aggregation Expr (count and sum).
> The problem comes from the generation of parameter names in CodeGenerator:
> {code:java}
>   /**
>    * Returns a term name that is unique within this instance of a
>    * `CodegenContext`.
>    */
>   def freshName(name: String): String = synchronized {
>     val fullName = if (freshNamePrefix == "") {
>       name
>     } else {
>       s"${freshNamePrefix}_$name"
>     }
>     if (freshNameIds.contains(fullName)) {
>       val id = freshNameIds(fullName)
>       freshNameIds(fullName) = id + 1
>       s"$fullName$id"
>     } else {
>       freshNameIds += fullName -> 1
>       fullName
>     }
>   }
> {code}
> {{freshNameIds}} already contains {{agg_expr_[1..6]}} from the first call.
> The second call is made with {{agg_expr_[1..12]}} and generates the 
> following names:
> {{agg_expr_[11|21|31|41|51|61|11|12]}}. We then have a parameter name 
> conflict in the generated code: {{agg_expr_11}}.
> Appending the 'id' directly, as in {{s"$fullName$id"}}, to generate a unique 
> term name is the source of the conflict. Maybe simply using an underscore as 
> a separator can solve this issue: {{s"${fullName}_$id"}}
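
The collision described in the report can be reproduced without Spark. The sketch below is a stand-alone imitation of the quoted {{freshName}} logic, not Spark's actual class: {{NameGen}}, {{FreshNameDemo}} and the {{separator}} parameter are hypothetical names introduced only for illustration, and the two rounds of calls are assumed to mirror the "first call" and "second call" from the description.

```scala
import scala.collection.mutable

// Stand-alone imitation of the naming logic quoted in the report.
// `separator` is a hypothetical knob: "" mimics s"$fullName$id",
// "_" mimics the underscore variant suggested by the reporter.
final class NameGen(prefix: String, separator: String) {
  private val ids = mutable.HashMap.empty[String, Int]

  def freshName(name: String): String = synchronized {
    val fullName = if (prefix.isEmpty) name else s"${prefix}_$name"
    ids.get(fullName) match {
      case Some(id) =>
        ids(fullName) = id + 1
        s"$fullName$separator$id"
      case None =>
        ids(fullName) = 1
        fullName
    }
  }
}

object FreshNameDemo {
  // Returns the names produced by the second round of requests.
  def generate(separator: String): List[String] = {
    val gen = new NameGen("agg", separator)
    (1 to 6).foreach(i => gen.freshName(s"expr_$i"))      // first call: expr_1..expr_6
    (1 to 12).map(i => gen.freshName(s"expr_$i")).toList  // second call: expr_1..expr_12
  }

  // Names that occur more than once in the given list.
  def duplicates(names: List[String]): Set[String] =
    names.groupBy(identity).collect { case (n, ns) if ns.size > 1 => n }.toSet

  def main(args: Array[String]): Unit = {
    println(duplicates(generate("")))   // plain concatenation
    println(duplicates(generate("_")))  // underscore separator
  }
}
```

With plain concatenation, the second request for {{expr_1}} yields {{agg_expr_11}}, which is also the name handed out for the first request of {{expr_11}}. With the underscore separator the same inputs produce no duplicate, although this sketch does not show that the underscore variant is collision-free for every possible input.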



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23986) CompileException when using too many avg aggregation after joining

2019-05-10 Thread Pedro Fernandes (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16836950#comment-16836950
 ] 

Pedro Fernandes commented on SPARK-23986:
-

[~sdangi1], how many operations do you have in each aggregation? If there are 
more than 5, I would suggest slicing it further.




[jira] [Commented] (SPARK-23986) CompileException when using too many avg aggregation after joining

2019-05-09 Thread Siddharth Dangi (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16836658#comment-16836658
 ] 

Siddharth Dangi commented on SPARK-23986:
-

[~pedromorfeu] I tried the workaround you mentioned above, but still 
encountered this issue (my code is below).

Since I don't have access to Spark 2.3.1, is there another workaround I can try 
with Spark 2.3.0?
{code:java}
val sumColNames = List[String](...)  // list of 25 Strings
val sumCols: List[Column] = sumColNames.map(name => sum(col(name)))

/** code that causes error */
val output = input
  .groupBy(groupByColNames map col: _*)
  .agg(sumCols.head, sumCols.tail: _*)

/** workaround I tried */
val middleIdx = sumCols.length / 2
val sumColsFirstHalf = sumCols.slice(0, middleIdx)
val sumColsSecondHalf = sumCols.slice(middleIdx, sumCols.length)

val grouped = input.groupBy(groupByCols)
val data1 = grouped.agg(sumColsFirstHalf.head, sumColsFirstHalf.tail: _*)
val data2 = grouped.agg(sumColsSecondHalf.head, sumColsSecondHalf.tail: _*)
val output = data1.join(data2, groupByColNames)
{code}




[jira] [Commented] (SPARK-23986) CompileException when using too many avg aggregation after joining

2019-03-05 Thread Pedro Fernandes (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16784482#comment-16784482
 ] 

Pedro Fernandes commented on SPARK-23986:
-

Guys,

Is there a workaround for those who can't upgrade their Spark version?

Thanks.




[jira] [Commented] (SPARK-23986) CompileException when using too many avg aggregation after joining

2018-09-11 Thread Dmitry Zanozin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16610412#comment-16610412
 ] 

Dmitry Zanozin commented on SPARK-23986:


Please accept my apologies. I should have spent more time on this 
investigation. The problem was specific to one machine configuration, the one 
where we run the tests, which had a mix of old and new Spark/Hadoop versions.
The code runs fine on other machines and works as expected on the test machine 
after cleaning up the Spark 2.3.0 leftovers.
It now generates the method signature as expected:
{code}
private void agg_doConsume_1(byte agg_expr_0_1, boolean agg_exprIsNull_0_1,
 short agg_expr_1_1, boolean agg_exprIsNull_1_1,
...
{code}
Thank you for your time and sorry again!
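
For anyone hitting the same symptom, one way to spot leftover jars from an older install is to ask the JVM where a class was actually loaded from. This is only a sketch using standard JVM reflection: {{ClasspathCheck}} and {{locationOf}} are hypothetical names, and the class names queried in {{main}} resolve only if Spark is on the classpath. In a {{spark-shell}}, {{spark.version}} also reports the version of the running session.

```scala
import scala.util.Try

object ClasspathCheck {
  // Returns the location (usually a jar path) a class was loaded from.
  // None means the class is missing or the JVM does not expose its source.
  def locationOf(className: String): Option[String] =
    for {
      cls    <- Try(Class.forName(className)).toOption
      domain <- Option(cls.getProtectionDomain)
      source <- Option(domain.getCodeSource)
      url    <- Option(source.getLocation)
    } yield url.toString

  def main(args: Array[String]): Unit = {
    // Whether these names resolve depends on what is on the classpath.
    Seq("org.apache.spark.SparkContext", "org.apache.spark.sql.SparkSession")
      .foreach { name =>
        println(s"$name -> ${locationOf(name).getOrElse("not found")}")
      }
  }
}
```

If the printed path points at a 2.3.0 jar while the job is meant to run on 2.3.1, the fix for this issue is not in effect.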




[jira] [Commented] (SPARK-23986) CompileException when using too many avg aggregation after joining

2018-09-11 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16610334#comment-16610334
 ] 

Marco Gaido commented on SPARK-23986:
-

[~dzanozin] can you then please try the reproducer reported above in the JIRA 
in your environment? I just tried it on 2.3.1 and it works for me. That way we 
can at least be sure that this is a different issue. Without a reproducer it is 
nearly impossible to work on this because, as I mentioned, you seem to be 
missing the patch; otherwise {{agg_doConsume1}} would have been 
{{agg_doConsume_1}}.




[jira] [Commented] (SPARK-23986) CompileException when using too many avg aggregation after joining

2018-09-11 Thread Dmitry Zanozin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16610330#comment-16610330
 ] 

Dmitry Zanozin commented on SPARK-23986:


Well, it's a bit hard to provide the exact code we use when this failure 
occurs: it has many aggregated attributes and several custom UDFs. The same 
code works fine with a different DataFrame, which differs from the first one by 
having 1 string attribute instead of 4 int ones (among other 22 attributes). 
Here is a DataFrame schema that fails:
{noformat}
scala> df.printSchema()
root
 |-- attr1: timestamp (nullable = true)
 |-- attr2: string (nullable = true)
 |-- attr3: integer (nullable = true)
 |-- attr4: integer (nullable = true)
 |-- attr5: string (nullable = true)
 |-- attr6: integer (nullable = true)
 |-- attr7: byte (nullable = true)
 |-- attr8: integer (nullable = true)
 |-- attr9: timestamp (nullable = true)
 |-- attr10: string (nullable = true)
 |-- attr11: byte (nullable = true)
 |-- attr12: short (nullable = true)
 |-- attr13: short (nullable = true)
 |-- attr14: integer (nullable = true)
 |-- attr15: integer (nullable = true)
 |-- attr16: short (nullable = true)
 |-- attr17: integer (nullable = true)
 |-- attr18: string (nullable = true)
 |-- attr19: byte (nullable = true)
 |-- attr20: byte (nullable = true)
 |-- attr21: integer (nullable = true)
 |-- attr22: date (nullable = true)
{noformat}
If we replace attr12-attr15 (which represent the unique record key and are part 
of the group-by clause) with a single string attribute (a primary key of 
another object type), the code works fine.

Information about the execution environment:
{noformat}
spark-2.3.1-bin-without-hadoop
hadoop-2.8.4
Scala 2.11.8
Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_162
{noformat}


[jira] [Commented] (SPARK-23986) CompileException when using too many avg aggregation after joining

2018-09-11 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16610259#comment-16610259
 ] 

Marco Gaido commented on SPARK-23986:
-

[~dzanozin] thanks for reporting this. Could you please provide a reproducer 
for it? Thanks.




[jira] [Commented] (SPARK-23986) CompileException when using too many avg aggregation after joining

2018-09-10 Thread Dmitry Zanozin (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16609708#comment-16609708
 ] 

Dmitry Zanozin commented on SPARK-23986:


Spark 2.3.1 still generates methods with duplicate parameter names. I've just 
got the method below, which failed with the following exception: 
{{ERROR CodeGenerator:91 - failed to compile: 
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
686, Column 28: Redefinition of parameter "agg_expr_21"}}:

{code}
/* 686 */
private void agg_doConsume1(byte agg_expr_01, boolean agg_exprIsNull_01,
  short agg_expr_11, boolean agg_exprIsNull_11,
  short agg_expr_21, boolean agg_exprIsNull_21,
  int agg_expr_31, boolean agg_exprIsNull_31,
  int agg_expr_41, boolean agg_exprIsNull_41,
  int agg_expr_51, boolean agg_exprIsNull_51,
  UTF8String agg_expr_61, boolean agg_exprIsNull_61,
  byte agg_expr_71, boolean agg_exprIsNull_71,
  long agg_expr_81, boolean agg_exprIsNull_81,
  double agg_expr_91, boolean agg_exprIsNull_91,
  long agg_expr_101, boolean agg_exprIsNull_101,
  double agg_expr_111, boolean agg_exprIsNull_111,
  long agg_expr_121, boolean agg_exprIsNull_121,
  int agg_expr_131, boolean agg_exprIsNull_131,
  long agg_expr_141, boolean agg_exprIsNull_141,
  int agg_expr_151, boolean agg_exprIsNull_151,
  boolean agg_expr_161, boolean agg_exprIsNull_161,
  long agg_expr_171,
  byte agg_expr_18, boolean agg_exprIsNull_18,
  boolean agg_expr_19, boolean agg_exprIsNull_19,
  byte agg_expr_20, boolean agg_exprIsNull_20,
  boolean agg_expr_21, boolean agg_exprIsNull_21,
  short agg_expr_22, boolean agg_exprIsNull_22,
  int agg_expr_23, boolean agg_exprIsNull_23) throws java.io.IOException {
{code}

> CompileException when using too many avg aggregation after joining
> --
>
> Key: SPARK-23986
> URL: https://issues.apache.org/jira/browse/SPARK-23986
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.0
> Reporter: Michel Davit
> Assignee: Marco Gaido
> Priority: Major
> Fix For: 2.3.1, 2.4.0
>
> Attachments: spark-generated.java
>
>
> Considering the following code:
> {code:java}
> val df1: DataFrame = sparkSession.sparkContext
>   .makeRDD(Seq((0, 1, 2, 3, 4, 5, 6)))
>   .toDF("key", "col1", "col2", "col3", "col4", "col5", "col6")
> val df2: DataFrame = sparkSession.sparkContext
>   .makeRDD(Seq((0, "val1", "val2")))
>   .toDF("key", "dummy1", "dummy2")
> val agg = df1
>   .join(df2, df1("key") === df2("key"), "leftouter")
>   .groupBy(df1("key"))
>   .agg(
>     avg("col2").as("avg2"),
>     avg("col3").as("avg3"),
>     avg("col4").as("avg4"),
>     avg("col1").as("avg1"),
>     avg("col5").as("avg5"),
>     avg("col6").as("avg6")
>   )
> val head = agg.take(1)
> {code}
> This logs the following exception:
> {code:java}
> ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 467, Column 28: Redefinition of parameter "agg_expr_11"
> {code}
> I am not a spark expert but after investigation, I realized that the 
> generated 

[jira] [Commented] (SPARK-23986) CompileException when using too many avg aggregation after joining

2018-04-16 Thread Michel Davit (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16439915#comment-16439915
 ] 

Michel Davit commented on SPARK-23986:
--

Thanks [~mgaido]. I didn't have time to set up the environment to submit the 
pull request this weekend :)

> CompileException when using too many avg aggregation after joining
> --
>
> Key: SPARK-23986
> URL: https://issues.apache.org/jira/browse/SPARK-23986
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.0
> Reporter: Michel Davit
> Priority: Major
> Attachments: spark-generated.java
>
>
> Considering the following code:
> {code:java}
> val df1: DataFrame = sparkSession.sparkContext
>   .makeRDD(Seq((0, 1, 2, 3, 4, 5, 6)))
>   .toDF("key", "col1", "col2", "col3", "col4", "col5", "col6")
> val df2: DataFrame = sparkSession.sparkContext
>   .makeRDD(Seq((0, "val1", "val2")))
>   .toDF("key", "dummy1", "dummy2")
> val agg = df1
>   .join(df2, df1("key") === df2("key"), "leftouter")
>   .groupBy(df1("key"))
>   .agg(
>     avg("col2").as("avg2"),
>     avg("col3").as("avg3"),
>     avg("col4").as("avg4"),
>     avg("col1").as("avg1"),
>     avg("col5").as("avg5"),
>     avg("col6").as("avg6")
>   )
> val head = agg.take(1)
> {code}
> This logs the following exception:
> {code:java}
> ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 467, Column 28: Redefinition of parameter "agg_expr_11"
> {code}
> I am not a Spark expert, but after investigation I realized that the 
> generated {{doConsume}} method is responsible for the exception.
> Indeed, {{avg}} calls 
> {{org.apache.spark.sql.execution.CodegenSupport.constructDoConsumeFunction}} 
> several times: the 1st time with the 'avg' Expr and a second time for the 
> base aggregation Expr (count and sum).
> The problem comes from the generation of parameters in CodeGenerator:
> {code:java}
>   /**
>    * Returns a term name that is unique within this instance of a `CodegenContext`.
>    */
>   def freshName(name: String): String = synchronized {
>     val fullName = if (freshNamePrefix == "") {
>       name
>     } else {
>       s"${freshNamePrefix}_$name"
>     }
>     if (freshNameIds.contains(fullName)) {
>       val id = freshNameIds(fullName)
>       freshNameIds(fullName) = id + 1
>       s"$fullName$id"
>     } else {
>       freshNameIds += fullName -> 1
>       fullName
>     }
>   }
> {code}
> The {{freshNameIds}} map already contains {{agg_expr_[1..6]}} from the 1st call.
> The second call is made with {{agg_expr_[1..12]}} and generates the 
> following names:
> {{agg_expr_[11|21|31|41|51|61|11|12]}}. We then have a parameter name 
> conflict in the generated code: {{agg_expr_11}}. The first {{agg_expr_11}} 
> is {{agg_expr_1}} with id 1 appended; the second is the first-time name for 
> {{agg_expr_11}} itself.
> Appending the 'id' directly in s"$fullName$id" to generate a unique term name 
> is the source of the conflict. Maybe simply using an underscore can solve 
> this issue: s"${fullName}_$id"
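
The collision described in the quoted report can be reproduced outside Spark with a small stand-in. This is only a sketch: {{NameGenerator}}, {{FreshNameDemo}} and the {{separator}} parameter are hypothetical names introduced here, the logic is copied from the {{freshName}} snippet quoted above, and the two rounds of requests (expr_1..expr_6, then expr_1..expr_12) follow the report's description rather than Spark's real call sequence. The underscore variant mirrors the suggestion in the report and is not necessarily the change that was merged.

```scala
import scala.collection.mutable

// Simplified stand-in for the naming logic quoted in the report; not Spark's actual class.
final class NameGenerator(prefix: String, separator: String) {
  private val freshNameIds = mutable.HashMap.empty[String, Int]

  // Same branches as the quoted freshName; only the separator is configurable.
  def freshName(name: String): String = synchronized {
    val fullName = if (prefix == "") name else s"${prefix}_$name"
    if (freshNameIds.contains(fullName)) {
      val id = freshNameIds(fullName)
      freshNameIds(fullName) = id + 1
      s"$fullName$separator$id"
    } else {
      freshNameIds += fullName -> 1
      fullName
    }
  }
}

object FreshNameDemo {
  // First round requests expr_1..expr_6, second round requests expr_1..expr_12.
  // Returns the names handed out in the second round.
  def secondRoundNames(separator: String): Seq[String] = {
    val gen = new NameGenerator("agg", separator)
    (1 to 6).foreach(i => gen.freshName(s"expr_$i"))
    (1 to 12).map(i => gen.freshName(s"expr_$i"))
  }

  // Names that occur more than once in the given sequence.
  def duplicates(names: Seq[String]): Set[String] =
    names.groupBy(identity).collect { case (n, hits) if hits.size > 1 => n }.toSet

  def main(args: Array[String]): Unit = {
    println(duplicates(secondRoundNames("")))  // plain concatenation: Set(agg_expr_11)
    println(duplicates(secondRoundNames("_"))) // underscore separator: Set()
  }
}
```

With plain concatenation, "agg_expr_1" plus id 1 and the first-time name for "expr_11" are the same string. With the underscore the first becomes "agg_expr_1_1", so the two no longer clash in this scenario.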



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23986) CompileException when using too many avg aggregation after joining

2018-04-16 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16439417#comment-16439417
 ] 

Apache Spark commented on SPARK-23986:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/21080




[jira] [Commented] (SPARK-23986) CompileException when using too many avg aggregation after joining

2018-04-16 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16439398#comment-16439398
 ] 

Marco Gaido commented on SPARK-23986:
-

[~RustedBones] I was able to reproduce it. Yes, I agree with your analysis 
and with your proposed solution. I am submitting a patch. 
Thanks for reporting this.




[jira] [Commented] (SPARK-23986) CompileException when using too many avg aggregation after joining

2018-04-16 Thread Michel Davit (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16439318#comment-16439318
 ] 

Michel Davit commented on SPARK-23986:
--

I tested on Spark v2.3.0.
I attached the generated code: [^spark-generated.java]. Here is the faulty 
line (467):
{code:java}
private void agg_doConsume1(int agg_expr_01,
  double agg_expr_11, boolean agg_exprIsNull_1,
  long agg_expr_21, boolean agg_exprIsNull_2,
  double agg_expr_31, boolean agg_exprIsNull_3,
  long agg_expr_41, boolean agg_exprIsNull_4,
  double agg_expr_51, boolean agg_exprIsNull_5,
  long agg_expr_61, boolean agg_exprIsNull_6,
  double agg_expr_7, boolean agg_exprIsNull_7,
  long agg_expr_8, boolean agg_exprIsNull_8,
  double agg_expr_9, boolean agg_exprIsNull_9,
  long agg_expr_10, boolean agg_exprIsNull_10,
  double agg_expr_11, boolean agg_exprIsNull_11,
  long agg_expr_12, boolean agg_exprIsNull_12) throws java.io.IOException
{code}
One clarification: the code does not throw, it only logs an error. I also 
checked the computed average values, and everything seems correct.
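
For anyone inspecting a long generated signature like the one above, a small helper can list the repeated parameter names. This is a minimal sketch: {{DuplicateParams}}, {{duplicateParams}} and {{line467}} are hypothetical names introduced here, and the parsing assumes a simple "type name" parameter list without generics or annotations.

```scala
object DuplicateParams {
  // Line 467 from the attached generated code, joined into one string.
  val line467: String =
    "private void agg_doConsume1(int agg_expr_01, double agg_expr_11, boolean " +
    "agg_exprIsNull_1, long agg_expr_21, boolean agg_exprIsNull_2, double " +
    "agg_expr_31, boolean agg_exprIsNull_3, long agg_expr_41, boolean " +
    "agg_exprIsNull_4, double agg_expr_51, boolean agg_exprIsNull_5, long " +
    "agg_expr_61, boolean agg_exprIsNull_6, double agg_expr_7, boolean " +
    "agg_exprIsNull_7, long agg_expr_8, boolean agg_exprIsNull_8, double agg_expr_9, " +
    "boolean agg_exprIsNull_9, long agg_expr_10, boolean agg_exprIsNull_10, double " +
    "agg_expr_11, boolean agg_exprIsNull_11, long agg_expr_12, boolean " +
    "agg_exprIsNull_12) throws java.io.IOException"

  // Returns the parameter names that appear more than once, in first-seen order.
  def duplicateParams(signature: String): Seq[String] = {
    val open = signature.indexOf('(')
    val close = signature.lastIndexOf(')')
    require(open >= 0 && close > open, "expected a parenthesised parameter list")
    val names = signature.substring(open + 1, close)
      .split(",")
      .map(_.trim)
      .filter(_.nonEmpty)
      .map(_.split("\\s+").last) // "double agg_expr_11" -> "agg_expr_11"
    names.toSeq.distinct.filter(n => names.count(_ == n) > 1)
  }

  def main(args: Array[String]): Unit =
    println(duplicateParams(line467).mkString(", ")) // prints agg_expr_11
}
```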




[jira] [Commented] (SPARK-23986) CompileException when using too many avg aggregation after joining

2018-04-15 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438885#comment-16438885
 ] 

Kazuaki Ishizaki commented on SPARK-23986:
--

I also checked it with branch-2.3; it works well without any exception.




[jira] [Commented] (SPARK-23986) CompileException when using too many avg aggregation after joining

2018-04-15 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438818#comment-16438818
 ] 

Kazuaki Ishizaki commented on SPARK-23986:
--

Thank you for reporting this issue with such a deep dive.

When I ran this repro with the latest master, it worked well without an 
exception. When I checked the generated code, I could not find the variables 
{{agg_expr_[21|31|41|51|61]}}. 
Would it be possible to attach the log file of the generated code?

