[jira] [Commented] (SPARK-40362) Bug in Canonicalization of expressions like Add & Multiply i.e Commutative Operators

2022-09-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17602741#comment-17602741
 ] 

Apache Spark commented on SPARK-40362:
--

User 'peter-toth' has created a pull request for this issue:
https://github.com/apache/spark/pull/37851

> Bug in Canonicalization of expressions like Add & Multiply i.e Commutative 
> Operators
> 
>
> Key: SPARK-40362
> URL: https://issues.apache.org/jira/browse/SPARK-40362
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Asif
>Priority: Major
>  Labels: spark-sql
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> In the canonicalization code, which now runs in two stages, canonicalization 
> involving commutative operators is broken if they are subexpressions of 
> certain types of expressions that override precanonicalize, for example 
> BinaryComparison.
> Consider the following expression: a + b > 10
>        GT
>       /  \
>   a + b   10
> In precanonicalize, the BinaryComparison operator first precanonicalizes its 
> children and then may swap its operands based on a left/right hashCode 
> comparison. Say Add(a, b).hashCode is greater than Literal(10).hashCode; as a 
> result the GT is converted to an LT.
> But if the same tree is created as
>        GT
>       /  \
>   b + a   10
> then, because the hashCode of Add(b, a) is not the same as that of Add(a, b), 
> it is possible that Add(b, a).hashCode is less than Literal(10).hashCode, in 
> which case the GT remains as is.
> Thus two semantically identical trees canonicalize differently, one ending up 
> with GT and the other with LT.
>  
> The problem occurs because canonicalization normalizes a commutative 
> expression to a form with a consistent hashCode, whereas precanonicalize does 
> not: the hashCodes of a commutative expression before and after 
> canonicalization differ.
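>  
> A minimal sketch (plain Scala, not the actual Catalyst classes; the names 
> are illustrative) of why operand order changes the hashCode: a Scala case 
> class derives its hashCode from its fields in declaration order, so swapping 
> the children of a commutative node generally changes the hash.
> {noformat}
> // Stand-in for a commutative Catalyst node; the derived hashCode folds the
> // fields in declaration order, so it is order-sensitive.
> case class Add(left: Any, right: Any)
> 
> object HashDemo extends App {
>   val ab = Add("a", "b")
>   val ba = Add("b", "a")
>   // Almost always prints two different values, so a rule such as
>   // "swap operands when left.hashCode > right.hashCode" can fire for one
>   // ordering of the children and not for the other.
>   println(ab.hashCode)
>   println(ba.hashCode)
> }
> {noformat}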
>  
>  
> The test 
> {noformat}
> test("bug X") {
>   val tr1 = LocalRelation('c.int, 'b.string, 'a.int)
>   val y = tr1.where('a.attr + 'c.attr > 10).analyze
>   val fullCond = y.asInstanceOf[Filter].condition.clone()
>   // the exact match arm was lost to Jira markup; it extracts the Add child
>   val addExpr = (fullCond match {
>     case GreaterThan(add: Add, _) => add
>   }).clone().asInstanceOf[Add]
>   val canonicalizedFullCond = fullCond.canonicalized
>   // swap the operands of Add
>   val newAddExpr = Add(addExpr.right, addExpr.left)
>   // build a new condition which is the same as the previous one, but with
>   // the operands of Add reversed
>   val builtCondnCanonicalized = GreaterThan(newAddExpr, Literal(10)).canonicalized
>   assertEquals(canonicalizedFullCond, builtCondnCanonicalized)
> }
> {noformat}
> This test fails.
> The fix I propose is that for commutative expressions, precanonicalize 
> should be overridden so that Canonicalize.reorderCommutativeOperators is 
> invoked on the expression there, instead of in canonicalize. Effectively, 
> for commutative operators (Add, Or, Multiply, And, etc.) canonicalize and 
> precanonicalize should be the same.
> PR:
> [https://github.com/apache/spark/pull/37824]
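>  
> A hedged sketch of the shape of that fix (the names follow the Catalyst 
> internals mentioned above; the actual PR may differ in its details):
> {noformat}
> // Illustrative only: reorder commutative children already at the
> // precanonicalize stage, so that a parent such as BinaryComparison compares
> // the hashCode of an already-normalized child.
> trait CommutativeExpression extends Expression {
>   override lazy val preCanonicalized: Expression =
>     Canonicalize.reorderCommutativeOperators(
>       withNewChildren(children.map(_.preCanonicalized)))
> 
>   // the final stage then adds nothing further for commutative nodes
>   override lazy val canonicalized: Expression = preCanonicalized
> }
> {noformat}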
>  
>  
> I am also trying a better fix, where the idea is that for commutative 
> expressions the murmur hashCode is calculated using unorderedHash, so that 
> it is order-independent (i.e. symmetric).
> The above approach works fine, but in the case of Least and Greatest the 
> Product's element is a Seq (all children sit in a single Seq field), and 
> that messes with the consistency of the hashCode.
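>  
> A minimal sketch of that idea (illustrative only, not the actual Spark 
> implementation), using scala.util.hashing.MurmurHash3.unorderedHash:
> {noformat}
> import scala.util.hashing.MurmurHash3
> 
> // Order-independent hashCode for a binary commutative node:
> // Add(a, b) and Add(b, a) now hash the same.
> case class Add(left: Any, right: Any) {
>   override def hashCode: Int =
>     MurmurHash3.unorderedHash(Seq(left, right), MurmurHash3.symmetricSeed)
> }
> 
> // The Least/Greatest problem: the node has ONE field, a Seq, so
> // unorderedHash sees a single element and the children's order still leaks
> // into the hash through the Seq's own ordered hashCode.
> case class Least(children: Seq[Any])
> {noformat}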








[jira] [Commented] (SPARK-40142) Make pyspark.sql.functions examples self-contained

2022-09-10 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17602709#comment-17602709
 ] 

Apache Spark commented on SPARK-40142:
--

User 'khalidmammadov' has created a pull request for this issue:
https://github.com/apache/spark/pull/37850

> Make pyspark.sql.functions examples self-contained
> --
>
> Key: SPARK-40142
> URL: https://issues.apache.org/jira/browse/SPARK-40142
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.4.0
>
>







[jira] [Created] (SPARK-40402) StructType level metadata

2022-09-10 Thread Igor Suhorukov (Jira)
Igor Suhorukov created SPARK-40402:
--

 Summary: StructType level metadata 
 Key: SPARK-40402
 URL: https://issues.apache.org/jira/browse/SPARK-40402
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.3.1
Reporter: Igor Suhorukov


Please support StructType-level metadata like StructField already has. It 
would allow data engineers to document the data schemas used in pipelines and 
to store this metadata in a catalog. This feature would also simplify 
propagating table-level comments from an RDBMS to Spark.
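
A sketch of today's field-level metadata API and, commented out, the 
hypothetical StructType-level variant this issue asks for (the latter does 
not exist as of Spark 3.3):
{noformat}
import org.apache.spark.sql.types._

// Today: metadata can only be attached per field.
val userId = StructField(
  "user_id",
  IntegerType,
  nullable = false,
  new MetadataBuilder()
    .putString("comment", "primary key from the source RDBMS")
    .build())

val schema = StructType(Seq(userId))

// Hypothetical API requested by this issue (illustrative signature only):
// val schema = StructType(
//   Seq(userId),
//   metadata = new MetadataBuilder()
//     .putString("comment", "mirrors the RDBMS table `users`")
//     .build())
{noformat}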






[jira] [Commented] (SPARK-39696) Uncaught exception in thread executor-heartbeater java.util.ConcurrentModificationException: mutation occurred during iteration

2022-09-10 Thread Paul Soule (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17602639#comment-17602639
 ] 

Paul Soule commented on SPARK-39696:


It happens randomly: run a job and everything is fine, then run it again and 
the exception may, or may not, appear.
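
A minimal, standalone sketch (not Spark code; it assumes Scala 2.13.7+, where 
ArrayBuffer iterators are mutation-checked) of the failure mode visible in 
the stack trace below: an iterator over a mutable collection is advanced 
after the collection has been mutated, and the MutationTracker throws.
{noformat}
import scala.collection.mutable.ArrayBuffer

object Repro extends App {
  val buf = ArrayBuffer(1, 2, 3)
  // in Scala 2.13 this iterator is a checked one
  // (CheckedIndexedSeqView$CheckedIterator in the stack trace)
  val it = buf.iterator
  buf += 4    // mutate while the iterator is live
  it.hasNext  // throws java.util.ConcurrentModificationException:
              //   "mutation occurred during iteration"
}
{noformat}
In the trace below this appears to correspond to the executor-heartbeater 
thread iterating the task metric accumulators in TaskMetrics.accumulators 
while a task thread mutates them concurrently.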

> Uncaught exception in thread executor-heartbeater 
> java.util.ConcurrentModificationException: mutation occurred during iteration
> ---
>
> Key: SPARK-39696
> URL: https://issues.apache.org/jira/browse/SPARK-39696
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
> Environment: Spark 3.3.0 (spark-3.3.0-bin-hadoop3-scala2.13 
> distribution)
> Scala 2.13.8 / OpenJDK 17.0.3 application compilation
> Alpine Linux 3.14.3
> JVM OpenJDK 64-Bit Server VM Temurin-17.0.1+12
>Reporter: Stephen Mcmullan
>Priority: Major
>
> {noformat}
> 2022-06-21 18:17:49.289Z ERROR [executor-heartbeater] 
> org.apache.spark.util.Utils - Uncaught exception in thread 
> executor-heartbeater
> java.util.ConcurrentModificationException: mutation occurred during iteration
> at 
> scala.collection.mutable.MutationTracker$.checkMutations(MutationTracker.scala:43)
>  ~[scala-library-2.13.8.jar:?]
> at 
> scala.collection.mutable.CheckedIndexedSeqView$CheckedIterator.hasNext(CheckedIndexedSeqView.scala:47)
>  ~[scala-library-2.13.8.jar:?]
> at 
> scala.collection.IterableOnceOps.copyToArray(IterableOnce.scala:873) 
> ~[scala-library-2.13.8.jar:?]
> at 
> scala.collection.IterableOnceOps.copyToArray$(IterableOnce.scala:869) 
> ~[scala-library-2.13.8.jar:?]
> at scala.collection.AbstractIterator.copyToArray(Iterator.scala:1293) 
> ~[scala-library-2.13.8.jar:?]
> at 
> scala.collection.IterableOnceOps.copyToArray(IterableOnce.scala:852) 
> ~[scala-library-2.13.8.jar:?]
> at 
> scala.collection.IterableOnceOps.copyToArray$(IterableOnce.scala:852) 
> ~[scala-library-2.13.8.jar:?]
> at scala.collection.AbstractIterator.copyToArray(Iterator.scala:1293) 
> ~[scala-library-2.13.8.jar:?]
> at 
> scala.collection.immutable.VectorStatics$.append1IfSpace(Vector.scala:1959) 
> ~[scala-library-2.13.8.jar:?]
> at scala.collection.immutable.Vector1.appendedAll0(Vector.scala:425) 
> ~[scala-library-2.13.8.jar:?]
> at scala.collection.immutable.Vector.appendedAll(Vector.scala:203) 
> ~[scala-library-2.13.8.jar:?]
> at scala.collection.immutable.Vector.appendedAll(Vector.scala:113) 
> ~[scala-library-2.13.8.jar:?]
> at scala.collection.SeqOps.concat(Seq.scala:187) 
> ~[scala-library-2.13.8.jar:?]
> at scala.collection.SeqOps.concat$(Seq.scala:187) 
> ~[scala-library-2.13.8.jar:?]
> at scala.collection.AbstractSeq.concat(Seq.scala:1161) 
> ~[scala-library-2.13.8.jar:?]
> at scala.collection.IterableOps.$plus$plus(Iterable.scala:726) 
> ~[scala-library-2.13.8.jar:?]
> at scala.collection.IterableOps.$plus$plus$(Iterable.scala:726) 
> ~[scala-library-2.13.8.jar:?]
> at scala.collection.AbstractIterable.$plus$plus(Iterable.scala:926) 
> ~[scala-library-2.13.8.jar:?]
> at 
> org.apache.spark.executor.TaskMetrics.accumulators(TaskMetrics.scala:261) 
> ~[spark-core_2.13-3.3.0.jar:3.3.0]
> at 
> org.apache.spark.executor.Executor.$anonfun$reportHeartBeat$1(Executor.scala:1042)
>  ~[spark-core_2.13-3.3.0.jar:3.3.0]
> at scala.collection.IterableOnceOps.foreach(IterableOnce.scala:563) 
> ~[scala-library-2.13.8.jar:?]
> at scala.collection.IterableOnceOps.foreach$(IterableOnce.scala:561) 
> ~[scala-library-2.13.8.jar:?]
> at scala.collection.AbstractIterable.foreach(Iterable.scala:926) 
> ~[scala-library-2.13.8.jar:?]
> at 
> org.apache.spark.executor.Executor.reportHeartBeat(Executor.scala:1036) 
> ~[spark-core_2.13-3.3.0.jar:3.3.0]
> at 
> org.apache.spark.executor.Executor.$anonfun$heartbeater$1(Executor.scala:238) 
> ~[spark-core_2.13-3.3.0.jar:3.3.0]
> at 
> scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18) 
> ~[scala-library-2.13.8.jar:?]
> at 
> org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2066) 
> ~[spark-core_2.13-3.3.0.jar:3.3.0]
> at org.apache.spark.Heartbeater$$anon$1.run(Heartbeater.scala:46) 
> ~[spark-core_2.13-3.3.0.jar:3.3.0]
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) ~[?:?]
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305) 
> ~[?:?]
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
>  ~[?:?]
> at 
>