[jira] [Commented] (SPARK-34674) Spark app on k8s doesn't terminate without call to sparkContext.stop() method

2021-05-18 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347334#comment-17347334
 ] 

Wenchen Fan commented on SPARK-34674:
-

This has been reverted. [~dongjoon] is there any context?

> Spark app on k8s doesn't terminate without call to sparkContext.stop() method
> -
>
> Key: SPARK-34674
> URL: https://issues.apache.org/jira/browse/SPARK-34674
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.1.1
>Reporter: Sergey Kotlov
>Assignee: Sergey Kotlov
>Priority: Major
> Fix For: 3.1.2, 3.2.0
>
>
> Hello!
>  I have run into a problem: if I don't call sparkContext.stop() explicitly, 
> the Spark driver process doesn't terminate even after its main method has 
> completed. This behaviour differs from Spark on YARN, where manually stopping 
> the SparkContext is not required.
>  It looks like the problem is caused by non-daemon threads, which prevent the 
> driver JVM process from terminating.
>  At least two non-daemon threads remain if I don't call sparkContext.stop():
> {code:java}
> Thread[OkHttp kubernetes.default.svc,5,main]
> Thread[OkHttp kubernetes.default.svc Writer,5,main]
> {code}
> Could you please tell me whether it is possible to solve this problem?
> The Docker image from the official spark-3.1.1 hadoop3.2 release is used.
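A minimal sketch of the explicit-stop workaround described above, assuming a simple batch application (the object name and job logic are illustrative only, not from this issue):

{code:java}
// Illustrative sketch: stop the SparkContext in a finally block so its client
// threads (e.g. the OkHttp Kubernetes threads above) cannot keep the driver
// JVM alive after the main logic finishes.
import org.apache.spark.sql.SparkSession

object ExampleK8sApp {  // hypothetical name
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ExampleK8sApp").getOrCreate()
    try {
      spark.range(10).count()  // placeholder for the real job
    } finally {
      spark.stop()  // stops the underlying SparkContext and lets the JVM exit
    }
  }
}
{code}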






[jira] [Issue Comment Deleted] (SPARK-34674) Spark app on k8s doesn't terminate without call to sparkContext.stop() method

2021-05-18 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-34674:

Comment: was deleted

(was: This has been reverted. [~dongjoon] is there any context?)

> Spark app on k8s doesn't terminate without call to sparkContext.stop() method
> -
>
> Key: SPARK-34674
> URL: https://issues.apache.org/jira/browse/SPARK-34674
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.1.1
>Reporter: Sergey Kotlov
>Assignee: Sergey Kotlov
>Priority: Major
> Fix For: 3.1.2, 3.2.0
>
>
> Hello!
>  I have run into a problem: if I don't call sparkContext.stop() explicitly, 
> the Spark driver process doesn't terminate even after its main method has 
> completed. This behaviour differs from Spark on YARN, where manually stopping 
> the SparkContext is not required.
>  It looks like the problem is caused by non-daemon threads, which prevent the 
> driver JVM process from terminating.
>  At least two non-daemon threads remain if I don't call sparkContext.stop():
> {code:java}
> Thread[OkHttp kubernetes.default.svc,5,main]
> Thread[OkHttp kubernetes.default.svc Writer,5,main]
> {code}
> Could you please tell me whether it is possible to solve this problem?
> The Docker image from the official spark-3.1.1 hadoop3.2 release is used.






[jira] [Resolved] (SPARK-35421) Remove redundant ProjectExec from streaming queries with V2Relation

2021-05-18 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-35421.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32570
[https://github.com/apache/spark/pull/32570]

> Remove redundant ProjectExec from streaming queries with V2Relation
> ---
>
> Key: SPARK-35421
> URL: https://issues.apache.org/jira/browse/SPARK-35421
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.2.0
>
>
> Streaming queries with a V2Relation can have a redundant ProjectExec in their 
> physical plans.
> You can easily reproduce this with the following code.
> {code}
> import org.apache.spark.sql.streaming.Trigger
> val query = spark.
>   readStream.
>   format("rate").
>   option("rowsPerSecond", 1000).
>   option("rampUpTime", "10s").
>   load().
>   selectExpr("timestamp", "100",  "value").  
>   writeStream.
>   format("console").
>   trigger(Trigger.ProcessingTime("5 seconds")).
>   // trigger(Trigger.Continuous("5 seconds")). // You can reproduce this with
>   // continuous processing too.
>   outputMode("append").
>   start()
> {code}
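Once the query above is running, the physical plan can be inspected for the redundant ProjectExec, for example with the standard StreamingQuery.explain() call (added here for illustration):

{code}
// Print the streaming query's logical and physical plans to the console; a
// redundant ProjectExec, if present, shows up in the physical plan.
query.explain(extended = true)
{code}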






[jira] [Updated] (SPARK-35439) Children subexpr should come first than parent subexpr in subexpression elimination

2021-05-18 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh updated SPARK-35439:

Description: 
EquivalentExpressions maintains a map of equivalent expressions. It is 
currently a HashMap, so the insertion order is not guaranteed to be preserved 
later. Subexpression elimination relies on retrieving subexpressions from the 
map. If there are child-parent relationships among the subexpressions, we want 
the child expressions to come before the parent expressions, so we can replace 
child expressions inside parent expressions with the subexpression evaluation.

For example, suppose we have two different expressions: add = Add(Literal(1), 
Literal(2)) and the parent expression Add(Literal(3), add).

Case 1: the child subexpr comes first. Replacing HashMap with LinkedHashMap can 
deal with this case.

addExprTree(add)
addExprTree(Add(Literal(3), add))
addExprTree(Add(Literal(3), add))

Case 2: the parent subexpr comes first. For this case, we need to sort the 
equivalent expressions.

addExprTree(Add(Literal(3), add))  => We add `Add(Literal(3), add)` into the 
map first, then add `add` into the map
addExprTree(add)
addExprTree(Add(Literal(3), add))


  was:
EquivalentExpressions maintains a map of equivalent expressions. It is 
currently a HashMap, so the insertion order is not guaranteed to be preserved 
later. Subexpression elimination relies on retrieving subexpressions from the 
map. If there are child-parent relationships among the subexpressions, we want 
the child expressions to come before the parent expressions, so we can replace 
child expressions inside parent expressions with the subexpression evaluation.

Although we add expressions recursively into the map with a depth-first 
approach, when we retrieve the map values, it is not guaranteed that the order 
is preserved. We should use LinkedHashMap for this usage.
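
A small Scala sketch of the ordering concern above, combining both points (insertion order and child-before-parent); the expression classes and the sorting key are illustrative only and do not use Spark's actual EquivalentExpressions API:

{code:java}
// Illustrative sketch: a toy expression tree plus a child-before-parent
// ordering of collected subexpressions. Names are hypothetical, not Spark's.
import scala.collection.mutable

sealed trait Expr {
  def children: Seq[Expr]
  def height: Int = if (children.isEmpty) 1 else children.map(_.height).max + 1
}
case class Literal(v: Int) extends Expr { val children: Seq[Expr] = Seq.empty }
case class Add(l: Expr, r: Expr) extends Expr { val children: Seq[Expr] = Seq(l, r) }

object SubexprOrdering {
  def main(args: Array[String]): Unit = {
    val add    = Add(Literal(1), Literal(2))
    val parent = Add(Literal(3), add)

    // Case 1: a LinkedHashMap preserves insertion order when iterating later;
    // a plain HashMap gives no such guarantee.
    val seen = mutable.LinkedHashMap[Expr, Int]()
    Seq(parent, add, parent).foreach(e => seen(e) = seen.getOrElse(e, 0) + 1)

    // Case 2: the parent was inserted first, so order children before parents
    // (here simply by tree height) before generating subexpression code.
    val childFirst = seen.keys.toSeq.sortBy(_.height)
    childFirst.foreach(println)  // prints `add` before `parent`
  }
}
{code}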


> Children subexpr should come first than parent subexpr in subexpression 
> elimination
> ---
>
> Key: SPARK-35439
> URL: https://issues.apache.org/jira/browse/SPARK-35439
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.1, 3.2.0
>Reporter: L. C. Hsieh
>Assignee: Apache Spark
>Priority: Minor
>
> EquivalentExpressions maintains a map of equivalent expressions. It is 
> currently a HashMap, so the insertion order is not guaranteed to be preserved 
> later. Subexpression elimination relies on retrieving subexpressions from the 
> map. If there are child-parent relationships among the subexpressions, we 
> want the child expressions to come before the parent expressions, so we can 
> replace child expressions inside parent expressions with the subexpression 
> evaluation.
> For example, suppose we have two different expressions: add = Add(Literal(1), 
> Literal(2)) and the parent expression Add(Literal(3), add).
> Case 1: the child subexpr comes first. Replacing HashMap with LinkedHashMap 
> can deal with this case.
> addExprTree(add)
> addExprTree(Add(Literal(3), add))
> addExprTree(Add(Literal(3), add))
> Case 2: the parent subexpr comes first. For this case, we need to sort the 
> equivalent expressions.
> addExprTree(Add(Literal(3), add))  => We add `Add(Literal(3), add)` into the 
> map first, then add `add` into the map
> addExprTree(add)
> addExprTree(Add(Literal(3), add))






[jira] [Updated] (SPARK-35439) Children subexpr should come first than parent subexpr in subexpression elimination

2021-05-18 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh updated SPARK-35439:

Priority: Major  (was: Minor)

> Children subexpr should come first than parent subexpr in subexpression 
> elimination
> ---
>
> Key: SPARK-35439
> URL: https://issues.apache.org/jira/browse/SPARK-35439
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.1, 3.2.0
>Reporter: L. C. Hsieh
>Assignee: Apache Spark
>Priority: Major
>
> EquivalentExpressions maintains a map of equivalent expressions. It is 
> currently a HashMap, so the insertion order is not guaranteed to be preserved 
> later. Subexpression elimination relies on retrieving subexpressions from the 
> map. If there are child-parent relationships among the subexpressions, we 
> want the child expressions to come before the parent expressions, so we can 
> replace child expressions inside parent expressions with the subexpression 
> evaluation.
> For example, suppose we have two different expressions: add = Add(Literal(1), 
> Literal(2)) and the parent expression Add(Literal(3), add).
> Case 1: the child subexpr comes first. Replacing HashMap with LinkedHashMap 
> can deal with this case.
> addExprTree(add)
> addExprTree(Add(Literal(3), add))
> addExprTree(Add(Literal(3), add))
> Case 2: the parent subexpr comes first. For this case, we need to sort the 
> equivalent expressions.
> addExprTree(Add(Literal(3), add))  => We add `Add(Literal(3), add)` into the 
> map first, then add `add` into the map
> addExprTree(add)
> addExprTree(Add(Literal(3), add))






[jira] [Updated] (SPARK-35439) Children subexpr should come first than parent subexpr in subexpression elimination

2021-05-18 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh updated SPARK-35439:

Affects Version/s: (was: 3.1.1)
   (was: 3.0.2)

> Children subexpr should come first than parent subexpr in subexpression 
> elimination
> ---
>
> Key: SPARK-35439
> URL: https://issues.apache.org/jira/browse/SPARK-35439
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: Apache Spark
>Priority: Major
>
> EquivalentExpressions maintains a map of equivalent expressions. It is 
> currently a HashMap, so the insertion order is not guaranteed to be preserved 
> later. Subexpression elimination relies on retrieving subexpressions from the 
> map. If there are child-parent relationships among the subexpressions, we 
> want the child expressions to come before the parent expressions, so we can 
> replace child expressions inside parent expressions with the subexpression 
> evaluation.
> For example, suppose we have two different expressions: add = Add(Literal(1), 
> Literal(2)) and the parent expression Add(Literal(3), add).
> Case 1: the child subexpr comes first. Replacing HashMap with LinkedHashMap 
> can deal with this case.
> addExprTree(add)
> addExprTree(Add(Literal(3), add))
> addExprTree(Add(Literal(3), add))
> Case 2: the parent subexpr comes first. For this case, we need to sort the 
> equivalent expressions.
> addExprTree(Add(Literal(3), add))  => We add `Add(Literal(3), add)` into the 
> map first, then add `add` into the map
> addExprTree(add)
> addExprTree(Add(Literal(3), add))






[jira] [Updated] (SPARK-35439) Children subexpr should come first than parent subexpr in subexpression elimination

2021-05-18 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh updated SPARK-35439:

Summary: Children subexpr should come first than parent subexpr in 
subexpression elimination  (was: Use LinkedHashMap as the map of equivalent 
expressions to preserve insertion order)

> Children subexpr should come first than parent subexpr in subexpression 
> elimination
> ---
>
> Key: SPARK-35439
> URL: https://issues.apache.org/jira/browse/SPARK-35439
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.1, 3.2.0
>Reporter: L. C. Hsieh
>Assignee: Apache Spark
>Priority: Minor
>
> EquivalentExpressions maintains a map of equivalent expressions. It is 
> currently a HashMap, so the insertion order is not guaranteed to be preserved 
> later. Subexpression elimination relies on retrieving subexpressions from the 
> map. If there are child-parent relationships among the subexpressions, we 
> want the child expressions to come before the parent expressions, so we can 
> replace child expressions inside parent expressions with the subexpression 
> evaluation.
> Although we add expressions recursively into the map with a depth-first 
> approach, when we retrieve the map values, it is not guaranteed that the 
> order is preserved. We should use LinkedHashMap for this usage.






[jira] [Commented] (SPARK-35129) Construct year-month interval column from integral fields

2021-05-18 Thread Max Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347306#comment-17347306
 ] 

Max Gekk commented on SPARK-35129:
--

[~mrpowers] Are you working on this?

> Construct year-month interval column from integral fields
> -
>
> Key: SPARK-35129
> URL: https://issues.apache.org/jira/browse/SPARK-35129
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Priority: Major
>
> Create a new function similar to make_interval() (or extend the 
> make_interval() function) which can construct YearMonthIntervalType values 
> from the year and month fields.
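
For illustration, the existing make_interval() call shape is shown below from a spark-shell session; the make_ym_interval name in the commented line is only a guess at what the requested function might look like, not something this ticket has settled on:

{code:java}
// Existing function: builds an interval value from year, month, week, day,
// hour, minute and second fields.
spark.sql("SELECT make_interval(1, 2, 0, 0, 0, 0, 0)").show()

// Requested here (name is hypothetical): a YearMonthIntervalType value built
// from just the year and month fields.
// spark.sql("SELECT make_ym_interval(1, 2)").show()
{code}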






[jira] [Comment Edited] (SPARK-35291) NullPointerException at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.CreateExternalRow_0$(Unknown Source)

2021-05-18 Thread Umar Asir (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347302#comment-17347302
 ] 

Umar Asir edited comment on SPARK-35291 at 5/19/21, 5:19 AM:
-

It's an issue with Delta, but the NullPointerException is thrown from 
"org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.CreateExternalRow_0$(Unknown
 Source)"


was (Author: umerasir):
It's an issue with Delta.

> NullPointerException at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.CreateExternalRow_0$(Unknown
>  Source)
> 
>
> Key: SPARK-35291
> URL: https://issues.apache.org/jira/browse/SPARK-35291
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.3.1, 3.0.2
>Reporter: Umar Asir
>Priority: Major
> Attachments: NotNullIssue.scala, cdwqasourceupdate.7z, 
> cdwqatgtupdate.7z, pom.xml, run1.log
>
>
> We are trying to merge data using DeltaTable's merge API. Inserting a null 
> value into a not-null column results in a NullPointerException instead of a 
> constraint violation error.
> *Code :*  
> {code:java}
> package com.uasir.cdw.delta
> import org.apache.spark.sql._
> import io.delta.tables._
> object NotNullIssue {
>   def main(args: Array[String]): Unit = {
> System.setProperty("hadoop.home.dir", "C:\\Tools\\hadoop\\")
> val spark = SparkSession
> .builder()
> .appName("DFMergeTest")
> .master("local[*]")
> .config("spark.sql.extensions", 
> "io.delta.sql.DeltaSparkSessionExtension")
> .config("spark.sql.catalog.spark_catalog", 
> "org.apache.spark.sql.delta.catalog.DeltaCatalog")
> .config("spark.testing.memory", "571859200")
> .getOrCreate()
> println("Reading from the source table")
> val df = spark.read.format("delta").load("C:\\Input\\cdwqasourceupdate") 
> // PFA cdwqasourceupdate.7z
> df.show()
> println("Reading from the target table")
> val tgtDf = spark.read.format("delta").load("C:\\Input\\cdwqatgtupdate") 
> // PFA cdwqatgtupdate.7z
> tgtDf.show()
> val sourceTable = "source"
> val targetDataTable = "target"
> val colMap= scala.collection.mutable.Map[String,String]()
> val sourceFields = df.schema.fieldNames
> val targetFields = tgtDf.schema.fieldNames
> for ( i <- 0 until targetFields.length) {
>   colMap(targetFields(i)) = sourceTable + "." + sourceFields(i)
> }
> /* colMap will be generated as :
> TGTID -> c1_ID
> TGT_NAME -> c2_NAME
> TGT_ADDRESS -> c3_address
> TGT_DOB -> c4_dob
> */
> println("update")
> DeltaTable.forPath(spark, "C:\\Input\\cdwqatgtupdate")
> .as(targetDataTable)
> .merge(
>   df.as(sourceTable),
>   targetDataTable + "." + "TGTID" + " = " + sourceTable + "." + 
> "c1_ID" )
> .whenMatched()
> .updateExpr(colMap)
> .execute()
> println("Reading from target the table after operation")
> tgtDf.show()
>   }
> }
> {code}
>  *Error :*
> {code:java}
> Caused by: java.lang.RuntimeException: Error while decoding: 
> java.lang.NullPointerException
> createexternalrow(input[0, int, 
> false], input[1, string, false].toString, input[2, string, true].toString, 
> staticinvoke(class org.apache.spark.sql.catalyst.util.DateTimeUtils$, 
> ObjectType(class java.sql.Date), toJavaDate, input[3, date, true], true, 
> false), StructField(TGTID,IntegerType,false), 
> StructField(TGT_NAME,StringType,false), 
> StructField(TGT_ADDRESS,StringType,true), StructField(TGT_DOB,DateType,true)) 
> at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Deserializer.apply(ExpressionEncoder.scala:188)
>  at 
> org.apache.spark.sql.delta.commands.MergeIntoCommand$JoinedRowProcessor.$anonfun$processPartition$9(MergeIntoCommand.scala:565)
>  at scala.collection.Iterator$$anon$10.next(Iterator.scala:459) at 
> scala.collection.Iterator$$anon$10.next(Iterator.scala:459) at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown
>  Source) at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
>  at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:265)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWrite

[jira] [Commented] (SPARK-35291) NullPointerException at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.CreateExternalRow_0$(Unknown Source)

2021-05-18 Thread Umar Asir (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347302#comment-17347302
 ] 

Umar Asir commented on SPARK-35291:
---

It's an issue with Delta.

> NullPointerException at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.CreateExternalRow_0$(Unknown
>  Source)
> 
>
> Key: SPARK-35291
> URL: https://issues.apache.org/jira/browse/SPARK-35291
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.3.1, 3.0.2
>Reporter: Umar Asir
>Priority: Major
> Attachments: NotNullIssue.scala, cdwqasourceupdate.7z, 
> cdwqatgtupdate.7z, pom.xml, run1.log
>
>
> We are trying to merge data using DeltaTable's merge API. Inserting a null 
> value into a not-null column results in a NullPointerException instead of a 
> constraint violation error.
> *Code :*  
> {code:java}
> package com.uasir.cdw.delta
> import org.apache.spark.sql._
> import io.delta.tables._
> object NotNullIssue {
>   def main(args: Array[String]): Unit = {
> System.setProperty("hadoop.home.dir", "C:\\Tools\\hadoop\\")
> val spark = SparkSession
> .builder()
> .appName("DFMergeTest")
> .master("local[*]")
> .config("spark.sql.extensions", 
> "io.delta.sql.DeltaSparkSessionExtension")
> .config("spark.sql.catalog.spark_catalog", 
> "org.apache.spark.sql.delta.catalog.DeltaCatalog")
> .config("spark.testing.memory", "571859200")
> .getOrCreate()
> println("Reading from the source table")
> val df = spark.read.format("delta").load("C:\\Input\\cdwqasourceupdate") 
> // PFA cdwqasourceupdate.7z
> df.show()
> println("Reading from the target table")
> val tgtDf = spark.read.format("delta").load("C:\\Input\\cdwqatgtupdate") 
> // PFA cdwqatgtupdate.7z
> tgtDf.show()
> val sourceTable = "source"
> val targetDataTable = "target"
> val colMap= scala.collection.mutable.Map[String,String]()
> val sourceFields = df.schema.fieldNames
> val targetFields = tgtDf.schema.fieldNames
> for ( i <- 0 until targetFields.length) {
>   colMap(targetFields(i)) = sourceTable + "." + sourceFields(i)
> }
> /* colMap will be generated as :
> TGTID -> c1_ID
> TGT_NAME -> c2_NAME
> TGT_ADDRESS -> c3_address
> TGT_DOB -> c4_dob
> */
> println("update")
> DeltaTable.forPath(spark, "C:\\Input\\cdwqatgtupdate")
> .as(targetDataTable)
> .merge(
>   df.as(sourceTable),
>   targetDataTable + "." + "TGTID" + " = " + sourceTable + "." + 
> "c1_ID" )
> .whenMatched()
> .updateExpr(colMap)
> .execute()
> println("Reading from target the table after operation")
> tgtDf.show()
>   }
> }
> {code}
>  *Error :*
> {code:java}
> Caused by: java.lang.RuntimeException: Error while decoding: 
> java.lang.NullPointerException
> createexternalrow(input[0, int, 
> false], input[1, string, false].toString, input[2, string, true].toString, 
> staticinvoke(class org.apache.spark.sql.catalyst.util.DateTimeUtils$, 
> ObjectType(class java.sql.Date), toJavaDate, input[3, date, true], true, 
> false), StructField(TGTID,IntegerType,false), 
> StructField(TGT_NAME,StringType,false), 
> StructField(TGT_ADDRESS,StringType,true), StructField(TGT_DOB,DateType,true)) 
> at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Deserializer.apply(ExpressionEncoder.scala:188)
>  at 
> org.apache.spark.sql.delta.commands.MergeIntoCommand$JoinedRowProcessor.$anonfun$processPartition$9(MergeIntoCommand.scala:565)
>  at scala.collection.Iterator$$anon$10.next(Iterator.scala:459) at 
> scala.collection.Iterator$$anon$10.next(Iterator.scala:459) at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown
>  Source) at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
>  at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:265)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:210)
>  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at 
> org.apache.spark.scheduler.Task.run(Task.scala:127) at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:462)
>  at org.apache.spark.ut

[jira] [Assigned] (SPARK-35440) Add language type to `ExpressionInfo` for UDF

2021-05-18 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35440:


Assignee: Apache Spark

> Add language type to `ExpressionInfo` for UDF
> -
>
> Key: SPARK-35440
> URL: https://issues.apache.org/jira/browse/SPARK-35440
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Linhong Liu
>Assignee: Apache Spark
>Priority: Major
>
> add "scala", "java", "python", "hive", "built-in"






[jira] [Assigned] (SPARK-35440) Add language type to `ExpressionInfo` for UDF

2021-05-18 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35440:


Assignee: (was: Apache Spark)

> Add language type to `ExpressionInfo` for UDF
> -
>
> Key: SPARK-35440
> URL: https://issues.apache.org/jira/browse/SPARK-35440
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Linhong Liu
>Priority: Major
>
> add "scala", "java", "python", "hive", "built-in"






[jira] [Commented] (SPARK-35440) Add language type to `ExpressionInfo` for UDF

2021-05-18 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347299#comment-17347299
 ] 

Apache Spark commented on SPARK-35440:
--

User 'linhongliu-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/32587

> Add language type to `ExpressionInfo` for UDF
> -
>
> Key: SPARK-35440
> URL: https://issues.apache.org/jira/browse/SPARK-35440
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Linhong Liu
>Priority: Major
>
> add "scala", "java", "python", "hive", "built-in"






[jira] [Updated] (SPARK-35440) Add language type to `ExpressionInfo` for UDF

2021-05-18 Thread Linhong Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Linhong Liu updated SPARK-35440:

Description: add "scala", "java", "python", "hive", "built-in"

> Add language type to `ExpressionInfo` for UDF
> -
>
> Key: SPARK-35440
> URL: https://issues.apache.org/jira/browse/SPARK-35440
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Linhong Liu
>Priority: Major
>
> add "scala", "java", "python", "hive", "built-in"






[jira] [Created] (SPARK-35440) Add language type to `ExpressionInfo` for UDF

2021-05-18 Thread Linhong Liu (Jira)
Linhong Liu created SPARK-35440:
---

 Summary: Add language type to `ExpressionInfo` for UDF
 Key: SPARK-35440
 URL: https://issues.apache.org/jira/browse/SPARK-35440
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Linhong Liu









[jira] [Resolved] (SPARK-35398) Simplify the way to get classes from ClassBodyEvaluator in CodeGenerator.updateAndGetCompilationStats method

2021-05-18 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-35398.
--
Fix Version/s: 3.2.0
 Assignee: Yang Jie
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/32536

> Simplify the way to get classes from ClassBodyEvaluator in 
> CodeGenerator.updateAndGetCompilationStats method
> 
>
> Key: SPARK-35398
> URL: https://issues.apache.org/jira/browse/SPARK-35398
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Trivial
> Fix For: 3.2.0
>
>
> SPARK-35253 upgraded Janino from 3.0.16 to 3.1.4. In this version, 
> {{ClassBodyEvaluator}} provides the {{getBytecodes}} method, which returns 
> the mapping from {{ClassFile.getThisClassName}} to {{ClassFile.toByteArray}} 
> directly, so we no longer need to obtain this variable via the reflection 
> API.
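
A rough Scala sketch of the simplification being described; the exact shape of getBytecodes (a map from class name to class-file bytes) is taken from the description above and should be treated as an assumption:

{code:java}
// Sketch only. Previously the compiled class bytes had to be pulled out of the
// evaluator's internals via the reflection API; janino 3.1.4 exposes them
// directly through getBytecodes (assumed: class name -> class-file bytes).
import org.codehaus.janino.ClassBodyEvaluator

object CompilationStatsSketch {
  def logCompiledClassSizes(evaluator: ClassBodyEvaluator): Unit = {
    evaluator.getBytecodes.forEach { (className, bytes) =>
      println(s"$className compiled to ${bytes.length} bytes")
    }
  }
}
{code}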






[jira] [Assigned] (SPARK-35439) Use LinkedHashMap as the map of equivalent expressions to preserve insertion order

2021-05-18 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35439:


Assignee: Apache Spark  (was: L. C. Hsieh)

> Use LinkedHashMap as the map of equivalent expressions to preserve insertion 
> order
> --
>
> Key: SPARK-35439
> URL: https://issues.apache.org/jira/browse/SPARK-35439
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.1, 3.2.0
>Reporter: L. C. Hsieh
>Assignee: Apache Spark
>Priority: Minor
>
> EquivalentExpressions maintains a map of equivalent expressions. It is 
> currently a HashMap, so the insertion order is not guaranteed to be preserved 
> later. Subexpression elimination relies on retrieving subexpressions from the 
> map. If there are child-parent relationships among the subexpressions, we 
> want the child expressions to come before the parent expressions, so we can 
> replace child expressions inside parent expressions with the subexpression 
> evaluation.
> Although we add expressions recursively into the map with a depth-first 
> approach, when we retrieve the map values, it is not guaranteed that the 
> order is preserved. We should use LinkedHashMap for this usage.






[jira] [Assigned] (SPARK-35439) Use LinkedHashMap as the map of equivalent expressions to preserve insertion order

2021-05-18 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35439:


Assignee: Apache Spark  (was: L. C. Hsieh)

> Use LinkedHashMap as the map of equivalent expressions to preserve insertion 
> order
> --
>
> Key: SPARK-35439
> URL: https://issues.apache.org/jira/browse/SPARK-35439
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.1, 3.2.0
>Reporter: L. C. Hsieh
>Assignee: Apache Spark
>Priority: Minor
>
> EquivalentExpressions maintains a map of equivalent expressions. It is 
> currently a HashMap, so the insertion order is not guaranteed to be preserved 
> later. Subexpression elimination relies on retrieving subexpressions from the 
> map. If there are child-parent relationships among the subexpressions, we 
> want the child expressions to come before the parent expressions, so we can 
> replace child expressions inside parent expressions with the subexpression 
> evaluation.
> Although we add expressions recursively into the map with a depth-first 
> approach, when we retrieve the map values, it is not guaranteed that the 
> order is preserved. We should use LinkedHashMap for this usage.






[jira] [Assigned] (SPARK-35439) Use LinkedHashMap as the map of equivalent expressions to preserve insertion order

2021-05-18 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35439:


Assignee: L. C. Hsieh  (was: Apache Spark)

> Use LinkedHashMap as the map of equivalent expressions to preserve insertion 
> order
> --
>
> Key: SPARK-35439
> URL: https://issues.apache.org/jira/browse/SPARK-35439
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.1, 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Minor
>
> EquivalentExpressions maintains a map of equivalent expressions. It is 
> currently a HashMap, so the insertion order is not guaranteed to be preserved 
> later. Subexpression elimination relies on retrieving subexpressions from the 
> map. If there are child-parent relationships among the subexpressions, we 
> want the child expressions to come before the parent expressions, so we can 
> replace child expressions inside parent expressions with the subexpression 
> evaluation.
> Although we add expressions recursively into the map with a depth-first 
> approach, when we retrieve the map values, it is not guaranteed that the 
> order is preserved. We should use LinkedHashMap for this usage.






[jira] [Commented] (SPARK-35439) Use LinkedHashMap as the map of equivalent expressions to preserve insertion order

2021-05-18 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347286#comment-17347286
 ] 

Apache Spark commented on SPARK-35439:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/32586

> Use LinkedHashMap as the map of equivalent expressions to preserve insertion 
> order
> --
>
> Key: SPARK-35439
> URL: https://issues.apache.org/jira/browse/SPARK-35439
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.2, 3.1.1, 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Minor
>
> EquivalentExpressions maintains a map of equivalent expressions. It is 
> currently a HashMap, so the insertion order is not guaranteed to be preserved 
> later. Subexpression elimination relies on retrieving subexpressions from the 
> map. If there are child-parent relationships among the subexpressions, we 
> want the child expressions to come before the parent expressions, so we can 
> replace child expressions inside parent expressions with the subexpression 
> evaluation.
> Although we add expressions recursively into the map with a depth-first 
> approach, when we retrieve the map values, it is not guaranteed that the 
> order is preserved. We should use LinkedHashMap for this usage.






[jira] [Created] (SPARK-35439) Use LinkedHashMap as the map of equivalent expressions to preserve insertion order

2021-05-18 Thread L. C. Hsieh (Jira)
L. C. Hsieh created SPARK-35439:
---

 Summary: Use LinkedHashMap as the map of equivalent expressions to 
preserve insertion order
 Key: SPARK-35439
 URL: https://issues.apache.org/jira/browse/SPARK-35439
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.1, 3.0.2, 3.2.0
Reporter: L. C. Hsieh
Assignee: L. C. Hsieh


EquivalentExpressions maintains a map of equivalent expressions. It is 
currently a HashMap, so the insertion order is not guaranteed to be preserved 
later. Subexpression elimination relies on retrieving subexpressions from the 
map. If there are child-parent relationships among the subexpressions, we want 
the child expressions to come before the parent expressions, so we can replace 
child expressions inside parent expressions with the subexpression evaluation.

Although we add expressions recursively into the map with a depth-first 
approach, when we retrieve the map values, it is not guaranteed that the order 
is preserved. We should use LinkedHashMap for this usage.






[jira] [Resolved] (SPARK-35263) Refactor ShuffleBlockFetcherIteratorSuite to reduce duplicated code

2021-05-18 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-35263.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32389
[https://github.com/apache/spark/pull/32389]

> Refactor ShuffleBlockFetcherIteratorSuite to reduce duplicated code
> ---
>
> Key: SPARK-35263
> URL: https://issues.apache.org/jira/browse/SPARK-35263
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Tests
>Affects Versions: 3.1.1
>Reporter: Erik Krogen
>Assignee: Erik Krogen
>Priority: Major
> Fix For: 3.2.0
>
>
> {{ShuffleBlockFetcherIteratorSuite}} has tons of duplicate code, like:
> {code}
> val iterator = new ShuffleBlockFetcherIterator(
>   taskContext,
>   transfer,
>   blockManager,
>   blocksByAddress,
>   (_, in) => in,
>   48 * 1024 * 1024,
>   Int.MaxValue,
>   Int.MaxValue,
>   Int.MaxValue,
>   true,
>   false,
>   metrics,
>   false)
> {code}
> It's challenging to tell what the interesting parts are vs. what is just 
> being set to some default/unused value.
> Similarly but not as bad, there are 10 calls like:
> {code}
> verify(transfer, times(1)).fetchBlocks(any(), any(), any(), any(), any(), 
> any())
> {code}
> and 7 like
> {code}
> when(transfer.fetchBlocks(any(), any(), any(), any(), any(), 
> any())).thenAnswer ...
> {code}
> This can result in about 10% reduction in both lines and characters in the 
> file:
> {code}
> # Before
> > wc 
> > core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala
>     1063    3950   43201 
> core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala
> # After
> > wc 
> > core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala
>      928    3609   39053 
> core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala
> {code}
> It also helps readability:
> {code}
> val iterator = createShuffleBlockIteratorWithDefaults(
>   transfer,
>   blocksByAddress,
>   maxBytesInFlight = 1000L
> )
> {code}
> Now I can clearly tell that {{maxBytesInFlight}} is the main parameter we're 
> interested in here.
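
The pattern described here is essentially a test helper whose parameters all have defaults, so a call site spells out only the value under test. A generic Scala illustration (simplified, hypothetical names; not the suite's actual helper signature):

{code:java}
// Illustrative pattern only: named parameters with defaults hide the
// uninteresting arguments, so call sites show just the values being exercised.
object FetcherTestHelpers {
  final case class FetchSettings(
      maxBytesInFlight: Long = 48L * 1024 * 1024,
      maxReqsInFlight: Int = Int.MaxValue,
      maxBlocksInFlightPerAddress: Int = Int.MaxValue,
      detectCorrupt: Boolean = true)

  // A real helper would build the iterator from these settings; the sketch
  // just returns them so the call-site shape is visible.
  def createIteratorWithDefaults(settings: FetchSettings = FetchSettings()): FetchSettings =
    settings

  // Call site: only the interesting parameter appears.
  val example = createIteratorWithDefaults(FetchSettings(maxBytesInFlight = 1000L))
}
{code}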






[jira] [Assigned] (SPARK-35263) Refactor ShuffleBlockFetcherIteratorSuite to reduce duplicated code

2021-05-18 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-35263:
---

Assignee: Erik Krogen

> Refactor ShuffleBlockFetcherIteratorSuite to reduce duplicated code
> ---
>
> Key: SPARK-35263
> URL: https://issues.apache.org/jira/browse/SPARK-35263
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Tests
>Affects Versions: 3.1.1
>Reporter: Erik Krogen
>Assignee: Erik Krogen
>Priority: Major
>
> {{ShuffleBlockFetcherIteratorSuite}} has tons of duplicate code, like:
> {code}
> val iterator = new ShuffleBlockFetcherIterator(
>   taskContext,
>   transfer,
>   blockManager,
>   blocksByAddress,
>   (_, in) => in,
>   48 * 1024 * 1024,
>   Int.MaxValue,
>   Int.MaxValue,
>   Int.MaxValue,
>   true,
>   false,
>   metrics,
>   false)
> {code}
> It's challenging to tell what the interesting parts are vs. what is just 
> being set to some default/unused value.
> Similarly but not as bad, there are 10 calls like:
> {code}
> verify(transfer, times(1)).fetchBlocks(any(), any(), any(), any(), any(), 
> any())
> {code}
> and 7 like
> {code}
> when(transfer.fetchBlocks(any(), any(), any(), any(), any(), 
> any())).thenAnswer ...
> {code}
> This can result in about 10% reduction in both lines and characters in the 
> file:
> {code}
> # Before
> > wc 
> > core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala
>     1063    3950   43201 
> core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala
> # After
> > wc 
> > core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala
>      928    3609   39053 
> core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala
> {code}
> It also helps readability:
> {code}
> val iterator = createShuffleBlockIteratorWithDefaults(
>   transfer,
>   blocksByAddress,
>   maxBytesInFlight = 1000L
> )
> {code}
> Now I can clearly tell that {{maxBytesInFlight}} is the main parameter we're 
> interested in here.






[jira] [Resolved] (SPARK-35370) IllegalArgumentException when loading a PipelineModel with Spark 3

2021-05-18 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-35370.
--
Resolution: Not A Problem

> IllegalArgumentException when loading a PipelineModel with Spark 3
> --
>
> Key: SPARK-35370
> URL: https://issues.apache.org/jira/browse/SPARK-35370
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 3.1.0, 3.1.1
> Environment: spark 3.1.1
>Reporter: Avenash Kabeera
>Priority: Minor
>  Labels: V3, decisiontree, scala, treemodels
>
> Hi, 
> This is a follow-up to the issue 
> https://issues.apache.org/jira/browse/SPARK-33398, which fixed an exception 
> when loading a model in Spark 3 that was trained in Spark 2. After 
> incorporating this fix in my project, I ran into another issue that was 
> introduced by the fix [https://github.com/apache/spark/pull/30889/files].
> While loading my random forest model, which was trained in Spark 2.2, I ran 
> into the following exception:
> {code:java}
> 16:03:34 ERROR Instrumentation:73 - java.lang.IllegalArgumentException: 
> nodeData does not exist. Available: treeid, nodedata
>  at 
> org.apache.spark.sql.types.StructType.$anonfun$apply$1(StructType.scala:278)
>  at scala.collection.immutable.Map$Map2.getOrElse(Map.scala:147)
>  at org.apache.spark.sql.types.StructType.apply(StructType.scala:277)
>  at 
> org.apache.spark.ml.tree.EnsembleModelReadWrite$.loadImpl(treeModels.scala:522)
>  at 
> org.apache.spark.ml.classification.RandomForestClassificationModel$RandomForestClassificationModelReader.load(RandomForestClassifier.scala:420)
>  at 
> org.apache.spark.ml.classification.RandomForestClassificationModel$RandomForestClassificationModelReader.load(RandomForestClassifier.scala:410)
>  at 
> org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$load$5(Pipeline.scala:277)
>  at org.apache.spark.ml.MLEvents.withLoadInstanceEvent(events.scala:160)
>  at org.apache.spark.ml.MLEvents.withLoadInstanceEvent$(events.scala:155)
>  at 
> org.apache.spark.ml.util.Instrumentation.withLoadInstanceEvent(Instrumentation.scala:42)
>  at 
> org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$load$4(Pipeline.scala:277)
>  at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>  at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
>  at 
> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
>  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
>  at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>  at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>  at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
>  at 
> org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$load$3(Pipeline.scala:274)
>  at 
> org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191)
>  at scala.util.Try$.apply(Try.scala:213)
>  at 
> org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191)
>  at org.apache.spark.ml.Pipeline$SharedReadWrite$.load(Pipeline.scala:268)
>  at 
> org.apache.spark.ml.PipelineModel$PipelineModelReader.$anonfun$load$7(Pipeline.scala:356)
>  at org.apache.spark.ml.MLEvents.withLoadInstanceEvent(events.scala:160)
>  at org.apache.spark.ml.MLEvents.withLoadInstanceEvent$(events.scala:155)
>  at 
> org.apache.spark.ml.util.Instrumentation.withLoadInstanceEvent(Instrumentation.scala:42)
>  at 
> org.apache.spark.ml.PipelineModel$PipelineModelReader.$anonfun$load$6(Pipeline.scala:355)
>  at 
> org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191)
>  at scala.util.Try$.apply(Try.scala:213)
>  at 
> org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191)
>  at 
> org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:355)
>  at 
> org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:349)
>  at org.apache.spark.ml.util.MLReadable.load(ReadWrite.scala:355)
>  at org.apache.spark.ml.util.MLReadable.load$(ReadWrite.scala:355)
>  at org.apache.spark.ml.PipelineModel$.load(Pipeline.scala:337){code}
> When I looked at the data for the model, I saw that the schema uses 
> "*nodedata*" instead of "*nodeData*". Here is what my model looks like:
> {code:java}
> +--+-+
> |treeid|nodedata  
>|
> +--+-+
> |12|{0, 1.0, 0.20578590428109744, [2492

[jira] [Commented] (SPARK-35129) Construct year-month interval column from integral fields

2021-05-18 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347271#comment-17347271
 ] 

angerszhu commented on SPARK-35129:
---

Can I take this over if there is no update for a long time?

> Construct year-month interval column from integral fields
> -
>
> Key: SPARK-35129
> URL: https://issues.apache.org/jira/browse/SPARK-35129
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Priority: Major
>
> Create a new function similar to make_interval() (or extend the 
> make_interval() function) which can construct YearMonthIntervalType values 
> from the year and month fields.






[jira] [Commented] (SPARK-35438) Minor documentation fix for window physical operator

2021-05-18 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347219#comment-17347219
 ] 

Apache Spark commented on SPARK-35438:
--

User 'c21' has created a pull request for this issue:
https://github.com/apache/spark/pull/32585

> Minor documentation fix for window physical operator
> 
>
> Key: SPARK-35438
> URL: https://issues.apache.org/jira/browse/SPARK-35438
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Cheng Su
>Priority: Trivial
>
> As the title says. This fixes two places where the documentation has errors, 
> to help people read the code more easily in the future.






[jira] [Assigned] (SPARK-35438) Minor documentation fix for window physical operator

2021-05-18 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35438:


Assignee: Apache Spark

> Minor documentation fix for window physical operator
> 
>
> Key: SPARK-35438
> URL: https://issues.apache.org/jira/browse/SPARK-35438
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Cheng Su
>Assignee: Apache Spark
>Priority: Trivial
>
> As the title says. This fixes two places where the documentation has errors, 
> to help people read the code more easily in the future.






[jira] [Assigned] (SPARK-35438) Minor documentation fix for window physical operator

2021-05-18 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35438:


Assignee: (was: Apache Spark)

> Minor documentation fix for window physical operator
> 
>
> Key: SPARK-35438
> URL: https://issues.apache.org/jira/browse/SPARK-35438
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Cheng Su
>Priority: Trivial
>
> As the title says. This fixes two places where the documentation has errors, 
> to help people read the code more easily in the future.






[jira] [Commented] (SPARK-35438) Minor documentation fix for window physical operator

2021-05-18 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347218#comment-17347218
 ] 

Apache Spark commented on SPARK-35438:
--

User 'c21' has created a pull request for this issue:
https://github.com/apache/spark/pull/32585

> Minor documentation fix for window physical operator
> 
>
> Key: SPARK-35438
> URL: https://issues.apache.org/jira/browse/SPARK-35438
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Cheng Su
>Priority: Trivial
>
> As the title says. This fixes two places where the documentation has errors, 
> to help people read the code more easily in the future.






[jira] [Created] (SPARK-35438) Minor documentation fix for window physical operator

2021-05-18 Thread Cheng Su (Jira)
Cheng Su created SPARK-35438:


 Summary: Minor documentation fix for window physical operator
 Key: SPARK-35438
 URL: https://issues.apache.org/jira/browse/SPARK-35438
 Project: Spark
  Issue Type: Documentation
  Components: SQL
Affects Versions: 3.2.0
Reporter: Cheng Su


As the title says. This fixes two places where the documentation has errors, to 
help people read the code more easily in the future.






[jira] [Resolved] (SPARK-35305) Upgrade ZooKeeper to 3.7.0

2021-05-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-35305.
---
Resolution: Later

According to the discussion on the PR, we already upgraded the Netty library to 
remove those vulnerabilities, so we don't need to upgrade ZooKeeper.
- https://github.com/apache/spark/pull/32572#pullrequestreview-661653211

> Upgrade ZooKeeper to 3.7.0
> --
>
> Key: SPARK-35305
> URL: https://issues.apache.org/jira/browse/SPARK-35305
> Project: Spark
>  Issue Type: Dependency upgrade
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Hari Prasad G
>Priority: Major
>
> Upgrade ZooKeeper to 3.7.0 to fix the vulnerabilities.
> *List of CVE's:*
>  * CVE-2021-21295
>  * CVE-2021-21290
>  * CVE-2021-21409
>  






[jira] [Updated] (SPARK-35425) Pin jinja2 in spark-rm/Dockerfile and add as a required dependency in the release README.md

2021-05-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-35425:
--
Target Version/s: 3.0.3, 3.1.2, 3.2.0  (was: 3.20)

> Pin jinja2 in spark-rm/Dockerfile and add as a required dependency in the 
> release README.md
> ---
>
> Key: SPARK-35425
> URL: https://issues.apache.org/jira/browse/SPARK-35425
> Project: Spark
>  Issue Type: Bug
>  Components: docs
>Affects Versions: 3.0.2, 3.1.1, 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 3.0.3, 3.1.2, 3.2.0
>
>
> SPARK-35375 pinned the version of Jinja2 to <3.0.0, so it's good to note this 
> in docs/README.md.






[jira] [Updated] (SPARK-35425) Pin jinja2 in spark-rm/Dockerfile and add as a required dependency in the release README.md

2021-05-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-35425:
--
Issue Type: Bug  (was: Improvement)

> Pin jinja2 in spark-rm/Dockerfile and add as a required dependency in the 
> release README.md
> ---
>
> Key: SPARK-35425
> URL: https://issues.apache.org/jira/browse/SPARK-35425
> Project: Spark
>  Issue Type: Bug
>  Components: docs
>Affects Versions: 3.0.2, 3.1.1, 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 3.0.3, 3.1.2, 3.2.0
>
>
> SPARK-35375 pinned the version of Jinja2 to <3.0.0, so it's good to note this 
> in docs/README.md.






[jira] [Updated] (SPARK-35425) Pin jinja2 in spark-rm/Dockerfile and add as a required dependency in the release README.md

2021-05-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-35425:
--
Affects Version/s: 3.0.2
   3.1.1

> Pin jinja2 in spark-rm/Dockerfile and add as a required dependency in the 
> release README.md
> ---
>
> Key: SPARK-35425
> URL: https://issues.apache.org/jira/browse/SPARK-35425
> Project: Spark
>  Issue Type: Improvement
>  Components: docs
>Affects Versions: 3.0.2, 3.1.1, 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 3.0.3, 3.1.2, 3.2.0
>
>
> SPARK-35375 pinned the version of Jinja2 to <3.0.0, so it's good to note this 
> in docs/README.md.






[jira] [Updated] (SPARK-35425) Pin jinja2 in spark-rm/Dockerfile and add as a required dependency in the release README.md

2021-05-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-35425:
--
Fix Version/s: 3.2.0
   3.1.2
   3.0.3

> Pin jinja2 in spark-rm/Dockerfile and add as a required dependency in the 
> release README.md
> ---
>
> Key: SPARK-35425
> URL: https://issues.apache.org/jira/browse/SPARK-35425
> Project: Spark
>  Issue Type: Improvement
>  Components: docs
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 3.0.3, 3.1.2, 3.2.0
>
>
> SPARK-35375 confined the version of Jinja2 to <3.0.0.
> So it's worth noting this in docs/README.md.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35425) Pin jinja2 in spark-rm/Dockerfile and add as a required dependency in the release README.md

2021-05-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-35425:
--
Summary: Pin jinja2 in spark-rm/Dockerfile and add as a required dependency 
in the release README.md  (was: Add note about Jinja2 as a required dependency 
for document build.)

> Pin jinja2 in spark-rm/Dockerfile and add as a required dependency in the 
> release README.md
> ---
>
> Key: SPARK-35425
> URL: https://issues.apache.org/jira/browse/SPARK-35425
> Project: Spark
>  Issue Type: Improvement
>  Components: docs
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> SPARK-35375 confined the version of Jinja2 to <3.0.0.
> So it's worth noting this in docs/README.md.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35434) Upgrade scalatestplus artifacts to 3.2.9.0

2021-05-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-35434:
--
Affects Version/s: (was: 3.20)
   3.2.0

> Upgrade scalatestplus artifacts to 3.2.9.0
> --
>
> Key: SPARK-35434
> URL: https://issues.apache.org/jira/browse/SPARK-35434
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 3.2.0
>
>
> scalatestplus artifacts seem to be renamed and their latest release supports 
> dotty.
> * https://github.com/scalatest/scalatestplus-scalacheck
> * https://github.com/scalatest/scalatestplus-mockito
> * https://github.com/scalatest/scalatestplus-selenium



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35434) Upgrade scalatestplus artifacts to 3.2.9.0

2021-05-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-35434:
--
Parent: SPARK-25075
Issue Type: Sub-task  (was: Improvement)

> Upgrade scalatestplus artifacts to 3.2.9.0
> --
>
> Key: SPARK-35434
> URL: https://issues.apache.org/jira/browse/SPARK-35434
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.20
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 3.2.0
>
>
> scalatestplus artifacts seem to be renamed and their latest release supports 
> dotty.
> * https://github.com/scalatest/scalatestplus-scalacheck
> * https://github.com/scalatest/scalatestplus-mockito
> * https://github.com/scalatest/scalatestplus-selenium



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35434) Upgrade scalatestplus artifacts to 3.2.9.0

2021-05-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-35434.
---
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32581
[https://github.com/apache/spark/pull/32581]

> Upgrade scalatestplus artifacts to 3.2.9.0
> --
>
> Key: SPARK-35434
> URL: https://issues.apache.org/jira/browse/SPARK-35434
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.20
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 3.2.0
>
>
> scalatestplus artifacts seem to be renamed and their latest release supports 
> dotty.
> * https://github.com/scalatest/scalatestplus-scalacheck
> * https://github.com/scalatest/scalatestplus-mockito
> * https://github.com/scalatest/scalatestplus-selenium



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35437) Hive partition filtering client optimization

2021-05-18 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347017#comment-17347017
 ] 

Apache Spark commented on SPARK-35437:
--

User 'cxzl25' has created a pull request for this issue:
https://github.com/apache/spark/pull/32583

> Hive partition filtering client optimization
> 
>
> Key: SPARK-35437
> URL: https://issues.apache.org/jira/browse/SPARK-35437
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: dzcxzl
>Priority: Minor
>
> When we have a table with a lot of partitions and there is no way to filter 
> it on the MetaStore Server, we will get all the partition details and filter 
> it on the client side. This is slow and puts a lot of pressure on the 
> MetaStore Server.
> We can first pull all the partition names, filter by expressions, and then 
> obtain detailed information about the corresponding partitions from the 
> MetaStore Server.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35437) Hive partition filtering client optimization

2021-05-18 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35437:


Assignee: Apache Spark

> Hive partition filtering client optimization
> 
>
> Key: SPARK-35437
> URL: https://issues.apache.org/jira/browse/SPARK-35437
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: dzcxzl
>Assignee: Apache Spark
>Priority: Minor
>
> When we have a table with a lot of partitions and there is no way to filter 
> it on the MetaStore Server, we will get all the partition details and filter 
> it on the client side. This is slow and puts a lot of pressure on the 
> MetaStore Server.
> We can first pull all the partition names, filter by expressions, and then 
> obtain detailed information about the corresponding partitions from the 
> MetaStore Server.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35437) Hive partition filtering client optimization

2021-05-18 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35437:


Assignee: (was: Apache Spark)

> Hive partition filtering client optimization
> 
>
> Key: SPARK-35437
> URL: https://issues.apache.org/jira/browse/SPARK-35437
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: dzcxzl
>Priority: Minor
>
> When we have a table with a lot of partitions and there is no way to filter 
> it on the MetaStore Server, we will get all the partition details and filter 
> it on the client side. This is slow and puts a lot of pressure on the 
> MetaStore Server.
> We can first pull all the partition names, filter by expressions, and then 
> obtain detailed information about the corresponding partitions from the 
> MetaStore Server.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34734) Update sbt version to 1.4.9

2021-05-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-34734:
--
Parent: SPARK-25075
Issue Type: Sub-task  (was: Improvement)

> Update sbt version to 1.4.9
> ---
>
> Key: SPARK-34734
> URL: https://issues.apache.org/jira/browse/SPARK-34734
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: William Hyun
>Assignee: William Hyun
>Priority: Major
> Fix For: 3.2.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34959) Upgrade SBT to 1.5.0

2021-05-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-34959:
--
Parent: SPARK-25075
Issue Type: Sub-task  (was: Improvement)

> Upgrade SBT to 1.5.0
> 
>
> Key: SPARK-34959
> URL: https://issues.apache.org/jira/browse/SPARK-34959
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.2.0
>
>
> This JIRA issue aims to upgrade SBT to 1.5.0 which has built-in Scala 3 
> support.
> https://github.com/sbt/sbt/releases/tag/v1.5.0



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35437) Hive partition filtering client optimization

2021-05-18 Thread dzcxzl (Jira)
dzcxzl created SPARK-35437:
--

 Summary: Hive partition filtering client optimization
 Key: SPARK-35437
 URL: https://issues.apache.org/jira/browse/SPARK-35437
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.1
Reporter: dzcxzl


When we have a table with a lot of partitions and there is no way to filter it 
on the MetaStore Server, we will get all the partition details and filter it on 
the client side. This is slow and puts a lot of pressure on the MetaStore 
Server.
We can first pull all the partition names, filter by expressions, and then 
obtain detailed information about the corresponding partitions from the 
MetaStore Server.
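
For illustration only, a minimal sketch of the proposed two-step pruning (the MetastoreClient trait and its method names below are hypothetical stand-ins, not the actual Hive metastore client API):
{code:java}
// Hypothetical stand-in for the metastore client; method names are illustrative only.
trait MetastoreClient {
  def listPartitionNames(db: String, table: String): Seq[String]
  def getPartitionsByNames(db: String, table: String, names: Seq[String]): Seq[AnyRef]
}

object PartitionPruningSketch {
  // "dt=2021-05-18/country=US" -> Map("dt" -> "2021-05-18", "country" -> "US")
  def parsePartitionValues(name: String): Map[String, String] =
    name.split("/").map(_.split("=", 2)).collect { case Array(k, v) => k -> v }.toMap

  def prunedPartitions(client: MetastoreClient, db: String, table: String)
                      (predicate: Map[String, String] => Boolean): Seq[AnyRef] = {
    val names    = client.listPartitionNames(db, table)                  // step 1: fetch lightweight partition names only
    val selected = names.filter(n => predicate(parsePartitionValues(n))) // step 2: evaluate the filter client-side on the names
    client.getPartitionsByNames(db, table, selected)                     // step 3: fetch details only for the matching partitions
  }
}
{code}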



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35411) Essential information missing in TreeNode json string

2021-05-18 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-35411.
-
Fix Version/s: 3.1.2
   3.2.0
   Resolution: Fixed

Issue resolved by pull request 32557
[https://github.com/apache/spark/pull/32557]

> Essential information missing in TreeNode json string
> -
>
> Key: SPARK-35411
> URL: https://issues.apache.org/jira/browse/SPARK-35411
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: huangtengfei
>Assignee: huangtengfei
>Priority: Minor
> Fix For: 3.2.0, 3.1.2
>
>
> TreeNode can be serialized to a json string with the method toJSON() or 
> prettyJson(). To avoid OOM issues, 
> [SPARK-17426|https://issues.apache.org/jira/browse/SPARK-17426] only keeps 
> part of the Seq data that can be written out to the resulting json string.
>  Essential data like 
> [cteRelations|https://github.com/apache/spark/blob/v3.1.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala#L497]
>  in node With, 
> [branches|https://github.com/apache/spark/blob/v3.1.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/conditionalExpressions.scala#L123]
>  in CaseWhen will be skipped and written out as null.
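
A quick way to see the effect from spark-shell (a hedged sketch; the exact JSON layout can differ between versions):
{code:java}
// Serialize an analyzed plan that contains a CaseWhen expression.
val df = spark.sql("SELECT CASE WHEN id > 1 THEN 'a' ELSE 'b' END FROM range(3)")

// TreeNode#prettyJson (or toJSON) renders the plan as JSON; before this fix the
// CaseWhen `branches` field could be written out as null instead of its contents.
println(df.queryExecution.analyzed.prettyJson)
{code}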



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35411) Essential information missing in TreeNode json string

2021-05-18 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-35411:
---

Assignee: huangtengfei

> Essential information missing in TreeNode json string
> -
>
> Key: SPARK-35411
> URL: https://issues.apache.org/jira/browse/SPARK-35411
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: huangtengfei
>Assignee: huangtengfei
>Priority: Minor
>
> TreeNode can be serialized to a json string with the method toJSON() or 
> prettyJson(). To avoid OOM issues, 
> [SPARK-17426|https://issues.apache.org/jira/browse/SPARK-17426] only keeps 
> part of the Seq data that can be written out to the resulting json string.
>  Essential data like 
> [cteRelations|https://github.com/apache/spark/blob/v3.1.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala#L497]
>  in node With, 
> [branches|https://github.com/apache/spark/blob/v3.1.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/conditionalExpressions.scala#L123]
>  in CaseWhen will be skipped and written out as null.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35436) RocksDBFileManager - save checkpoint to DFS

2021-05-18 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346978#comment-17346978
 ] 

Apache Spark commented on SPARK-35436:
--

User 'xuanyuanking' has created a pull request for this issue:
https://github.com/apache/spark/pull/32582

> RocksDBFileManager - save checkpoint to DFS
> ---
>
> Key: SPARK-35436
> URL: https://issues.apache.org/jira/browse/SPARK-35436
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: Yuanjian Li
>Priority: Major
>
> The implementation for the save operation of RocksDBFileManager.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35436) RocksDBFileManager - save checkpoint to DFS

2021-05-18 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346976#comment-17346976
 ] 

Apache Spark commented on SPARK-35436:
--

User 'xuanyuanking' has created a pull request for this issue:
https://github.com/apache/spark/pull/32582

> RocksDBFileManager - save checkpoint to DFS
> ---
>
> Key: SPARK-35436
> URL: https://issues.apache.org/jira/browse/SPARK-35436
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: Yuanjian Li
>Priority: Major
>
> The implementation for the save operation of RocksDBFileManager.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35436) RocksDBFileManager - save checkpoint to DFS

2021-05-18 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35436:


Assignee: Apache Spark

> RocksDBFileManager - save checkpoint to DFS
> ---
>
> Key: SPARK-35436
> URL: https://issues.apache.org/jira/browse/SPARK-35436
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: Yuanjian Li
>Assignee: Apache Spark
>Priority: Major
>
> The implementation for the save operation of RocksDBFileManager.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35436) RocksDBFileManager - save checkpoint to DFS

2021-05-18 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35436:


Assignee: (was: Apache Spark)

> RocksDBFileManager - save checkpoint to DFS
> ---
>
> Key: SPARK-35436
> URL: https://issues.apache.org/jira/browse/SPARK-35436
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: Yuanjian Li
>Priority: Major
>
> The implementation for the save operation of RocksDBFileManager.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35079) Transform with udf gives incorrect result

2021-05-18 Thread koert kuipers (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346970#comment-17346970
 ] 

koert kuipers commented on SPARK-35079:
---

looks to me like this is a duplicate of SPARK-34829

> Transform with udf gives incorrect result
> -
>
> Key: SPARK-35079
> URL: https://issues.apache.org/jira/browse/SPARK-35079
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: koert kuipers
>Priority: Minor
> Fix For: 3.1.2, 3.2.0
>
>
> I think this is a correctness bug in Spark 3.1.1.
> The behavior is correct in Spark 3.0.1.
> In Spark 3.0.1:
> {code:java}
> scala> import spark.implicits._
> scala> import org.apache.spark.sql.functions._
> scala> val x = Seq(Seq("aa", "bb", "cc")).toDF
> x: org.apache.spark.sql.DataFrame = [value: array<string>]
> scala> x.select(transform(col("value"), col => udf((_: String).drop(1)).apply(col))).show
> +---------------------------------------------------+
> |transform(value, lambdafunction(UDF(lambda 'x), x))|
> +---------------------------------------------------+
> |                                          [a, b, c]|
> +---------------------------------------------------+
> {code}
> In Spark 3.1.1:
> {code:java}
> scala> import spark.implicits._
> scala> import org.apache.spark.sql.functions._
> scala> val x = Seq(Seq("aa", "bb", "cc")).toDF
> x: org.apache.spark.sql.DataFrame = [value: array<string>]
> scala> x.select(transform(col("value"), col => udf((_: String).drop(1)).apply(col))).show
> +---------------------------------------------------+
> |transform(value, lambdafunction(UDF(lambda 'x), x))|
> +---------------------------------------------------+
> |                                          [c, c, c]|
> +---------------------------------------------------+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35436) RocksDBFileManager - save checkpoint to DFS

2021-05-18 Thread Yuanjian Li (Jira)
Yuanjian Li created SPARK-35436:
---

 Summary: RocksDBFileManager - save checkpoint to DFS
 Key: SPARK-35436
 URL: https://issues.apache.org/jira/browse/SPARK-35436
 Project: Spark
  Issue Type: Sub-task
  Components: Structured Streaming
Affects Versions: 3.2.0
Reporter: Yuanjian Li


The implementation for the save operation of RocksDBFileManager.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35435) Not able to capture driver and executor logs in console

2021-05-18 Thread Sugumar (Jira)
Sugumar created SPARK-35435:
---

 Summary: Not able to capture driver and executor logs in console
 Key: SPARK-35435
 URL: https://issues.apache.org/jira/browse/SPARK-35435
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit
Affects Versions: 2.2.0
 Environment: Production
Reporter: Sugumar
 Fix For: 2.2.0


I am running the Spark job using the command below, but I am not able to capture 
the driver and executor logs in the console.
{code}
# Run on a Spark standalone cluster in cluster deploy mode with supervise
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --deploy-mode cluster \
  --supervise \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar \
  1000
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35434) Upgrade scalatestplus artifacts to 3.2.9.0

2021-05-18 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35434:


Assignee: Apache Spark  (was: Kousuke Saruta)

> Upgrade scalatestplus artifacts to 3.2.9.0
> --
>
> Key: SPARK-35434
> URL: https://issues.apache.org/jira/browse/SPARK-35434
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.20
>Reporter: Kousuke Saruta
>Assignee: Apache Spark
>Priority: Minor
>
> scalatestplus artifacts seem to be renamed and their latest release supports 
> dotty.
> * https://github.com/scalatest/scalatestplus-scalacheck
> * https://github.com/scalatest/scalatestplus-mockito
> * https://github.com/scalatest/scalatestplus-selenium



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35434) Upgrade scalatestplus artifacts to 3.2.9.0

2021-05-18 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346888#comment-17346888
 ] 

Apache Spark commented on SPARK-35434:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/32581

> Upgrade scalatestplus artifacts to 3.2.9.0
> --
>
> Key: SPARK-35434
> URL: https://issues.apache.org/jira/browse/SPARK-35434
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.20
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> scalatestplus artifacts seem to be renamed and their latest release supports 
> dotty.
> * https://github.com/scalatest/scalatestplus-scalacheck
> * https://github.com/scalatest/scalatestplus-mockito
> * https://github.com/scalatest/scalatestplus-selenium



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35434) Upgrade scalatestplus artifacts to 3.2.9.0

2021-05-18 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35434:


Assignee: Kousuke Saruta  (was: Apache Spark)

> Upgrade scalatestplus artifacts to 3.2.9.0
> --
>
> Key: SPARK-35434
> URL: https://issues.apache.org/jira/browse/SPARK-35434
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.20
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> scalatestplus artifacts seem to be renamed and their latest release supports 
> dotty.
> * https://github.com/scalatest/scalatestplus-scalacheck
> * https://github.com/scalatest/scalatestplus-mockito
> * https://github.com/scalatest/scalatestplus-selenium



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35434) Upgrade scalatestplus artifacts to 3.2.9.0

2021-05-18 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346887#comment-17346887
 ] 

Apache Spark commented on SPARK-35434:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/32581

> Upgrade scalatestplus artifacts to 3.2.9.0
> --
>
> Key: SPARK-35434
> URL: https://issues.apache.org/jira/browse/SPARK-35434
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.20
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> scalatestplus artifacts seem to be renamed and their latest release supports 
> dotty.
> * https://github.com/scalatest/scalatestplus-scalacheck
> * https://github.com/scalatest/scalatestplus-mockito
> * https://github.com/scalatest/scalatestplus-selenium



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35434) Upgrade scalatestplus artifacts to 3.2.9.0

2021-05-18 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-35434:
---
Summary: Upgrade scalatestplus artifacts to 3.2.9.0  (was: Upgrade 
scalatestplus artifacts 3.2.9.0)

> Upgrade scalatestplus artifacts to 3.2.9.0
> --
>
> Key: SPARK-35434
> URL: https://issues.apache.org/jira/browse/SPARK-35434
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.20
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> scalatestplus artifacts seem to be renamed and their latest release supports 
> dotty.
> * https://github.com/scalatest/scalatestplus-scalacheck
> * https://github.com/scalatest/scalatestplus-mockito
> * https://github.com/scalatest/scalatestplus-selenium



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35434) Upgrade scalatestplus artifacts 3.2.9.0

2021-05-18 Thread Kousuke Saruta (Jira)
Kousuke Saruta created SPARK-35434:
--

 Summary: Upgrade scalatestplus artifacts 3.2.9.0
 Key: SPARK-35434
 URL: https://issues.apache.org/jira/browse/SPARK-35434
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.20
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta


scalatestplus artifacts seem to be renamed and their latest release supports 
dotty.

* https://github.com/scalatest/scalatestplus-scalacheck
* https://github.com/scalatest/scalatestplus-mockito
* https://github.com/scalatest/scalatestplus-selenium



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35433) Move CSV data source options from Python and Scala into a single page.

2021-05-18 Thread Haejoon Lee (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Haejoon Lee updated SPARK-35433:

Description: Refer to https://issues.apache.org/jira/browse/SPARK-34491

> Move CSV data source options from Python and Scala into a single page.
> --
>
> Key: SPARK-35433
> URL: https://issues.apache.org/jira/browse/SPARK-35433
> Project: Spark
>  Issue Type: Sub-task
>  Components: docs
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Refer to https://issues.apache.org/jira/browse/SPARK-34491



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35433) Move CSV data source options from Python and Scala into a single page.

2021-05-18 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-35433:
---

 Summary: Move CSV data source options from Python and Scala into a 
single page.
 Key: SPARK-35433
 URL: https://issues.apache.org/jira/browse/SPARK-35433
 Project: Spark
  Issue Type: Sub-task
  Components: docs
Affects Versions: 3.2.0
Reporter: Haejoon Lee






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18683) REST APIs for standalone Master、Workers and Applications

2021-05-18 Thread Mayank Asthana (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-18683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346739#comment-17346739
 ] 

Mayank Asthana commented on SPARK-18683:


Can we reopen this? My use case is to have Spark Streaming applications running 
all the time, and we want to check through a REST API whether these applications 
are still running and restart them if they are not.
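
For what it's worth, a hedged sketch of the kind of check meant here, assuming the standalone Master web UI's undocumented JSON output at http://<master-host>:8080/json/ is reachable (it is not a stable public API, which is exactly why proper REST APIs would help):
{code:java}
import scala.io.Source

// Crude liveness check against the Master web UI's JSON output; a real client
// would parse the JSON instead of doing a substring match, and would trigger a
// restart (e.g. via spark-submit) when the application is no longer listed.
def isAppListed(masterWebUi: String, appName: String): Boolean = {
  val json = Source.fromURL(s"$masterWebUi/json/").mkString
  json.contains(appName)
}

// isAppListed("http://master-host:8080", "my-streaming-app")
{code}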

> REST APIs for standalone Master、Workers and Applications
> 
>
> Key: SPARK-18683
> URL: https://issues.apache.org/jira/browse/SPARK-18683
> Project: Spark
>  Issue Type: Improvement
>Reporter: Shixiong Zhu
>Priority: Major
>  Labels: bulk-closed
>
> It would be great if we had some REST APIs to access Master, Workers and 
> Applications information. Right now the only way to get this information is 
> through the Web UI.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35425) Add note about Jinja2 as a required dependency for document build.

2021-05-18 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346717#comment-17346717
 ] 

Apache Spark commented on SPARK-35425:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/32580

> Add note about Jinja2 as a required dependency for document build.
> --
>
> Key: SPARK-35425
> URL: https://issues.apache.org/jira/browse/SPARK-35425
> Project: Spark
>  Issue Type: Improvement
>  Components: docs
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> SPARK-35375 confined the version of Jinja2 to <3.0.0.
> So it's worth noting this in docs/README.md.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35422) Many test cases failed in Scala 2.13 CI

2021-05-18 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-35422.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32577
[https://github.com/apache/spark/pull/32577]

> Many test cases failed in Scala 2.13 CI
> ---
>
> Key: SPARK-35422
> URL: https://issues.apache.org/jira/browse/SPARK-35422
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Assignee: Takeshi Yamamuro
>Priority: Major
> Fix For: 3.2.0
>
>
> [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-3.2-scala-2.13/lastCompletedBuild/testReport/]
> Failed tests listed in the Jenkins report (the pasted report table is truncated here):
> * org.apache.spark.sql.SQLQueryTestSuite.subquery/scalar-subquery/scalar-subquery-select.sql (2.4 sec)
> * org.apache.spark.sql.TPCDSModifiedPlanStabilitySuite.check simplified (tpcds-modifiedQueries/q46) (59 ms)
> * org.apache.spark.sql.TPCDSModifiedPlanStabilitySuite.check simplified (tpcds-modifiedQueries/q53) (62 ms)
> * org.apache.spark.sql.TPCDSModifiedPlanStabilitySuite.check simplified (tpcds-modifiedQueries/q63) (54 ms)
> * org.apache.spark.sql.TPCDSModifiedPlanStabilitySuite.check simplified (tpcds-modifiedQueries/q68) (50 ms)
> * org.apache.spark.sql.TPCDSModifiedPlanStabilitySuite.check simplified (tpcds-modifiedQueries/q73) (58 ms)
> * org.apache.spark.sql.TPCDSModifiedPlanStabilityWithStatsSuite.check simplified sf100 (tpcds-modifiedQueries/q46) (62 ms)
> * org.apache.spark.sql.TPCDSModifiedPlanStabilityWithStatsSuite.check simplified sf100 (tpcds-modifiedQueries/q53) (entry truncated)

[jira] [Assigned] (SPARK-35422) Many test cases failed in Scala 2.13 CI

2021-05-18 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-35422:
---

Assignee: Takeshi Yamamuro

> Many test cases failed in Scala 2.13 CI
> ---
>
> Key: SPARK-35422
> URL: https://issues.apache.org/jira/browse/SPARK-35422
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Assignee: Takeshi Yamamuro
>Priority: Major
>
> [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-3.2-scala-2.13/lastCompletedBuild/testReport/]
> Failed tests listed in the Jenkins report (the pasted report table is truncated here):
> * org.apache.spark.sql.SQLQueryTestSuite.subquery/scalar-subquery/scalar-subquery-select.sql (2.4 sec)
> * org.apache.spark.sql.TPCDSModifiedPlanStabilitySuite.check simplified (tpcds-modifiedQueries/q46) (59 ms)
> * org.apache.spark.sql.TPCDSModifiedPlanStabilitySuite.check simplified (tpcds-modifiedQueries/q53) (62 ms)
> * org.apache.spark.sql.TPCDSModifiedPlanStabilitySuite.check simplified (tpcds-modifiedQueries/q63) (54 ms)
> * org.apache.spark.sql.TPCDSModifiedPlanStabilitySuite.check simplified (tpcds-modifiedQueries/q68) (50 ms)
> * org.apache.spark.sql.TPCDSModifiedPlanStabilitySuite.check simplified (tpcds-modifiedQueries/q73) (58 ms)
> * org.apache.spark.sql.TPCDSModifiedPlanStabilityWithStatsSuite.check simplified sf100 (tpcds-modifiedQueries/q46) (62 ms)
> * org.apache.spark.sql.TPCDSModifiedPlanStabilityWithStatsSuite.check simplified sf100 (tpcds-modifiedQueries/q53) (57 ms; remainder of the table truncated)

[jira] [Assigned] (SPARK-35389) Analyzer should set progagateNull to false for magic function invocation

2021-05-18 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-35389:
---

Assignee: Chao Sun

> Analyzer should set progagateNull to false for magic function invocation
> 
>
> Key: SPARK-35389
> URL: https://issues.apache.org/jira/browse/SPARK-35389
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
>
> For both {{Invoke}} and {{StaticInvoke}} used by magic method of 
> {{ScalarFunction}}, we should set {{propgateNull}} to false, so that null 
> values will be passed to the UDF for evaluation, instead of bypassing that 
> and directly return null. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35389) Analyzer should set progagateNull to false for magic function invocation

2021-05-18 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-35389.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32553
[https://github.com/apache/spark/pull/32553]

> Analyzer should set progagateNull to false for magic function invocation
> 
>
> Key: SPARK-35389
> URL: https://issues.apache.org/jira/browse/SPARK-35389
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
> Fix For: 3.2.0
>
>
> For both {{Invoke}} and {{StaticInvoke}} used by magic method of 
> {{ScalarFunction}}, we should set {{propgateNull}} to false, so that null 
> values will be passed to the UDF for evaluation, instead of bypassing that 
> and directly return null. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35089) non consistent results running count for same dataset after filter and lead window function

2021-05-18 Thread Domagoj (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Domagoj updated SPARK-35089:

Description: 
   edit 2021-05-18

I have made it simpler to reproduce; I've put already generated data on an s3 
bucket that is publicly available, with 24.000.000 records.

Now all you need to do is run this code:
{code:java}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql._
import org.apache.spark.sql.functions._

val w = Window.partitionBy("user").orderBy("start")
val ts_lead = coalesce(lead("start", 1) .over(w), lit(3000))

spark.read.orc("s3://dtonzetic-spark-sample-data/sample-data.orc").
 withColumn("end", ts_lead).
 withColumn("duration", col("end")-col("start")).
 where("type='TypeA' and duration>4").count()
{code}
 

these were my results:
 - run 1: 2547559
 - run 2: 2547559
 - run 3: 2547560
 - run 4: 2547558
 - run 5: 2547558
 - run 6: 2547559
 - run 7: 2547558

These results are from a new EMR cluster, version 6.3.0, so nothing has changed.

   end edit 2021-05-18

I have found an inconsistency in count() results after a lead window function 
and a filter.

 

I have a dataframe (this is a simplified version, but it's enough to reproduce) 
with millions of records, with these columns:
 * df1:
 ** start(timestamp)
 ** user_id(int)
 ** type(string)

I need to define the duration between two rows, and filter on that duration and 
type. I used the lead window function to get the next event time (which defines 
the end of the current event), so every row now gets start and stop times. If it 
is NULL (the last row, for example), the next midnight is added as the stop. Data 
is stored in an ORC file (tried with Parquet format, no difference).

This only happens with multiple cluster nodes, for example an AWS EMR cluster or 
a local docker cluster setup. If I run it on a single instance (locally on a 
laptop), I get consistent results every time. The Spark version is 3.0.1, both in 
AWS and in the local docker setup.

Here is some simple code that you can use to reproduce it; I've used a JupyterLab 
notebook on AWS EMR. The Spark version is 3.0.1.

 

 
{code:java}
import org.apache.spark.sql.expressions.Window

// this dataframe generation code should be executed only once, and data have 
to be saved, and then opened from disk, so it's always same.

val getRandomUser = udf(()=>{
val users = Seq("John","Eve","Anna","Martin","Joe","Steve","Katy")
   users(scala.util.Random.nextInt(7))
})

val getRandomType = udf(()=>{
val types = Seq("TypeA","TypeB","TypeC","TypeD","TypeE")
types(scala.util.Random.nextInt(5))
})

val getRandomStart = udf((x:Int)=>{
x+scala.util.Random.nextInt(47)
})
// for loop is used to avoid out of memory error during creation of dataframe
for( a <- 0 to 23){
// use iterator a to continue with next million, repeat 1 mil times
val x=Range(a*100,(a*100)+100).toDF("id").
withColumn("start",getRandomStart(col("id"))).
withColumn("user",getRandomUser()).
withColumn("type",getRandomType()).
drop("id")

x.write.mode("append").orc("hdfs:///random.orc")
}

// above code should be run only once, I used a cell in Jupyter

// define window and lead
val w = Window.partitionBy("user").orderBy("start")
// if null, replace with 30.000.000
val ts_lead = coalesce(lead("start", 1) .over(w), lit(3000))

// read data to dataframe, create stop column and calculate duration
val fox2 = spark.read.orc("hdfs:///random.orc").
withColumn("end", ts_lead).
withColumn("duration", col("end")-col("start"))


// repeated executions of this line returns different results for count 
// I have it in separate cell in JupyterLab
fox2.where("type='TypeA' and duration>4").count()
{code}
My results for three consecutive runs of the last line were:
 * run 1: 2551259
 * run 2: 2550756
 * run 3: 2551279

It's very important to say that if I use either filter:

fox2.where("type='TypeA' ")

or 

fox2.where("duration>4"),

each of them can be executed repeatedly and I get a consistent result every time.

I can save the dataframe after creating the stop and duration columns, and after 
that, I get consistent results every time.

It is not a very practical workaround, as I need a lot of space and time to 
implement it.

This dataset is really big (in my eyes at least, approx. 100.000.000 new records 
per day).

If I run this same example on my local machine using master = local[*], 
everything works as expected; it only happens with a cluster setup. I tried to 
create a cluster using docker on my local machine, created 3.0.1 and 3.1.1 
clusters with one master and two workers, and successfully reproduced the issue.
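
A hedged sketch of the workaround described above (materialize the derived columns once, then count on the saved copy; it reuses ts_lead and the imports from the snippets above, and the output path is illustrative):
{code:java}
// Materialize "end" and "duration", then run the count on the materialized data.
val enriched = spark.read.orc("hdfs:///random.orc").
  withColumn("end", ts_lead).
  withColumn("duration", col("end") - col("start"))

enriched.write.mode("overwrite").orc("hdfs:///random_with_duration.orc")  // extra space and time

val materialized = spark.read.orc("hdfs:///random_with_duration.orc")
materialized.where("type='TypeA' and duration>4").count()                 // stable across repeated runs
{code}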

 

 

 

 

 

  was:
   edit 2021-05-18

I have make it  simpler to reproduce; I've put already generated data on s3 
bucket that is publicly available with 24.000.000 records

Now all you need to do is run this code:
{code:java}
import org.apache.spark.sql.expressions.Window
import org.a

[jira] [Updated] (SPARK-35089) non consistent results running count for same dataset after filter and lead window function

2021-05-18 Thread Domagoj (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Domagoj updated SPARK-35089:

Description: 
   edit 2021-05-18

I have made it simpler to reproduce; I've put already generated data on an s3 
bucket that is publicly available, with 24.000.000 records.

Now all you need to do is run this code:
{code:java}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
val w = Window.partitionBy("user").orderBy("start")
val ts_lead = coalesce(lead("start", 1) .over(w), lit(3000))
spark.read.orc("s3://dtonzetic-spark-sample-data/sample-data.orc").
 withColumn("end", ts_lead).
 withColumn("duration", col("end")-col("start")).
 where("type='TypeA' and duration>4").count()
{code}
 

these were my results:
 - run 1: 2547559
 - run 2: 2547559
 - run 3: 2547560
 - run 4: 2547558
 - run 5: 2547558
 - run 6: 2547559
 - run 7: 2547558

These results are from a new EMR cluster, version 6.3.0, so nothing has changed.

   end edit 2021-05-18

I have found an inconsistency in count() results after a lead window function 
and a filter.

 

I have a dataframe (this is a simplified version, but it's enough to reproduce) 
with millions of records, with these columns:
 * df1:
 ** start(timestamp)
 ** user_id(int)
 ** type(string)

I need to define the duration between two rows, and filter on that duration and 
type. I used the lead window function to get the next event time (which defines 
the end of the current event), so every row now gets start and stop times. If it 
is NULL (the last row, for example), the next midnight is added as the stop. Data 
is stored in an ORC file (tried with Parquet format, no difference).

This only happens with multiple cluster nodes, for example an AWS EMR cluster or 
a local docker cluster setup. If I run it on a single instance (locally on a 
laptop), I get consistent results every time. The Spark version is 3.0.1, both in 
AWS and in the local docker setup.

Here is some simple code that you can use to reproduce it; I've used a JupyterLab 
notebook on AWS EMR. The Spark version is 3.0.1.

 

 
{code:java}
import org.apache.spark.sql.expressions.Window

// this dataframe generation code should be executed only once, and data have 
to be saved, and then opened from disk, so it's always same.

val getRandomUser = udf(()=>{
val users = Seq("John","Eve","Anna","Martin","Joe","Steve","Katy")
   users(scala.util.Random.nextInt(7))
})

val getRandomType = udf(()=>{
val types = Seq("TypeA","TypeB","TypeC","TypeD","TypeE")
types(scala.util.Random.nextInt(5))
})

val getRandomStart = udf((x:Int)=>{
x+scala.util.Random.nextInt(47)
})
// for loop is used to avoid out of memory error during creation of dataframe
for( a <- 0 to 23){
// use iterator a to continue with next million, repeat 1 mil times
val x=Range(a*100,(a*100)+100).toDF("id").
withColumn("start",getRandomStart(col("id"))).
withColumn("user",getRandomUser()).
withColumn("type",getRandomType()).
drop("id")

x.write.mode("append").orc("hdfs:///random.orc")
}

// above code should be run only once, I used a cell in Jupyter

// define window and lead
val w = Window.partitionBy("user").orderBy("start")
// if null, replace with 30.000.000
val ts_lead = coalesce(lead("start", 1) .over(w), lit(3000))

// read data to dataframe, create stop column and calculate duration
val fox2 = spark.read.orc("hdfs:///random.orc").
withColumn("end", ts_lead).
withColumn("duration", col("end")-col("start"))


// repeated executions of this line returns different results for count 
// I have it in separate cell in JupyterLab
fox2.where("type='TypeA' and duration>4").count()
{code}
My results for three consecutive runs of the last line were:
 * run 1: 2551259
 * run 2: 2550756
 * run 3: 2551279

It's very important to say that if I use either filter:

fox2.where("type='TypeA' ")

or 

fox2.where("duration>4"),

each of them can be executed repeatedly and I get a consistent result every time.

I can save the dataframe after creating the stop and duration columns, and after 
that, I get consistent results every time.

It is not a very practical workaround, as I need a lot of space and time to 
implement it.

This dataset is really big (in my eyes at least, approx. 100.000.000 new records 
per day).

If I run this same example on my local machine using master = local[*], 
everything works as expected; it only happens with a cluster setup. I tried to 
create a cluster using docker on my local machine, created 3.0.1 and 3.1.1 
clusters with one master and two workers, and successfully reproduced the issue.

 

 

 

 

 

  was:
   edit 2021-05-18

I have make it  simpler to reproduce; I've put already generated data on s3 
bucket that is publicly available with 24.000.000 records

Now all you need to do is run this code:
{code:java}
import org.apache.spark.sql.expressions.Window
import org.apa

[jira] [Commented] (SPARK-35089) non consistent results running count for same dataset after filter and lead window function

2021-05-18 Thread Domagoj (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346704#comment-17346704
 ] 

Domagoj commented on SPARK-35089:
-

I've made the test data available via a public s3 bucket, so it's easier to 
reproduce now.

> non consistent results running count for same dataset after filter and lead 
> window function
> ---
>
> Key: SPARK-35089
> URL: https://issues.apache.org/jira/browse/SPARK-35089
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1, 3.1.1
>Reporter: Domagoj
>Priority: Major
>
>    edit 2021-05-18
> I have made it simpler to reproduce; I've put already generated data on an s3 
> bucket that is publicly available, with 24.000.000 records.
> Now all you need to do is run this code:
> {code:java}
> import org.apache.spark.sql.expressions.Window
> import org.apache.spark.sql._
> import org.apache.spark.sql.functions._
> val w = Window.partitionBy("user").orderBy("start")
> val ts_lead = coalesce(lead("start", 1) .over(w), lit(3000))
> spark.read.orc("s3://dtonzetic-spark-sample-data/sample-data.orc").
>  withColumn("end", ts_lead).
>  withColumn("duration", col("end")-col("start")).
>  where("type='TypeA' and duration>4").count()
> {code}
>  
> these were my results:
> - run 1: 2547559
> - run 2: 2547559
> - run 3: 2547560
> - run 4: 2547558
> - run 5: 2547558
> - run 6: 2547559
> - run 7: 2547558
>    end edit 2021-05-18
> I have found an inconsistency in count() results after a lead window function 
> and a filter.
>  
> I have a dataframe (this is a simplified version, but it's enough to reproduce) 
> with millions of records, with these columns:
>  * df1:
>  ** start(timestamp)
>  ** user_id(int)
>  ** type(string)
> I need to define the duration between two rows, and filter on that duration and 
> type. I used the lead window function to get the next event time (which defines 
> the end of the current event), so every row now gets start and stop times. If it 
> is NULL (the last row, for example), the next midnight is added as the stop. 
> Data is stored in an ORC file (tried with Parquet format, no difference).
> This only happens with multiple cluster nodes, for example an AWS EMR cluster or 
> a local docker cluster setup. If I run it on a single instance (locally on a 
> laptop), I get consistent results every time. The Spark version is 3.0.1, both 
> in AWS and in the local docker setup.
> Here is some simple code that you can use to reproduce it; I've used a 
> JupyterLab notebook on AWS EMR. The Spark version is 3.0.1.
>  
>  
> {code:java}
> import org.apache.spark.sql.expressions.Window
> // this dataframe generation code should be executed only once; the data has to
> // be saved and then opened from disk, so it is always the same
> val getRandomUser = udf(()=>{
> val users = Seq("John","Eve","Anna","Martin","Joe","Steve","Katy")
>users(scala.util.Random.nextInt(7))
> })
> val getRandomType = udf(()=>{
> val types = Seq("TypeA","TypeB","TypeC","TypeD","TypeE")
> types(scala.util.Random.nextInt(5))
> })
> val getRandomStart = udf((x:Int)=>{
> x+scala.util.Random.nextInt(47)
> })
> // for loop is used to avoid out of memory error during creation of dataframe
> for( a <- 0 to 23){
> // use iterator a to continue with next million, repeat 1 mil times
> val x = Range(a*1000000, (a*1000000)+1000000).toDF("id").
> withColumn("start",getRandomStart(col("id"))).
> withColumn("user",getRandomUser()).
> withColumn("type",getRandomType()).
> drop("id")
> x.write.mode("append").orc("hdfs:///random.orc")
> }
> // above code should be run only once, I used a cell in Jupyter
> // define window and lead
> val w = Window.partitionBy("user").orderBy("start")
> // if null, replace with 30.000.000
> val ts_lead = coalesce(lead("start", 1).over(w), lit(30000000))
> // read data to dataframe, create stop column and calculate duration
> val fox2 = spark.read.orc("hdfs:///random.orc").
> withColumn("end", ts_lead).
> withColumn("duration", col("end")-col("start"))
> // repeated executions of this line returns different results for count 
> // I have it in separate cell in JupyterLab
> fox2.where("type='TypeA' and duration>4").count()
> {code}
> My results for three consecutive runs of the last line were:
>  * run 1: 2551259
>  * run 2: 2550756
>  * run 3: 2551279
> It's very important to say that if I use either filter alone:
> fox2.where("type='TypeA' ")
> or 
> fox2.where("duration>4"),
>  
> each of them can be executed repeatedly and I get a consistent result every 
> time.
> I can save the dataframe after creating the stop and duration columns, and after 
> that, I get consistent results every time.
> It is not a very practical workaround, as I need a lot of space and time to 
> impl

[jira] [Updated] (SPARK-35089) non consistent results running count for same dataset after filter and lead window function

2021-05-18 Thread Domagoj (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Domagoj updated SPARK-35089:

Description: 
   edit 2021-05-18

I have made it simpler to reproduce; I've put the already generated data on an s3 
bucket that is publicly available, with 24.000.000 records.

Now all you need to do is run this code:
{code:java}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
val w = Window.partitionBy("user").orderBy("start")
val ts_lead = coalesce(lead("start", 1).over(w), lit(30000000))
spark.read.orc("s3://dtonzetic-spark-sample-data/sample-data.orc").
 withColumn("end", ts_lead).
 withColumn("duration", col("end")-col("start")).
 where("type='TypeA' and duration>4").count()
{code}
 

these were my results:

- run 1: 2547559
- run 2: 2547559
- run 3: 2547560
- run 4: 2547558
- run 5: 2547558
- run 6: 2547559
- run 7: 2547558

   end edit 2021-05-18

I have found an inconsistency in count results after a lead window function and a 
filter.

 

I have a dataframe (this is a simplified version, but it's enough to reproduce) 
with millions of records, with these columns:
 * df1:
 ** start(timestamp)
 ** user_id(int)
 ** type(string)

I need to define the duration between two rows, and filter on that duration and 
type. I used the lead window function to get the next event time (which defines 
the end of the current event), so every row now gets start and stop times. If it 
is NULL (the last row, for example), the next midnight is added as the stop. Data 
is stored in an ORC file (tried with Parquet format, no difference).

This only happens with multiple cluster nodes, for example an AWS EMR cluster or 
a local Docker cluster setup. If I run it on a single instance (locally on a 
laptop), I get consistent results every time. Spark version is 3.0.1, both in AWS 
and in the local Docker setup.

Here is some simple code that you can use to reproduce it; I've used a JupyterLab 
notebook on AWS EMR. Spark version is 3.0.1.

 

 
{code:java}
import org.apache.spark.sql.expressions.Window

// this dataframe generation code should be executed only once; the data has to
// be saved and then opened from disk, so it is always the same

val getRandomUser = udf(()=>{
val users = Seq("John","Eve","Anna","Martin","Joe","Steve","Katy")
   users(scala.util.Random.nextInt(7))
})

val getRandomType = udf(()=>{
val types = Seq("TypeA","TypeB","TypeC","TypeD","TypeE")
types(scala.util.Random.nextInt(5))
})

val getRandomStart = udf((x:Int)=>{
x+scala.util.Random.nextInt(47)
})
// for loop is used to avoid out of memory error during creation of dataframe
for( a <- 0 to 23){
// use iterator a to continue with next million, repeat 1 mil times
val x = Range(a*1000000, (a*1000000)+1000000).toDF("id").
withColumn("start",getRandomStart(col("id"))).
withColumn("user",getRandomUser()).
withColumn("type",getRandomType()).
drop("id")

x.write.mode("append").orc("hdfs:///random.orc")
}

// above code should be run only once, I used a cell in Jupyter

// define window and lead
val w = Window.partitionBy("user").orderBy("start")
// if null, replace with 30.000.000
val ts_lead = coalesce(lead("start", 1).over(w), lit(30000000))

// read data to dataframe, create stop column and calculate duration
val fox2 = spark.read.orc("hdfs:///random.orc").
withColumn("end", ts_lead).
withColumn("duration", col("end")-col("start"))


// repeated executions of this line returns different results for count 
// I have it in separate cell in JupyterLab
fox2.where("type='TypeA' and duration>4").count()
{code}
My results for three consecutive runs of the last line were:
 * run 1: 2551259
 * run 2: 2550756
 * run 3: 2551279

It's very important to say that if I use either filter alone:

fox2.where("type='TypeA' ")

or 

fox2.where("duration>4"),

 

each of them can be executed repeatedly and I get a consistent result every time.

I can save the dataframe after creating the stop and duration columns, and after 
that, I get consistent results every time.
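
A minimal sketch of that workaround, building on the fox2 dataframe from the 
snippet above (the hdfs:///with_duration.orc path is just a placeholder, not part 
of the original job):
{code:java}
// materialize the dataframe once the derived columns exist, then re-read it,
// so the lead/window computation is not re-evaluated for every count
fox2.write.mode("overwrite").orc("hdfs:///with_duration.orc")

val materialized = spark.read.orc("hdfs:///with_duration.orc")
materialized.where("type='TypeA' and duration>4").count()
{code}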

It is not a very practical workaround, as I need a lot of space and time to 
implement it.

This dataset is really big (in my eyes at least, approx. 100.000.000 new records 
per day).

If I run this same example on my local machine using master = local[*], 
everything works as expected; it only happens on a cluster setup. I tried to 
create a cluster using Docker on my local machine, created 3.0.1 and 3.1.1 
clusters with one master and two workers, and successfully reproduced the issue.

  was:
I have found an inconsistency in count results after a lead window function and a 
filter.

 

I have a dataframe (this is a simplified version, but it's enough to reproduce) 
with millions of records, with these columns:
 * df1:
 ** start(timestamp)
 ** user_id(int)
 ** type(string)

I need to define the duration between two rows, and filter on that d

[jira] [Created] (SPARK-35432) Expose TIMESTAMP_SECONDS, TIMESTAMP_MILLIS and TIMESTAMP_MICROS functions via Scala, Python and R APIs

2021-05-18 Thread Max Gekk (Jira)
Max Gekk created SPARK-35432:


 Summary: Expose TIMESTAMP_SECONDS, TIMESTAMP_MILLIS and 
TIMESTAMP_MICROS functions via Scala, Python and R APIs
 Key: SPARK-35432
 URL: https://issues.apache.org/jira/browse/SPARK-35432
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.2.0
Reporter: Max Gekk


The PR https://github.com/apache/spark/pull/28534 added the new functions 
TIMESTAMP_SECONDS, TIMESTAMP_MILLIS and TIMESTAMP_MICROS, but the functions are 
available to users only via SQL. To make the other APIs (Scala/Java, PySpark and 
R) as powerful as SQL, we need to implement the functions in those APIs too.
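
For illustration, today the functions can only be reached from Scala through SQL 
text (expr/selectExpr). The sketch below assumes a SparkSession named spark; the 
timestamp_seconds(...) call in the comment is the proposed wrapper shape, not an 
existing API:
{code:java}
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(1230219000L, 1280121000L).toDF("secs")

// today: the function is only reachable via SQL text
df.select(expr("TIMESTAMP_SECONDS(secs)").as("ts")).show()

// proposed (assumed shape): a first-class function in the Scala API, e.g.
// df.select(timestamp_seconds($"secs").as("ts"))
{code}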



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35425) Add note about Jinja2 as a required dependency for document build.

2021-05-18 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346682#comment-17346682
 ] 

Apache Spark commented on SPARK-35425:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/32579

> Add note about Jinja2 as a required dependency for document build.
> --
>
> Key: SPARK-35425
> URL: https://issues.apache.org/jira/browse/SPARK-35425
> Project: Spark
>  Issue Type: Improvement
>  Components: docs
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> SPARK-35375 confined the version of Jinja2 to <3.0.0,
> so it's good to note that in docs/README.md.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35351) Add code-gen for left anti sort merge join

2021-05-18 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-35351.
--
Fix Version/s: 3.2.0
 Assignee: Cheng Su
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/32547

> Add code-gen for left anti sort merge join
> --
>
> Key: SPARK-35351
> URL: https://issues.apache.org/jira/browse/SPARK-35351
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Cheng Su
>Assignee: Cheng Su
>Priority: Minor
> Fix For: 3.2.0
>
>
> This Jira tracks the progress of adding code-gen support for left anti sort 
> merge join. See the motivation in SPARK-34705.
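
For reference, a minimal query shape that exercises a left anti sort merge join 
(a sketch assuming a spark-shell session; disabling the broadcast threshold is 
just one way to steer the planner toward sort merge join):
{code:java}
import spark.implicits._

// keep the planner from choosing a broadcast join for this tiny example
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

val left  = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "v1")
val right = Seq((2, "x")).toDF("id", "v2")

// rows of `left` whose id has no match in `right`
val anti = left.join(right, Seq("id"), "left_anti")
anti.explain()  // expected to show SortMergeJoin ... LeftAnti
anti.show()
{code}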



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35256) Subexpression elimination leading to a performance regression

2021-05-18 Thread Ondrej Kokes (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ondrej Kokes updated SPARK-35256:
-
Summary: Subexpression elimination leading to a performance regression  
(was: str_to_map + split performance regression)

> Subexpression elimination leading to a performance regression
> -
>
> Key: SPARK-35256
> URL: https://issues.apache.org/jira/browse/SPARK-35256
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Ondrej Kokes
>Priority: Minor
> Attachments: bisect_log.txt, bisect_timing.csv
>
>
> I'm seeing almost double the runtime between 3.0.1 and 3.1.1 in my pipeline 
> that does mostly str_to_map, split and a few other operations - all 
> projections, no joins or aggregations (it's here only to trigger the 
> pipeline). I cut it down to the simplest reproducible example I could - 
> anything I remove from this changes the runtime difference quite 
> dramatically. (even moving those two expressions from f.when to standalone 
> columns makes the difference disappear)
> {code:java}
> import time
> import os
> 
> import pyspark
> from pyspark.sql import SparkSession
> import pyspark.sql.functions as f
> 
> if __name__ == '__main__':
>     print(pyspark.__version__)
>     spark = SparkSession.builder.getOrCreate()
> 
>     filename = 'regression.csv'
>     if not os.path.isfile(filename):
>         with open(filename, 'wt') as fw:
>             fw.write('foo\n')
>             for _ in range(10_000_000):
>                 fw.write('foo=bar&baz=bak&bar=f,o,1:2:3\n')
> 
>     df = spark.read.option('header', True).csv(filename)
> 
>     t = time.time()
>     dd = (df
>           .withColumn('my_map', f.expr('str_to_map(foo, "&", "=")'))
>           .withColumn('extracted',
>                       # without this top level split it is only 50% slower,
>                       # with it the runtime almost doubles
>                       f.split(f.split(f.col("my_map")["bar"], ",")[2], ":")[0]
>                       )
>           .select(
>               f.when(
>                   f.col("extracted").startswith("foo"), f.col("extracted")
>               ).otherwise(
>                   f.concat(f.lit("foo"), f.col("extracted"))
>               ).alias("foo")
>           )
>     )
>     # dd.explain(True)
>     _ = dd.groupby("foo").count().count()
> 
>     print("elapsed", time.time() - t)
> {code}
> Running this in 3.0.1 and 3.1.1 respectively (both installed from PyPI, on my 
> local macOS)
> {code:java}
> 3.0.1
> elapsed 21.262351036071777
> 3.1.1
> elapsed 40.26582884788513
> {code}
> (Meaning the transformation took 21 seconds in 3.0.1 and 40 seconds in 3.1.1)
> Feel free to make the CSV smaller to get a quicker feedback loop - it scales 
> linearly (I developed this with 2M rows).
> It might be related to my previous issue - SPARK-32989 - there are similar 
> operations, nesting etc. (splitting on the original column, not on a map, 
> makes the difference disappear)
> I tried dissecting the queries in SparkUI and via explain, but both 3.0.1 and 
> 3.1.1 produced identical plans.
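
If subexpression elimination is indeed the culprit, one way to A/B test it (an 
assumption on my part, not something verified in this report) is to toggle the 
spark.sql.subexpressionElimination.enabled SQL conf and re-run the same job; a 
minimal sketch:
{code:java}
// re-run the pipeline with subexpression elimination switched off and compare
// the elapsed time against the default run (the same call works from PySpark)
spark.conf.set("spark.sql.subexpressionElimination.enabled", "false")

// ... run the transformation and the groupby/count from the snippet above ...

// restore the default afterwards
spark.conf.set("spark.sql.subexpressionElimination.enabled", "true")
{code}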



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35420) Replace the usage of toStringHelper with ToStringBuilder

2021-05-18 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-35420:
---
Fix Version/s: 3.2.0

> Replace the usage of toStringHelper with ToStringBuilder
> 
>
> Key: SPARK-35420
> URL: https://issues.apache.org/jira/browse/SPARK-35420
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 3.2.0
>
>
> SPARK-30272 removed the usage of Guava APIs that break in Guava 27, but 
> toStringHelper has been introduced again.
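
For illustration, the kind of mechanical swap involved (a sketch with a made-up 
Endpoint class; the builder APIs shown are the standard Guava and Commons Lang 3 
ones):
{code:java}
// before: Guava's MoreObjects.toStringHelper, the helper this ticket removes
// import com.google.common.base.MoreObjects
// override def toString: String =
//   MoreObjects.toStringHelper(this).add("host", host).add("port", port).toString

// after: commons-lang3's ToStringBuilder
import org.apache.commons.lang3.builder.{ToStringBuilder, ToStringStyle}

class Endpoint(host: String, port: Int) {
  override def toString: String =
    new ToStringBuilder(this, ToStringStyle.SHORT_PREFIX_STYLE)
      .append("host", host)
      .append("port", port)
      .toString
}
{code}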



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35425) Add note about Jinja2 as a required dependency for document build.

2021-05-18 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta resolved SPARK-35425.

Target Version/s: 3.2.0
  Resolution: Fixed

This issue was resolved in https://github.com/apache/spark/pull/32573.

> Add note about Jinja2 as a required dependency for document build.
> --
>
> Key: SPARK-35425
> URL: https://issues.apache.org/jira/browse/SPARK-35425
> Project: Spark
>  Issue Type: Improvement
>  Components: docs
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> SPARK-35375 confined the version of Jinja2 to <3.0.0,
> so it's good to note that in docs/README.md.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org