[jira] [Commented] (SPARK-34674) Spark app on k8s doesn't terminate without call to sparkContext.stop() method
[ https://issues.apache.org/jira/browse/SPARK-34674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347334#comment-17347334 ] Wenchen Fan commented on SPARK-34674: - This has been reverted. [~dongjoon] is there any context? > Spark app on k8s doesn't terminate without call to sparkContext.stop() method > - > > Key: SPARK-34674 > URL: https://issues.apache.org/jira/browse/SPARK-34674 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.1.1 >Reporter: Sergey Kotlov >Assignee: Sergey Kotlov >Priority: Major > Fix For: 3.1.2, 3.2.0 > > > Hello! > I have run into a problem: if I don't call sparkContext.stop() explicitly, the Spark driver process doesn't terminate even after its main method has completed. This behaviour differs from Spark on YARN, where manually stopping the sparkContext is not required. > It looks like the problem is caused by non-daemon threads, which prevent the driver JVM process from terminating. > I see at least two non-daemon threads if I don't call sparkContext.stop(): > {code:java} > Thread[OkHttp kubernetes.default.svc,5,main] > Thread[OkHttp kubernetes.default.svc Writer,5,main] > {code} > Could you please tell if it is possible to solve this problem? > The Docker image from the official release of spark-3.1.1 hadoop3.2 is used. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-34674) Spark app on k8s doesn't terminate without call to sparkContext.stop() method
[ https://issues.apache.org/jira/browse/SPARK-34674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-34674: Comment: was deleted (was: This has been reverted. [~dongjoon] is there any context?) > Spark app on k8s doesn't terminate without call to sparkContext.stop() method > - > > Key: SPARK-34674 > URL: https://issues.apache.org/jira/browse/SPARK-34674 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.1.1 >Reporter: Sergey Kotlov >Assignee: Sergey Kotlov >Priority: Major > Fix For: 3.1.2, 3.2.0 > > > Hello! > I have run into a problem: if I don't call sparkContext.stop() explicitly, the Spark driver process doesn't terminate even after its main method has completed. This behaviour differs from Spark on YARN, where manually stopping the sparkContext is not required. > It looks like the problem is caused by non-daemon threads, which prevent the driver JVM process from terminating. > I see at least two non-daemon threads if I don't call sparkContext.stop(): > {code:java} > Thread[OkHttp kubernetes.default.svc,5,main] > Thread[OkHttp kubernetes.default.svc Writer,5,main] > {code} > Could you please tell if it is possible to solve this problem? > The Docker image from the official release of spark-3.1.1 hadoop3.2 is used. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
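The root cause named in the description — non-daemon threads pinning the JVM — can be demonstrated without Spark at all. A minimal Java sketch (class and method names are illustrative, not Spark or OkHttp code): a JVM exits only when every remaining thread is a daemon, so any library thread left in the default non-daemon state, like the "OkHttp kubernetes.default.svc" threads listed above, keeps the process alive until it is explicitly shut down.

```java
// A JVM exits only when all remaining threads are daemons. The OkHttp
// threads from the kubernetes-client are non-daemon, so they pin the
// driver process until sparkContext.stop() shuts the client down.
public class NonDaemonDemo {

    // True if this thread would keep the JVM alive after main() returns.
    static boolean keepsJvmAlive(Thread t) {
        return !t.isDaemon();
    }

    public static void main(String[] args) {
        Thread daemon = new Thread(() -> {}, "background-helper");
        daemon.setDaemon(true);

        // Threads are non-daemon by default -- the same state the
        // kubernetes-client's OkHttp threads are left in.
        Thread nonDaemon = new Thread(() -> {}, "OkHttp kubernetes.default.svc");

        System.out.println(keepsJvmAlive(daemon));    // false
        System.out.println(keepsJvmAlive(nonDaemon)); // true
    }
}
```

On the affected version, wrapping the application body in try { ... } finally { sparkContext.stop() } works around the hang; the Fix For versions above (3.1.2, 3.2.0) suggest the explicit call is no longer needed there, though the later "reverted" comment leaves that uncertain.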
[jira] [Resolved] (SPARK-35421) Remove redundant ProjectExec from streaming queries with V2Relation
[ https://issues.apache.org/jira/browse/SPARK-35421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-35421. - Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 32570 [https://github.com/apache/spark/pull/32570] > Remove redundant ProjectExec from streaming queries with V2Relation > --- > > Key: SPARK-35421 > URL: https://issues.apache.org/jira/browse/SPARK-35421 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > Fix For: 3.2.0 > > > Streaming queries with V2Relation can have a redundant ProjectExec in their physical plan. > You can easily reproduce it with the following code. > {code} > import org.apache.spark.sql.streaming.Trigger > val query = spark. > readStream. > format("rate"). > option("rowsPerSecond", 1000). > option("rampUpTime", "10s"). > load(). > selectExpr("timestamp", "100", "value"). > writeStream. > format("console"). > trigger(Trigger.ProcessingTime("5 seconds")). > // trigger(Trigger.Continuous("5 seconds")). // You can reproduce it with continuous processing too. > outputMode("append"). > start() > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
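What "removing a redundant ProjectExec" means can be sketched with a toy plan tree. This is a hypothetical model for illustration, not Spark's planner or its ProjectExec/SparkPlan classes: a projection node that emits exactly its child's output columns does no work, so an optimizer rule can replace it with the child directly.

```java
import java.util.List;

// Toy logical plan -- hypothetical classes, not Spark's physical plan nodes.
interface Plan { List<String> output(); }

record Scan(List<String> output) implements Plan { }
record Project(List<String> output, Plan child) implements Plan { }

public class RemoveRedundantProject {

    // A Project that emits exactly its child's columns adds no work;
    // replace it with the child. Applied bottom-up over the tree.
    public static Plan simplify(Plan plan) {
        if (plan instanceof Project p) {
            Plan child = simplify(p.child());
            if (p.output().equals(child.output())) {
                return child; // redundant projection: drop it
            }
            return new Project(p.output(), child);
        }
        return plan;
    }

    public static void main(String[] args) {
        Plan scan = new Scan(List.of("timestamp", "value"));
        Plan redundant = new Project(List.of("timestamp", "value"), scan);
        System.out.println(simplify(redundant)); // the Scan itself
    }
}
```

In the rate-source reproduction above, the selectExpr already shapes the output, so a second projection emitting the same columns is pure overhead per micro-batch.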
[jira] [Updated] (SPARK-35439) Children subexpr should come first than parent subexpr in subexpression elimination
[ https://issues.apache.org/jira/browse/SPARK-35439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh updated SPARK-35439: Description: EquivalentExpressions maintains a map of equivalent expressions. It is HashMap now so the insertion order is not guaranteed to be preserved later. Subexpression elimination relies on retrieving subexpressions from the map. If there is child-parent relationships among the subexpressions, we want the child expressions come first than parent expressions, so we can replace child expressions in parent expressions with subexpression evaluation. For example, we have two different expressions Add(Literal(1), Literal(2)) and Add(Literal(3), add). Case 1: child subexpr comes first. Replacing HashMap with LinkedHashMap can deal with it. addExprTree(add) addExprTree(Add(Literal(3), add)) addExprTree(Add(Literal(3), add)) Case 2: parent subexpr comes first. For this case, we need to sort equivalent expressions. addExprTree(Add(Literal(3), add)) => We add `Add(Literal(3), add)` into the map first, then add `add` into the map addExprTree(add) addExprTree(Add(Literal(3), add)) was: EquivalentExpressions maintains a map of equivalent expressions. It is HashMap now so the insertion order is not guaranteed to be preserved later. Subexpression elimination relies on retrieving subexpressions from the map. If there is child-parent relationships among the subexpressions, we want the child expressions come first than parent expressions, so we can replace child expressions in parent expressions with subexpression evaluation. Although we add expressions recursively into the map with depth-first approach, when we retrieve the map values, it is not guaranteed that the order is preserved. We should use LinkedHashMap for this usage. 
> Children subexpr should come first than parent subexpr in subexpression > elimination > --- > > Key: SPARK-35439 > URL: https://issues.apache.org/jira/browse/SPARK-35439 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.2, 3.1.1, 3.2.0 >Reporter: L. C. Hsieh >Assignee: Apache Spark >Priority: Minor > > EquivalentExpressions maintains a map of equivalent expressions. It is > HashMap now so the insertion order is not guaranteed to be preserved later. > Subexpression elimination relies on retrieving subexpressions from the map. > If there is child-parent relationships among the subexpressions, we want the > child expressions come first than parent expressions, so we can replace child > expressions in parent expressions with subexpression evaluation. > For example, we have two different expressions Add(Literal(1), Literal(2)) > and Add(Literal(3), add). > Case 1: child subexpr comes first. Replacing HashMap with LinkedHashMap can > deal with it. > addExprTree(add) > addExprTree(Add(Literal(3), add)) > addExprTree(Add(Literal(3), add)) > Case 2: parent subexpr comes first. For this case, we need to sort equivalent > expressions. > addExprTree(Add(Literal(3), add)) => We add `Add(Literal(3), add)` into the > map first, then add `add` into the map > addExprTree(add) > addExprTree(Add(Literal(3), add)) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
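The two cases in the description can be illustrated with plain Java (a toy model, not Spark's EquivalentExpressions): LinkedHashMap preserves insertion order, which covers case 1, where depth-first insertion already puts the child first; case 2 additionally needs the retrieved subexpressions sorted so children precede the parents that embed them, here approximated by expression height (a hypothetical metric for illustration).

```java
import java.util.*;

// Toy model of the subexpression-ordering problem; names are illustrative.
public class SubexprOrder {

    // height 0 = leaf; a child always has a smaller height than its parent.
    record Expr(String name, int height) { }

    // Case 2 fix: order retrieved subexpressions so children come first.
    static List<Expr> childrenFirst(Collection<Expr> exprs) {
        List<Expr> sorted = new ArrayList<>(exprs);
        sorted.sort(Comparator.comparingInt(Expr::height));
        return sorted;
    }

    public static void main(String[] args) {
        Expr child = new Expr("add = Add(Literal(1), Literal(2))", 1);
        Expr parent = new Expr("Add(Literal(3), add)", 2);

        // Case 1: LinkedHashMap iterates keys in insertion order, so adding
        // the child first keeps it first on retrieval; a plain HashMap
        // makes no such promise.
        Map<Expr, Integer> map = new LinkedHashMap<>();
        map.put(child, 1);
        map.put(parent, 1);
        System.out.println(new ArrayList<>(map.keySet()).get(0) == child); // true

        // Case 2: the parent was inserted first, so sort by height instead.
        System.out.println(childrenFirst(List.of(parent, child)).get(0) == child); // true
    }
}
```

With children first, the code generator can substitute the already-evaluated child result inside the parent's evaluation, which is the point of subexpression elimination.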
[jira] [Updated] (SPARK-35439) Children subexpr should come first than parent subexpr in subexpression elimination
[ https://issues.apache.org/jira/browse/SPARK-35439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh updated SPARK-35439: Priority: Major (was: Minor) > Children subexpr should come first than parent subexpr in subexpression > elimination > --- > > Key: SPARK-35439 > URL: https://issues.apache.org/jira/browse/SPARK-35439 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.2, 3.1.1, 3.2.0 >Reporter: L. C. Hsieh >Assignee: Apache Spark >Priority: Major > > EquivalentExpressions maintains a map of equivalent expressions. It is > HashMap now so the insertion order is not guaranteed to be preserved later. > Subexpression elimination relies on retrieving subexpressions from the map. > If there is child-parent relationships among the subexpressions, we want the > child expressions come first than parent expressions, so we can replace child > expressions in parent expressions with subexpression evaluation. > For example, we have two different expressions Add(Literal(1), Literal(2)) > and Add(Literal(3), add). > Case 1: child subexpr comes first. Replacing HashMap with LinkedHashMap can > deal with it. > addExprTree(add) > addExprTree(Add(Literal(3), add)) > addExprTree(Add(Literal(3), add)) > Case 2: parent subexpr comes first. For this case, we need to sort equivalent > expressions. > addExprTree(Add(Literal(3), add)) => We add `Add(Literal(3), add)` into the > map first, then add `add` into the map > addExprTree(add) > addExprTree(Add(Literal(3), add)) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35439) Children subexpr should come first than parent subexpr in subexpression elimination
[ https://issues.apache.org/jira/browse/SPARK-35439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh updated SPARK-35439: Affects Version/s: (was: 3.1.1) (was: 3.0.2) > Children subexpr should come first than parent subexpr in subexpression > elimination > --- > > Key: SPARK-35439 > URL: https://issues.apache.org/jira/browse/SPARK-35439 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: L. C. Hsieh >Assignee: Apache Spark >Priority: Major > > EquivalentExpressions maintains a map of equivalent expressions. It is > HashMap now so the insertion order is not guaranteed to be preserved later. > Subexpression elimination relies on retrieving subexpressions from the map. > If there is child-parent relationships among the subexpressions, we want the > child expressions come first than parent expressions, so we can replace child > expressions in parent expressions with subexpression evaluation. > For example, we have two different expressions Add(Literal(1), Literal(2)) > and Add(Literal(3), add). > Case 1: child subexpr comes first. Replacing HashMap with LinkedHashMap can > deal with it. > addExprTree(add) > addExprTree(Add(Literal(3), add)) > addExprTree(Add(Literal(3), add)) > Case 2: parent subexpr comes first. For this case, we need to sort equivalent > expressions. > addExprTree(Add(Literal(3), add)) => We add `Add(Literal(3), add)` into the > map first, then add `add` into the map > addExprTree(add) > addExprTree(Add(Literal(3), add)) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35439) Children subexpr should come first than parent subexpr in subexpression elimination
[ https://issues.apache.org/jira/browse/SPARK-35439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh updated SPARK-35439: Summary: Children subexpr should come first than parent subexpr in subexpression elimination (was: Use LinkedHashMap as the map of equivalent expressions to preserve insertion order) > Children subexpr should come first than parent subexpr in subexpression > elimination > --- > > Key: SPARK-35439 > URL: https://issues.apache.org/jira/browse/SPARK-35439 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.2, 3.1.1, 3.2.0 >Reporter: L. C. Hsieh >Assignee: Apache Spark >Priority: Minor > > EquivalentExpressions maintains a map of equivalent expressions. It is > HashMap now so the insertion order is not guaranteed to be preserved later. > Subexpression elimination relies on retrieving subexpressions from the map. > If there is child-parent relationships among the subexpressions, we want the > child expressions come first than parent expressions, so we can replace child > expressions in parent expressions with subexpression evaluation. > Although we add expressions recursively into the map with depth-first > approach, when we retrieve the map values, it is not guaranteed that the > order is preserved. We should use LinkedHashMap for this usage. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35129) Construct year-month interval column from integral fields
[ https://issues.apache.org/jira/browse/SPARK-35129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347306#comment-17347306 ] Max Gekk commented on SPARK-35129: -- [~mrpowers] Are you working on this? > Construct year-month interval column from integral fields > - > > Key: SPARK-35129 > URL: https://issues.apache.org/jira/browse/SPARK-35129 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Priority: Major > > Create new function similar to make_interval() (or extend the make_interval() > function) which can construct YearMonthIntervalType values from the year, > month fields. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-35291) NullPointerException at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.CreateExternalRow_0$(Unknown Source)
[ https://issues.apache.org/jira/browse/SPARK-35291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347302#comment-17347302 ] Umar Asir edited comment on SPARK-35291 at 5/19/21, 5:19 AM: - It's an issue with Delta, but the null pointer is thrown from "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.CreateExternalRow_0$(Unknown Source)" was (Author: umerasir): It's an issue with delta. > NullPointerException at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.CreateExternalRow_0$(Unknown Source) > > > Key: SPARK-35291 > URL: https://issues.apache.org/jira/browse/SPARK-35291 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Affects Versions: 2.3.1, 3.0.2 >Reporter: Umar Asir >Priority: Major > Attachments: NotNullIssue.scala, cdwqasourceupdate.7z, cdwqatgtupdate.7z, pom.xml, run1.log > > > We are trying to merge data using DeltaTable's merge API. Inserting a null value into a not-null column results in a NullPointerException instead of a constraint violation error. > *Code :* > {code:java} > package com.uasir.cdw.delta > import org.apache.spark.sql._ > import io.delta.tables._ > object NotNullIssue { > def main(args: Array[String]): Unit = { > System.setProperty("hadoop.home.dir", "C:\\Tools\\hadoop\\") > val spark = SparkSession > .builder() > .appName("DFMergeTest") > .master("local[*]") > .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") > .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") > .config("spark.testing.memory", "571859200") > .getOrCreate() > println("Reading from the source table") > val df = spark.read.format("delta").load("C:\\Input\\cdwqasourceupdate") > // PFA cdwqasourceupdate.7z > df.show() > println("Reading from the target table") > val tgtDf = spark.read.format("delta").load("C:\\Input\\cdwqatgtupdate") > // PFA cdwqatgtupdate.7z > 
tgtDf.show() > val sourceTable = "source" > val targetDataTable = "target" > val colMap= scala.collection.mutable.Map[String,String]() > val sourceFields = df.schema.fieldNames > val targetFields = tgtDf.schema.fieldNames > for ( i <- 0 until targetFields.length) { > colMap(targetFields(i)) = sourceTable + "." + sourceFields(i) > } > /* colMap will be generated as : > TGTID -> c1_ID > TGT_NAME -> c2_NAME > TGT_ADDRESS -> c3_address > TGT_DOB -> c4_dob > */ > println("update") > DeltaTable.forPath(spark, "C:\\Input\\cdwqatgtupdate") > .as(targetDataTable) > .merge( > df.as(sourceTable), > targetDataTable + "." + "TGTID" + " = " + sourceTable + "." + > "c1_ID" ) > .whenMatched() > .updateExpr(colMap) > .execute() > println("Reading from target the table after operation") > tgtDf.show() > } > } > {code} > *Error :* > {code:java} > Caused by: java.lang.RuntimeException: Error while decoding: > java.lang.NullPointerExceptionCaused by: java.lang.RuntimeException: Error > while decoding: java.lang.NullPointerExceptioncreateexternalrow(input[0, int, > false], input[1, string, false].toString, input[2, string, true].toString, > staticinvoke(class org.apache.spark.sql.catalyst.util.DateTimeUtils$, > ObjectType(class java.sql.Date), toJavaDate, input[3, date, true], true, > false), StructField(TGTID,IntegerType,false), > StructField(TGT_NAME,StringType,false), > StructField(TGT_ADDRESS,StringType,true), StructField(TGT_DOB,DateType,true)) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Deserializer.apply(ExpressionEncoder.scala:188) > at > org.apache.spark.sql.delta.commands.MergeIntoCommand$JoinedRowProcessor.$anonfun$processPartition$9(MergeIntoCommand.scala:565) > at scala.collection.Iterator$$anon$10.next(Iterator.scala:459) at > scala.collection.Iterator$$anon$10.next(Iterator.scala:459) at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown > Source) at > 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:265) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWrite
[jira] [Commented] (SPARK-35291) NullPointerException at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.CreateExternalRow_0$(Unknown Source)
[ https://issues.apache.org/jira/browse/SPARK-35291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347302#comment-17347302 ] Umar Asir commented on SPARK-35291: --- It's an issue with Delta. > NullPointerException at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.CreateExternalRow_0$(Unknown Source) > > > Key: SPARK-35291 > URL: https://issues.apache.org/jira/browse/SPARK-35291 > Project: Spark > Issue Type: Bug > Components: Optimizer, SQL >Affects Versions: 2.3.1, 3.0.2 >Reporter: Umar Asir >Priority: Major > Attachments: NotNullIssue.scala, cdwqasourceupdate.7z, cdwqatgtupdate.7z, pom.xml, run1.log > > > We are trying to merge data using DeltaTable's merge API. Inserting a null value into a not-null column results in a NullPointerException instead of a constraint violation error. > *Code :* > {code:java} > package com.uasir.cdw.delta > import org.apache.spark.sql._ > import io.delta.tables._ > object NotNullIssue { > def main(args: Array[String]): Unit = { > System.setProperty("hadoop.home.dir", "C:\\Tools\\hadoop\\") > val spark = SparkSession > .builder() > .appName("DFMergeTest") > .master("local[*]") > .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") > .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") > .config("spark.testing.memory", "571859200") > .getOrCreate() > println("Reading from the source table") > val df = spark.read.format("delta").load("C:\\Input\\cdwqasourceupdate") > // PFA cdwqasourceupdate.7z > df.show() > println("Reading from the target table") > val tgtDf = spark.read.format("delta").load("C:\\Input\\cdwqatgtupdate") > // PFA cdwqatgtupdate.7z > tgtDf.show() > val sourceTable = "source" > val targetDataTable = "target" > val colMap= scala.collection.mutable.Map[String,String]() > val sourceFields = df.schema.fieldNames > val targetFields = tgtDf.schema.fieldNames > for ( i <- 
0 until targetFields.length) { > colMap(targetFields(i)) = sourceTable + "." + sourceFields(i) > } > /* colMap will be generated as : > TGTID -> c1_ID > TGT_NAME -> c2_NAME > TGT_ADDRESS -> c3_address > TGT_DOB -> c4_dob > */ > println("update") > DeltaTable.forPath(spark, "C:\\Input\\cdwqatgtupdate") > .as(targetDataTable) > .merge( > df.as(sourceTable), > targetDataTable + "." + "TGTID" + " = " + sourceTable + "." + > "c1_ID" ) > .whenMatched() > .updateExpr(colMap) > .execute() > println("Reading from target the table after operation") > tgtDf.show() > } > } > {code} > *Error :* > {code:java} > Caused by: java.lang.RuntimeException: Error while decoding: > java.lang.NullPointerExceptionCaused by: java.lang.RuntimeException: Error > while decoding: java.lang.NullPointerExceptioncreateexternalrow(input[0, int, > false], input[1, string, false].toString, input[2, string, true].toString, > staticinvoke(class org.apache.spark.sql.catalyst.util.DateTimeUtils$, > ObjectType(class java.sql.Date), toJavaDate, input[3, date, true], true, > false), StructField(TGTID,IntegerType,false), > StructField(TGT_NAME,StringType,false), > StructField(TGT_ADDRESS,StringType,true), StructField(TGT_DOB,DateType,true)) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Deserializer.apply(ExpressionEncoder.scala:188) > at > org.apache.spark.sql.delta.commands.MergeIntoCommand$JoinedRowProcessor.$anonfun$processPartition$9(MergeIntoCommand.scala:565) > at scala.collection.Iterator$$anon$10.next(Iterator.scala:459) at > scala.collection.Iterator$$anon$10.next(Iterator.scala:459) at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown > Source) at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729) > at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at > 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:265) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:210) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at > org.apache.spark.scheduler.Task.run(Task.scala:127) at > org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:462) > at org.apache.spark.ut
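The stack trace above shows why the failure surfaces as a NullPointerException rather than a constraint-violation message: the generated deserializer trusts the not-null schema (note input[1, string, false].toString in the decoded expression) and dereferences the field without a null check. A toy Java sketch of that pattern (hypothetical names, not Spark's generated code):

```java
// Toy sketch of a row decoder that trusts a "not null" schema, mirroring
// the input[..., false].toString pattern in the stack trace above.
public class DecodeDemo {

    // The schema declares the column non-nullable, so the decoder
    // dereferences it with no null check.
    static String decodeNonNullable(Object[] row, int ordinal) {
        return row[ordinal].toString(); // NPE if the not-null contract was broken upstream
    }

    public static void main(String[] args) {
        System.out.println(decodeNonNullable(new Object[] {"ok"}, 0)); // ok

        try {
            // The merge wrote a null into a NOT NULL column; decoding fails
            // with an NPE instead of a clear constraint-violation error.
            decodeNonNullable(new Object[] {null}, 0);
        } catch (NullPointerException e) {
            System.out.println("NPE, not a constraint violation");
        }
    }
}
```

That is, the null check belongs at write time (where a constraint error could be raised); by read time the schema contract is already assumed to hold.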
[jira] [Assigned] (SPARK-35440) Add language type to `ExpressionInfo` for UDF
[ https://issues.apache.org/jira/browse/SPARK-35440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35440: Assignee: Apache Spark > Add language type to `ExpressionInfo` for UDF > - > > Key: SPARK-35440 > URL: https://issues.apache.org/jira/browse/SPARK-35440 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Linhong Liu >Assignee: Apache Spark >Priority: Major > > add "scala", "java", "python", "hive", "built-in" -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35440) Add language type to `ExpressionInfo` for UDF
[ https://issues.apache.org/jira/browse/SPARK-35440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35440: Assignee: (was: Apache Spark) > Add language type to `ExpressionInfo` for UDF > - > > Key: SPARK-35440 > URL: https://issues.apache.org/jira/browse/SPARK-35440 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Linhong Liu >Priority: Major > > add "scala", "java", "python", "hive", "built-in" -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35440) Add language type to `ExpressionInfo` for UDF
[ https://issues.apache.org/jira/browse/SPARK-35440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347299#comment-17347299 ] Apache Spark commented on SPARK-35440: -- User 'linhongliu-db' has created a pull request for this issue: https://github.com/apache/spark/pull/32587 > Add language type to `ExpressionInfo` for UDF > - > > Key: SPARK-35440 > URL: https://issues.apache.org/jira/browse/SPARK-35440 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Linhong Liu >Priority: Major > > add "scala", "java", "python", "hive", "built-in" -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35440) Add language type to `ExpressionInfo` for UDF
[ https://issues.apache.org/jira/browse/SPARK-35440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Linhong Liu updated SPARK-35440: Description: add "scala", "java", "python", "hive", "built-in" > Add language type to `ExpressionInfo` for UDF > - > > Key: SPARK-35440 > URL: https://issues.apache.org/jira/browse/SPARK-35440 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Linhong Liu >Priority: Major > > add "scala", "java", "python", "hive", "built-in" -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35440) Add language type to `ExpressionInfo` for UDF
Linhong Liu created SPARK-35440: --- Summary: Add language type to `ExpressionInfo` for UDF Key: SPARK-35440 URL: https://issues.apache.org/jira/browse/SPARK-35440 Project: Spark Issue Type: Task Components: SQL Affects Versions: 3.2.0 Reporter: Linhong Liu -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
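The description lists the proposed values. One way to model such a language tag is a simple enum; this is a hypothetical sketch for illustration only — the actual field added to ExpressionInfo by the PR may be shaped differently.

```java
import java.util.Locale;

public class UdfLanguage {

    // Hypothetical tag for the values listed in the description:
    // "scala", "java", "python", "hive", "built-in".
    enum Language { SCALA, JAVA, PYTHON, HIVE, BUILT_IN }

    // Normalize a user-facing string ("built-in") to the enum constant.
    static Language parse(String s) {
        return Language.valueOf(s.trim().toUpperCase(Locale.ROOT).replace('-', '_'));
    }

    public static void main(String[] args) {
        System.out.println(parse("built-in")); // BUILT_IN
        System.out.println(parse("scala"));    // SCALA
    }
}
```

A closed enum (rather than a free-form string) lets tooling such as DESCRIBE FUNCTION reject unknown language labels early.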
[jira] [Resolved] (SPARK-35398) Simplify the way to get classes from ClassBodyEvaluator in CodeGenerator.updateAndGetCompilationStats method
[ https://issues.apache.org/jira/browse/SPARK-35398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-35398. -- Fix Version/s: 3.2.0 Assignee: Yang Jie Resolution: Fixed Resolved by https://github.com/apache/spark/pull/32536 > Simplify the way to get classes from ClassBodyEvaluator in > CodeGenerator.updateAndGetCompilationStats method > > > Key: SPARK-35398 > URL: https://issues.apache.org/jira/browse/SPARK-35398 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Trivial > Fix For: 3.2.0 > > > SPARK-35253 upgraded janino from 3.0.16 to 3.1.4, {{ClassBodyEvaluator}} > provides the {{getBytecodes}} method to get > the mapping from {{ClassFile.getThisClassName}} to {{ClassFile.toByteArray}} > directly in this version and we don't need to get this variable by reflection > api anymore. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35439) Use LinkedHashMap as the map of equivalent expressions to preserve insertion order
[ https://issues.apache.org/jira/browse/SPARK-35439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35439: Assignee: Apache Spark (was: L. C. Hsieh) > Use LinkedHashMap as the map of equivalent expressions to preserve insertion > order > -- > > Key: SPARK-35439 > URL: https://issues.apache.org/jira/browse/SPARK-35439 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.2, 3.1.1, 3.2.0 >Reporter: L. C. Hsieh >Assignee: Apache Spark >Priority: Minor > > EquivalentExpressions maintains a map of equivalent expressions. It is > HashMap now so the insertion order is not guaranteed to be preserved later. > Subexpression elimination relies on retrieving subexpressions from the map. > If there is child-parent relationships among the subexpressions, we want the > child expressions come first than parent expressions, so we can replace child > expressions in parent expressions with subexpression evaluation. > Although we add expressions recursively into the map with depth-first > approach, when we retrieve the map values, it is not guaranteed that the > order is preserved. We should use LinkedHashMap for this usage. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35439) Use LinkedHashMap as the map of equivalent expressions to preserve insertion order
[ https://issues.apache.org/jira/browse/SPARK-35439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35439: Assignee: L. C. Hsieh (was: Apache Spark) > Use LinkedHashMap as the map of equivalent expressions to preserve insertion > order > -- > > Key: SPARK-35439 > URL: https://issues.apache.org/jira/browse/SPARK-35439 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.2, 3.1.1, 3.2.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Minor > > EquivalentExpressions maintains a map of equivalent expressions. It is > HashMap now so the insertion order is not guaranteed to be preserved later. > Subexpression elimination relies on retrieving subexpressions from the map. > If there is child-parent relationships among the subexpressions, we want the > child expressions come first than parent expressions, so we can replace child > expressions in parent expressions with subexpression evaluation. > Although we add expressions recursively into the map with depth-first > approach, when we retrieve the map values, it is not guaranteed that the > order is preserved. We should use LinkedHashMap for this usage. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35439) Use LinkedHashMap as the map of equivalent expressions to preserve insertion order
[ https://issues.apache.org/jira/browse/SPARK-35439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347286#comment-17347286 ] Apache Spark commented on SPARK-35439: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/32586 > Use LinkedHashMap as the map of equivalent expressions to preserve insertion > order > -- > > Key: SPARK-35439 > URL: https://issues.apache.org/jira/browse/SPARK-35439 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.2, 3.1.1, 3.2.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Minor > > EquivalentExpressions maintains a map of equivalent expressions. It is a > HashMap now, so the insertion order is not guaranteed to be preserved. > Subexpression elimination relies on retrieving subexpressions from the map. > If there are child-parent relationships among the subexpressions, we want the > child expressions to come before the parent expressions, so we can replace child > expressions in parent expressions with the subexpression evaluation. > Although we add expressions into the map recursively, with a depth-first > approach, when we retrieve the map values it is not guaranteed that the > order is preserved. We should use a LinkedHashMap for this usage. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35439) Use LinkedHashMap as the map of equivalent expressions to preserve insertion order
L. C. Hsieh created SPARK-35439: --- Summary: Use LinkedHashMap as the map of equivalent expressions to preserve insertion order Key: SPARK-35439 URL: https://issues.apache.org/jira/browse/SPARK-35439 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.1, 3.0.2, 3.2.0 Reporter: L. C. Hsieh Assignee: L. C. Hsieh EquivalentExpressions maintains a map of equivalent expressions. It is a HashMap now, so the insertion order is not guaranteed to be preserved. Subexpression elimination relies on retrieving subexpressions from the map. If there are child-parent relationships among the subexpressions, we want the child expressions to come before the parent expressions, so we can replace child expressions in parent expressions with the subexpression evaluation. Although we add expressions into the map recursively, with a depth-first approach, when we retrieve the map values it is not guaranteed that the order is preserved. We should use a LinkedHashMap for this usage. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
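The ordering guarantee at the heart of this change can be illustrated outside Spark: java.util.LinkedHashMap iterates its entries in insertion order, while a plain HashMap makes no such promise. A minimal, self-contained Java sketch (the expression-string keys are hypothetical stand-ins, not Spark's actual EquivalentExpressions API):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class InsertionOrderDemo {
    public static void main(String[] args) {
        // Child expressions are inserted before the parents that contain
        // them, mirroring the depth-first insertion described in the issue.
        Map<String, Integer> subexprs = new LinkedHashMap<>();
        subexprs.put("a + b", 1);
        subexprs.put("(a + b) * c", 2);
        subexprs.put("((a + b) * c) - d", 3);

        // LinkedHashMap preserves insertion order on retrieval, so the
        // child "a + b" is guaranteed to come before its parents and can
        // be substituted into them first.
        List<String> order = new ArrayList<>(subexprs.keySet());
        System.out.println(order); // [a + b, (a + b) * c, ((a + b) * c) - d]
    }
}
```

With a plain HashMap the retrieval order would depend on the keys' hash codes, which is exactly the nondeterminism the issue removes.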
[jira] [Resolved] (SPARK-35263) Refactor ShuffleBlockFetcherIteratorSuite to reduce duplicated code
[ https://issues.apache.org/jira/browse/SPARK-35263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan resolved SPARK-35263. - Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 32389 [https://github.com/apache/spark/pull/32389] > Refactor ShuffleBlockFetcherIteratorSuite to reduce duplicated code > --- > > Key: SPARK-35263 > URL: https://issues.apache.org/jira/browse/SPARK-35263 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Tests >Affects Versions: 3.1.1 >Reporter: Erik Krogen >Assignee: Erik Krogen >Priority: Major > Fix For: 3.2.0 > > > {{ShuffleFetcherBlockIteratorSuite}} has tons of duplicate code, like: > {code} > val iterator = new ShuffleBlockFetcherIterator( > taskContext, > transfer, > blockManager, > blocksByAddress, > (_, in) => in, > 48 * 1024 * 1024, > Int.MaxValue, > Int.MaxValue, > Int.MaxValue, > true, > false, > metrics, > false) > {code} > It's challenging to tell what the interesting parts are vs. what is just > being set to some default/unused value. > Similarly but not as bad, there are 10 calls like: > {code} > verify(transfer, times(1)).fetchBlocks(any(), any(), any(), any(), any(), > any()) > {code} > and 7 like > {code} > when(transfer.fetchBlocks(any(), any(), any(), any(), any(), > any())).thenAnswer ... 
> {code} > This can result in about 10% reduction in both lines and characters in the > file: > {code} > # Before > > wc > > core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala > 1063 3950 43201 > core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala > # After > > wc > > core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala > 928 3609 39053 > core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala > {code} > It also helps readability: > {code} > val iterator = createShuffleBlockIteratorWithDefaults( > transfer, > blocksByAddress, > maxBytesInFlight = 1000L > ) > {code} > Now I can clearly tell that {{maxBytesInFlight}} is the main parameter we're > interested in here. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35263) Refactor ShuffleBlockFetcherIteratorSuite to reduce duplicated code
[ https://issues.apache.org/jira/browse/SPARK-35263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mridul Muralidharan reassigned SPARK-35263: --- Assignee: Erik Krogen > Refactor ShuffleBlockFetcherIteratorSuite to reduce duplicated code > --- > > Key: SPARK-35263 > URL: https://issues.apache.org/jira/browse/SPARK-35263 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Tests >Affects Versions: 3.1.1 >Reporter: Erik Krogen >Assignee: Erik Krogen >Priority: Major > > {{ShuffleFetcherBlockIteratorSuite}} has tons of duplicate code, like: > {code} > val iterator = new ShuffleBlockFetcherIterator( > taskContext, > transfer, > blockManager, > blocksByAddress, > (_, in) => in, > 48 * 1024 * 1024, > Int.MaxValue, > Int.MaxValue, > Int.MaxValue, > true, > false, > metrics, > false) > {code} > It's challenging to tell what the interesting parts are vs. what is just > being set to some default/unused value. > Similarly but not as bad, there are 10 calls like: > {code} > verify(transfer, times(1)).fetchBlocks(any(), any(), any(), any(), any(), > any()) > {code} > and 7 like > {code} > when(transfer.fetchBlocks(any(), any(), any(), any(), any(), > any())).thenAnswer ... 
> {code} > This can result in about 10% reduction in both lines and characters in the > file: > {code} > # Before > > wc > > core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala > 1063 3950 43201 > core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala > # After > > wc > > core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala > 928 3609 39053 > core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala > {code} > It also helps readability: > {code} > val iterator = createShuffleBlockIteratorWithDefaults( > transfer, > blocksByAddress, > maxBytesInFlight = 1000L > ) > {code} > Now I can clearly tell that {{maxBytesInFlight}} is the main parameter we're > interested in here. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
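The refactoring idea in this ticket — a helper that supplies defaults so each test names only the parameter it actually exercises — can be sketched in plain Java with a small builder, since Java lacks Scala's named default arguments. The class, field, and default values below are illustrative stand-ins, not the suite's real API:

```java
// Illustrative builder standing in for createShuffleBlockIteratorWithDefaults:
// every constructor argument gets a sensible default, and a test overrides
// only the one it cares about, so the interesting parameter stands out.
public class IteratorDefaults {
    long maxBytesInFlight = 48L * 1024 * 1024; // default, rarely interesting
    int maxReqsInFlight = Integer.MAX_VALUE;
    boolean detectCorrupt = true;

    IteratorDefaults maxBytesInFlight(long v) { maxBytesInFlight = v; return this; }
    IteratorDefaults maxReqsInFlight(int v) { maxReqsInFlight = v; return this; }
    IteratorDefaults detectCorrupt(boolean v) { detectCorrupt = v; return this; }

    public static void main(String[] args) {
        // Compare with the Jira example: only maxBytesInFlight is named.
        IteratorDefaults cfg = new IteratorDefaults().maxBytesInFlight(1000L);
        System.out.println("maxBytesInFlight=" + cfg.maxBytesInFlight
                + " maxReqsInFlight=" + cfg.maxReqsInFlight);
    }
}
```

The same effect is achieved more directly in the Scala test suite with default parameter values; the builder is just the closest Java equivalent.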
[jira] [Resolved] (SPARK-35370) IllegalArgumentException when loading a PipelineModel with Spark 3
[ https://issues.apache.org/jira/browse/SPARK-35370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng resolved SPARK-35370. -- Resolution: Not A Problem > IllegalArgumentException when loading a PipelineModel with Spark 3 > -- > > Key: SPARK-35370 > URL: https://issues.apache.org/jira/browse/SPARK-35370 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 3.1.0, 3.1.1 > Environment: spark 3.1.1 >Reporter: Avenash Kabeera >Priority: Minor > Labels: V3, decisiontree, scala, treemodels > > Hi, > This is a followup to this issue > https://issues.apache.org/jira/browse/SPARK-33398 that fixed an exception > when loading a model in Spark 3 that was trained in Spark 2. After incorporating > this fix in my project, I ran into another issue, which was introduced in the > fix [https://github.com/apache/spark/pull/30889/files.] > While loading my random forest model, which was trained in Spark 2.2, I ran > into the following exception: > {code:java} > 16:03:34 ERROR Instrumentation:73 - java.lang.IllegalArgumentException: > nodeData does not exist. 
Available: treeid, nodedata > at > org.apache.spark.sql.types.StructType.$anonfun$apply$1(StructType.scala:278) > at scala.collection.immutable.Map$Map2.getOrElse(Map.scala:147) > at org.apache.spark.sql.types.StructType.apply(StructType.scala:277) > at > org.apache.spark.ml.tree.EnsembleModelReadWrite$.loadImpl(treeModels.scala:522) > at > org.apache.spark.ml.classification.RandomForestClassificationModel$RandomForestClassificationModelReader.load(RandomForestClassifier.scala:420) > at > org.apache.spark.ml.classification.RandomForestClassificationModel$RandomForestClassificationModelReader.load(RandomForestClassifier.scala:410) > at > org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$load$5(Pipeline.scala:277) > at org.apache.spark.ml.MLEvents.withLoadInstanceEvent(events.scala:160) > at org.apache.spark.ml.MLEvents.withLoadInstanceEvent$(events.scala:155) > at > org.apache.spark.ml.util.Instrumentation.withLoadInstanceEvent(Instrumentation.scala:42) > at > org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$load$4(Pipeline.scala:277) > at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) > at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) > at > scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) > at scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198) > at > org.apache.spark.ml.Pipeline$SharedReadWrite$.$anonfun$load$3(Pipeline.scala:274) > at > org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191) > at scala.util.Try$.apply(Try.scala:213) > at > org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191) > at org.apache.spark.ml.Pipeline$SharedReadWrite$.load(Pipeline.scala:268) > at > 
org.apache.spark.ml.PipelineModel$PipelineModelReader.$anonfun$load$7(Pipeline.scala:356) > at org.apache.spark.ml.MLEvents.withLoadInstanceEvent(events.scala:160) > at org.apache.spark.ml.MLEvents.withLoadInstanceEvent$(events.scala:155) > at > org.apache.spark.ml.util.Instrumentation.withLoadInstanceEvent(Instrumentation.scala:42) > at > org.apache.spark.ml.PipelineModel$PipelineModelReader.$anonfun$load$6(Pipeline.scala:355) > at > org.apache.spark.ml.util.Instrumentation$.$anonfun$instrumented$1(Instrumentation.scala:191) > at scala.util.Try$.apply(Try.scala:213) > at > org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:191) > at > org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:355) > at > org.apache.spark.ml.PipelineModel$PipelineModelReader.load(Pipeline.scala:349) > at org.apache.spark.ml.util.MLReadable.load(ReadWrite.scala:355) > at org.apache.spark.ml.util.MLReadable.load$(ReadWrite.scala:355) > at org.apache.spark.ml.PipelineModel$.load(Pipeline.scala:337){code} > When I looked at the data for the model, I see the schema is using > "*nodedata*" instead of "*nodeData*." Here is what my model looks like: > {code:java} > +--+-+ > |treeid|nodedata >| > +--+-+ > |12|{0, 1.0, 0.20578590428109744, [2492
[jira] [Commented] (SPARK-35129) Construct year-month interval column from integral fields
[ https://issues.apache.org/jira/browse/SPARK-35129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347271#comment-17347271 ] angerszhu commented on SPARK-35129: --- Can I take this if no update for a long time? > Construct year-month interval column from integral fields > - > > Key: SPARK-35129 > URL: https://issues.apache.org/jira/browse/SPARK-35129 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Max Gekk >Priority: Major > > Create new function similar to make_interval() (or extend the make_interval() > function) which can construct YearMonthIntervalType values from the year, > month fields. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35438) Minor documentation fix for window physical operator
[ https://issues.apache.org/jira/browse/SPARK-35438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347219#comment-17347219 ] Apache Spark commented on SPARK-35438: -- User 'c21' has created a pull request for this issue: https://github.com/apache/spark/pull/32585 > Minor documentation fix for window physical operator > > > Key: SPARK-35438 > URL: https://issues.apache.org/jira/browse/SPARK-35438 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 3.2.0 >Reporter: Cheng Su >Priority: Trivial > > As titled. Fixed two places where the documentation had errors, to help > people read the code more easily in the future. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35438) Minor documentation fix for window physical operator
[ https://issues.apache.org/jira/browse/SPARK-35438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35438: Assignee: Apache Spark > Minor documentation fix for window physical operator > > > Key: SPARK-35438 > URL: https://issues.apache.org/jira/browse/SPARK-35438 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 3.2.0 >Reporter: Cheng Su >Assignee: Apache Spark >Priority: Trivial > > As titled. Fixed two places where the documentation had errors, to help > people read the code more easily in the future. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35438) Minor documentation fix for window physical operator
[ https://issues.apache.org/jira/browse/SPARK-35438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35438: Assignee: (was: Apache Spark) > Minor documentation fix for window physical operator > > > Key: SPARK-35438 > URL: https://issues.apache.org/jira/browse/SPARK-35438 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 3.2.0 >Reporter: Cheng Su >Priority: Trivial > > As titled. Fixed two places where the documentation had errors, to help > people read the code more easily in the future. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35438) Minor documentation fix for window physical operator
[ https://issues.apache.org/jira/browse/SPARK-35438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347218#comment-17347218 ] Apache Spark commented on SPARK-35438: -- User 'c21' has created a pull request for this issue: https://github.com/apache/spark/pull/32585 > Minor documentation fix for window physical operator > > > Key: SPARK-35438 > URL: https://issues.apache.org/jira/browse/SPARK-35438 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 3.2.0 >Reporter: Cheng Su >Priority: Trivial > > As titled. Fixed two places where the documentation had errors, to help > people read the code more easily in the future. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35438) Minor documentation fix for window physical operator
Cheng Su created SPARK-35438: Summary: Minor documentation fix for window physical operator Key: SPARK-35438 URL: https://issues.apache.org/jira/browse/SPARK-35438 Project: Spark Issue Type: Documentation Components: SQL Affects Versions: 3.2.0 Reporter: Cheng Su As titled. Fixed two places where the documentation had errors, to help people read the code more easily in the future. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-35305) Upgrade ZooKeeper to 3.7.0
[ https://issues.apache.org/jira/browse/SPARK-35305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-35305. --- Resolution: Later According to the discussion on the PR, we already upgraded the Netty library to remove those vulnerabilities, so we don't need to upgrade ZooKeeper. - https://github.com/apache/spark/pull/32572#pullrequestreview-661653211 > Upgrade ZooKeeper to 3.7.0 > -- > > Key: SPARK-35305 > URL: https://issues.apache.org/jira/browse/SPARK-35305 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.2.0 >Reporter: Hari Prasad G >Priority: Major > > Upgrade ZooKeeper to 3.7.0 to fix the vulnerabilities. > *List of CVEs:* > * CVE-2021-21295 > * CVE-2021-21290 > * CVE-2021-21409 > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35425) Pin jinja2 in spark-rm/Dockerfile and add as a required dependency in the release README.md
[ https://issues.apache.org/jira/browse/SPARK-35425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-35425: -- Target Version/s: 3.0.3, 3.1.2, 3.2.0 (was: 3.20) > Pin jinja2 in spark-rm/Dockerfile and add as a required dependency in the > release README.md > --- > > Key: SPARK-35425 > URL: https://issues.apache.org/jira/browse/SPARK-35425 > Project: Spark > Issue Type: Bug > Components: docs >Affects Versions: 3.0.2, 3.1.1, 3.2.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > Fix For: 3.0.3, 3.1.2, 3.2.0 > > > SPARK-35375 confined the version of Jinja to <3.0.0. > So it's good to note about it in docs/README.md -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35425) Pin jinja2 in spark-rm/Dockerfile and add as a required dependency in the release README.md
[ https://issues.apache.org/jira/browse/SPARK-35425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-35425: -- Issue Type: Bug (was: Improvement) > Pin jinja2 in spark-rm/Dockerfile and add as a required dependency in the > release README.md > --- > > Key: SPARK-35425 > URL: https://issues.apache.org/jira/browse/SPARK-35425 > Project: Spark > Issue Type: Bug > Components: docs >Affects Versions: 3.0.2, 3.1.1, 3.2.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > Fix For: 3.0.3, 3.1.2, 3.2.0 > > > SPARK-35375 confined the version of Jinja to <3.0.0. > So it's good to note about it in docs/README.md -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35425) Pin jinja2 in spark-rm/Dockerfile and add as a required dependency in the release README.md
[ https://issues.apache.org/jira/browse/SPARK-35425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-35425: -- Affects Version/s: 3.0.2 3.1.1 > Pin jinja2 in spark-rm/Dockerfile and add as a required dependency in the > release README.md > --- > > Key: SPARK-35425 > URL: https://issues.apache.org/jira/browse/SPARK-35425 > Project: Spark > Issue Type: Improvement > Components: docs >Affects Versions: 3.0.2, 3.1.1, 3.2.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > Fix For: 3.0.3, 3.1.2, 3.2.0 > > > SPARK-35375 confined the version of Jinja to <3.0.0. > So it's good to note about it in docs/README.md -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35425) Pin jinja2 in spark-rm/Dockerfile and add as a required dependency in the release README.md
[ https://issues.apache.org/jira/browse/SPARK-35425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-35425: -- Fix Version/s: 3.2.0 3.1.2 3.0.3 > Pin jinja2 in spark-rm/Dockerfile and add as a required dependency in the > release README.md > --- > > Key: SPARK-35425 > URL: https://issues.apache.org/jira/browse/SPARK-35425 > Project: Spark > Issue Type: Improvement > Components: docs >Affects Versions: 3.2.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > Fix For: 3.0.3, 3.1.2, 3.2.0 > > > SPARK-35375 confined the version of Jinja to <3.0.0. > So it's good to note about it in docs/README.md -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35425) Pin jinja2 in spark-rm/Dockerfile and add as a required dependency in the release README.md
[ https://issues.apache.org/jira/browse/SPARK-35425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-35425: -- Summary: Pin jinja2 in spark-rm/Dockerfile and add as a required dependency in the release README.md (was: Add note about Jinja2 as a required dependency for document build.) > Pin jinja2 in spark-rm/Dockerfile and add as a required dependency in the > release README.md > --- > > Key: SPARK-35425 > URL: https://issues.apache.org/jira/browse/SPARK-35425 > Project: Spark > Issue Type: Improvement > Components: docs >Affects Versions: 3.2.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > > SPARK-35375 confined the version of Jinja to <3.0.0. > So it's good to note about it in docs/README.md -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35434) Upgrade scalatestplus artifacts to 3.2.9.0
[ https://issues.apache.org/jira/browse/SPARK-35434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-35434: -- Affects Version/s: (was: 3.20) 3.2.0 > Upgrade scalatestplus artifacts to 3.2.9.0 > -- > > Key: SPARK-35434 > URL: https://issues.apache.org/jira/browse/SPARK-35434 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.2.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > Fix For: 3.2.0 > > > scalatestplus artifacts seem to be renamed and their latest release supports > dotty. > * https://github.com/scalatest/scalatestplus-scalacheck > * https://github.com/scalatest/scalatestplus-mockito > * https://github.com/scalatest/scalatestplus-selenium -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35434) Upgrade scalatestplus artifacts to 3.2.9.0
[ https://issues.apache.org/jira/browse/SPARK-35434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-35434: -- Parent: SPARK-25075 Issue Type: Sub-task (was: Improvement) > Upgrade scalatestplus artifacts to 3.2.9.0 > -- > > Key: SPARK-35434 > URL: https://issues.apache.org/jira/browse/SPARK-35434 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.20 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > Fix For: 3.2.0 > > > scalatestplus artifacts seem to be renamed and their latest release supports > dotty. > * https://github.com/scalatest/scalatestplus-scalacheck > * https://github.com/scalatest/scalatestplus-mockito > * https://github.com/scalatest/scalatestplus-selenium -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-35434) Upgrade scalatestplus artifacts to 3.2.9.0
[ https://issues.apache.org/jira/browse/SPARK-35434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-35434. --- Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 32581 [https://github.com/apache/spark/pull/32581] > Upgrade scalatestplus artifacts to 3.2.9.0 > -- > > Key: SPARK-35434 > URL: https://issues.apache.org/jira/browse/SPARK-35434 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.20 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > Fix For: 3.2.0 > > > scalatestplus artifacts seem to be renamed and their latest release supports > dotty. > * https://github.com/scalatest/scalatestplus-scalacheck > * https://github.com/scalatest/scalatestplus-mockito > * https://github.com/scalatest/scalatestplus-selenium -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35437) Hive partition filtering client optimization
[ https://issues.apache.org/jira/browse/SPARK-35437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347017#comment-17347017 ] Apache Spark commented on SPARK-35437: -- User 'cxzl25' has created a pull request for this issue: https://github.com/apache/spark/pull/32583 > Hive partition filtering client optimization > > > Key: SPARK-35437 > URL: https://issues.apache.org/jira/browse/SPARK-35437 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.1 >Reporter: dzcxzl >Priority: Minor > > When we have a table with a lot of partitions and there is no way to filter > it on the MetaStore Server, we will get all the partition details and filter > it on the client side. This is slow and puts a lot of pressure on the > MetaStore Server. > We can first pull all the partition names, filter by expressions, and then > obtain detailed information about the corresponding partitions from the > MetaStore Server. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35437) Hive partition filtering client optimization
[ https://issues.apache.org/jira/browse/SPARK-35437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35437: Assignee: Apache Spark > Hive partition filtering client optimization > > > Key: SPARK-35437 > URL: https://issues.apache.org/jira/browse/SPARK-35437 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.1 >Reporter: dzcxzl >Assignee: Apache Spark >Priority: Minor > > When we have a table with a lot of partitions and there is no way to filter > it on the MetaStore Server, we will get all the partition details and filter > it on the client side. This is slow and puts a lot of pressure on the > MetaStore Server. > We can first pull all the partition names, filter by expressions, and then > obtain detailed information about the corresponding partitions from the > MetaStore Server. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35437) Hive partition filtering client optimization
[ https://issues.apache.org/jira/browse/SPARK-35437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35437: Assignee: (was: Apache Spark) > Hive partition filtering client optimization > > > Key: SPARK-35437 > URL: https://issues.apache.org/jira/browse/SPARK-35437 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.1 >Reporter: dzcxzl >Priority: Minor > > When we have a table with a lot of partitions and there is no way to filter > it on the MetaStore Server, we will get all the partition details and filter > it on the client side. This is slow and puts a lot of pressure on the > MetaStore Server. > We can first pull all the partition names, filter by expressions, and then > obtain detailed information about the corresponding partitions from the > MetaStore Server. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34734) Update sbt version to 1.4.9
[ https://issues.apache.org/jira/browse/SPARK-34734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-34734: -- Parent: SPARK-25075 Issue Type: Sub-task (was: Improvement) > Update sbt version to 1.4.9 > --- > > Key: SPARK-34734 > URL: https://issues.apache.org/jira/browse/SPARK-34734 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.2.0 >Reporter: William Hyun >Assignee: William Hyun >Priority: Major > Fix For: 3.2.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34959) Upgrade SBT to 1.5.0
[ https://issues.apache.org/jira/browse/SPARK-34959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-34959: -- Parent: SPARK-25075 Issue Type: Sub-task (was: Improvement) > Upgrade SBT to 1.5.0 > > > Key: SPARK-34959 > URL: https://issues.apache.org/jira/browse/SPARK-34959 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 3.2.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.2.0 > > > This JIRA issue aims to upgrade SBT to 1.5.0 which has built-in Scala 3 > support. > https://github.com/sbt/sbt/releases/tag/v1.5.0 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35437) Hive partition filtering client optimization
dzcxzl created SPARK-35437: -- Summary: Hive partition filtering client optimization Key: SPARK-35437 URL: https://issues.apache.org/jira/browse/SPARK-35437 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.1 Reporter: dzcxzl When we have a table with a lot of partitions and there is no way to filter it on the MetaStore Server, we will get all the partition details and filter it on the client side. This is slow and puts a lot of pressure on the MetaStore Server. We can first pull all the partition names, filter by expressions, and then obtain detailed information about the corresponding partitions from the MetaStore Server. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
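The two-phase fetch described in SPARK-35437 — pull only the cheap partition names first, filter them on the client, then request full details for the survivors — can be sketched in a few lines. The `prune` helper and the date-style partition names below are illustrative assumptions, not the real Hive metastore client API:

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class PartitionPruningSketch {
    // Filter partition names on the client side. In the optimization, the
    // expensive "get partition details" metastore call would then receive
    // only this reduced list instead of every partition in the table.
    static List<String> prune(List<String> allPartitionNames,
                              Predicate<String> filter) {
        return allPartitionNames.stream()
                .filter(filter)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList(
            "dt=2021-05-17", "dt=2021-05-18", "dt=2021-05-19");
        // Only the matching partitions would be fetched in full.
        System.out.println(prune(names, n -> n.compareTo("dt=2021-05-18") >= 0));
    }
}
```

Listing names is far cheaper than materializing full partition objects, which is why this reduces both latency and load on the metastore server.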
[jira] [Resolved] (SPARK-35411) Essential information missing in TreeNode json string
[ https://issues.apache.org/jira/browse/SPARK-35411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-35411. - Fix Version/s: 3.1.2 3.2.0 Resolution: Fixed Issue resolved by pull request 32557 [https://github.com/apache/spark/pull/32557] > Essential information missing in TreeNode json string > - > > Key: SPARK-35411 > URL: https://issues.apache.org/jira/browse/SPARK-35411 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.1 >Reporter: huangtengfei >Assignee: huangtengfei >Priority: Minor > Fix For: 3.2.0, 3.1.2 > > > TreeNode can be serialized to json string with the method toJSON() or > prettyJson(). To avoid OOM issues, > [SPARK-17426|https://issues.apache.org/jira/browse/SPARK-17426] only keep > part of Seq data that can be written out to result json string. > Essential data like > [cteRelations|https://github.com/apache/spark/blob/v3.1.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala#L497] > in node With, > [branches|https://github.com/apache/spark/blob/v3.1.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/conditionalExpressions.scala#L123] > in CaseWhen will be skipped and written out as null. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
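The truncation can be observed with a small sketch (assumes a running SparkSession `spark`; the query below is illustrative, not from the issue):

```scala
// A plan containing a CaseWhen expression, whose Seq-typed `branches`
// field was serialized as null before this fix.
val df = spark.range(3).selectExpr(
  "CASE WHEN id = 0 THEN 'zero' WHEN id = 1 THEN 'one' ELSE 'many' END AS label")

// Inspect the JSON form of the analyzed plan; with the bug present, the
// CaseWhen node's branches appear as null instead of the actual
// (condition, value) pairs.
println(df.queryExecution.analyzed.prettyJson)
```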
[jira] [Assigned] (SPARK-35411) Essential information missing in TreeNode json string
[ https://issues.apache.org/jira/browse/SPARK-35411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-35411: --- Assignee: huangtengfei > Essential information missing in TreeNode json string > - > > Key: SPARK-35411 > URL: https://issues.apache.org/jira/browse/SPARK-35411 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.1 >Reporter: huangtengfei >Assignee: huangtengfei >Priority: Minor > > TreeNode can be serialized to json string with the method toJSON() or > prettyJson(). To avoid OOM issues, > [SPARK-17426|https://issues.apache.org/jira/browse/SPARK-17426] only keep > part of Seq data that can be written out to result json string. > Essential data like > [cteRelations|https://github.com/apache/spark/blob/v3.1.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala#L497] > in node With, > [branches|https://github.com/apache/spark/blob/v3.1.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/conditionalExpressions.scala#L123] > in CaseWhen will be skipped and written out as null. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35436) RocksDBFileManager - save checkpoint to DFS
[ https://issues.apache.org/jira/browse/SPARK-35436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346978#comment-17346978 ] Apache Spark commented on SPARK-35436: -- User 'xuanyuanking' has created a pull request for this issue: https://github.com/apache/spark/pull/32582 > RocksDBFileManager - save checkpoint to DFS > --- > > Key: SPARK-35436 > URL: https://issues.apache.org/jira/browse/SPARK-35436 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: Yuanjian Li >Priority: Major > > The implementation for the save operation of RocksDBFileManager. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35436) RocksDBFileManager - save checkpoint to DFS
[ https://issues.apache.org/jira/browse/SPARK-35436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346976#comment-17346976 ] Apache Spark commented on SPARK-35436: -- User 'xuanyuanking' has created a pull request for this issue: https://github.com/apache/spark/pull/32582 > RocksDBFileManager - save checkpoint to DFS > --- > > Key: SPARK-35436 > URL: https://issues.apache.org/jira/browse/SPARK-35436 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: Yuanjian Li >Priority: Major > > The implementation for the save operation of RocksDBFileManager. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35436) RocksDBFileManager - save checkpoint to DFS
[ https://issues.apache.org/jira/browse/SPARK-35436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35436: Assignee: Apache Spark > RocksDBFileManager - save checkpoint to DFS > --- > > Key: SPARK-35436 > URL: https://issues.apache.org/jira/browse/SPARK-35436 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: Yuanjian Li >Assignee: Apache Spark >Priority: Major > > The implementation for the save operation of RocksDBFileManager. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35436) RocksDBFileManager - save checkpoint to DFS
[ https://issues.apache.org/jira/browse/SPARK-35436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35436: Assignee: (was: Apache Spark) > RocksDBFileManager - save checkpoint to DFS > --- > > Key: SPARK-35436 > URL: https://issues.apache.org/jira/browse/SPARK-35436 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: Yuanjian Li >Priority: Major > > The implementation for the save operation of RocksDBFileManager. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35079) Transform with udf gives incorrect result
[ https://issues.apache.org/jira/browse/SPARK-35079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346970#comment-17346970 ] koert kuipers commented on SPARK-35079: --- looks to me like this is a duplicate of SPARK-34829 > Transform with udf gives incorrect result > - > > Key: SPARK-35079 > URL: https://issues.apache.org/jira/browse/SPARK-35079 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.1 >Reporter: koert kuipers >Priority: Minor > Fix For: 3.1.2, 3.2.0 > > > I think this is a correctness bug in spark 3.1.1 > the behavior is correct in spark 3.0.1 > in spark 3.0.1: > {code:java} > scala> import spark.implicits._ > scala> import org.apache.spark.sql.functions._ > scala> val x = Seq(Seq("aa", "bb", "cc")).toDF > x: org.apache.spark.sql.DataFrame = [value: array<string>] > scala> x.select(transform(col("value"), col => udf((_: > String).drop(1)).apply(col))).show > +---+ > |transform(value, lambdafunction(UDF(lambda 'x), x))| > +---+ > | [a, b, c]| > +---+ > {code} > in spark 3.1.1: > {code:java} > scala> import spark.implicits._ > scala> import org.apache.spark.sql.functions._ > scala> val x = Seq(Seq("aa", "bb", "cc")).toDF > x: org.apache.spark.sql.DataFrame = [value: array<string>] > scala> x.select(transform(col("value"), col => udf((_: > String).drop(1)).apply(col))).show > +---+ > |transform(value, lambdafunction(UDF(lambda 'x), x))| > +---+ > | [c, c, c]| > +---+ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35436) RocksDBFileManager - save checkpoint to DFS
Yuanjian Li created SPARK-35436: --- Summary: RocksDBFileManager - save checkpoint to DFS Key: SPARK-35436 URL: https://issues.apache.org/jira/browse/SPARK-35436 Project: Spark Issue Type: Sub-task Components: Structured Streaming Affects Versions: 3.2.0 Reporter: Yuanjian Li The implementation for the save operation of RocksDBFileManager. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35435) Not able to capture driver and executor logs in console
Sugumar created SPARK-35435: --- Summary: Not able to capture driver and executor logs in console Key: SPARK-35435 URL: https://issues.apache.org/jira/browse/SPARK-35435 Project: Spark Issue Type: Bug Components: Spark Submit Affects Versions: 2.2.0 Environment: Production Reporter: Sugumar Fix For: 2.2.0 Running the spark job using the below command, but not able to capture the driver and executor logs in console # {{# Run on a Spark standalone cluster in cluster deploy mode with supervise}} # {{./bin/spark-submit \ --class org.apache.spark.examples.SparkPi \ --master spark://207.184.161.138:7077 \ --deploy-mode cluster \ --supervise \ --executor-memory 20G \ --total-executor-cores 100 \ /path/to/examples.jar \ 1000}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35434) Upgrade scalatestplus artifacts to 3.2.9.0
[ https://issues.apache.org/jira/browse/SPARK-35434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35434: Assignee: Apache Spark (was: Kousuke Saruta) > Upgrade scalatestplus artifacts to 3.2.9.0 > -- > > Key: SPARK-35434 > URL: https://issues.apache.org/jira/browse/SPARK-35434 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.20 >Reporter: Kousuke Saruta >Assignee: Apache Spark >Priority: Minor > > scalatestplus artifacts seem to be renamed and their latest release supports > dotty. > * https://github.com/scalatest/scalatestplus-scalacheck > * https://github.com/scalatest/scalatestplus-mockito > * https://github.com/scalatest/scalatestplus-selenium -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35434) Upgrade scalatestplus artifacts to 3.2.9.0
[ https://issues.apache.org/jira/browse/SPARK-35434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346888#comment-17346888 ] Apache Spark commented on SPARK-35434: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/32581 > Upgrade scalatestplus artifacts to 3.2.9.0 > -- > > Key: SPARK-35434 > URL: https://issues.apache.org/jira/browse/SPARK-35434 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.20 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > > scalatestplus artifacts seem to be renamed and their latest release supports > dotty. > * https://github.com/scalatest/scalatestplus-scalacheck > * https://github.com/scalatest/scalatestplus-mockito > * https://github.com/scalatest/scalatestplus-selenium -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-35434) Upgrade scalatestplus artifacts to 3.2.9.0
[ https://issues.apache.org/jira/browse/SPARK-35434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-35434: Assignee: Kousuke Saruta (was: Apache Spark) > Upgrade scalatestplus artifacts to 3.2.9.0 > -- > > Key: SPARK-35434 > URL: https://issues.apache.org/jira/browse/SPARK-35434 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.20 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > > scalatestplus artifacts seem to be renamed and their latest release supports > dotty. > * https://github.com/scalatest/scalatestplus-scalacheck > * https://github.com/scalatest/scalatestplus-mockito > * https://github.com/scalatest/scalatestplus-selenium -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35434) Upgrade scalatestplus artifacts to 3.2.9.0
[ https://issues.apache.org/jira/browse/SPARK-35434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346887#comment-17346887 ] Apache Spark commented on SPARK-35434: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/32581 > Upgrade scalatestplus artifacts to 3.2.9.0 > -- > > Key: SPARK-35434 > URL: https://issues.apache.org/jira/browse/SPARK-35434 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.20 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > > scalatestplus artifacts seem to be renamed and their latest release supports > dotty. > * https://github.com/scalatest/scalatestplus-scalacheck > * https://github.com/scalatest/scalatestplus-mockito > * https://github.com/scalatest/scalatestplus-selenium -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35434) Upgrade scalatestplus artifacts to 3.2.9.0
[ https://issues.apache.org/jira/browse/SPARK-35434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-35434: --- Summary: Upgrade scalatestplus artifacts to 3.2.9.0 (was: Upgrade scalatestplus artifacts 3.2.9.0) > Upgrade scalatestplus artifacts to 3.2.9.0 > -- > > Key: SPARK-35434 > URL: https://issues.apache.org/jira/browse/SPARK-35434 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.20 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > > scalatestplus artifacts seem to be renamed and their latest release supports > dotty. > * https://github.com/scalatest/scalatestplus-scalacheck > * https://github.com/scalatest/scalatestplus-mockito > * https://github.com/scalatest/scalatestplus-selenium -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35434) Upgrade scalatestplus artifacts 3.2.9.0
Kousuke Saruta created SPARK-35434: -- Summary: Upgrade scalatestplus artifacts 3.2.9.0 Key: SPARK-35434 URL: https://issues.apache.org/jira/browse/SPARK-35434 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.20 Reporter: Kousuke Saruta Assignee: Kousuke Saruta scalatestplus artifacts seem to be renamed and their latest release supports dotty. * https://github.com/scalatest/scalatestplus-scalacheck * https://github.com/scalatest/scalatestplus-mockito * https://github.com/scalatest/scalatestplus-selenium -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35433) Move CSV data source options from Python and Scala into a single page.
[ https://issues.apache.org/jira/browse/SPARK-35433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haejoon Lee updated SPARK-35433: Description: Refer to https://issues.apache.org/jira/browse/SPARK-34491 > Move CSV data source options from Python and Scala into a single page. > -- > > Key: SPARK-35433 > URL: https://issues.apache.org/jira/browse/SPARK-35433 > Project: Spark > Issue Type: Sub-task > Components: docs >Affects Versions: 3.2.0 >Reporter: Haejoon Lee >Priority: Major > > Refer to https://issues.apache.org/jira/browse/SPARK-34491 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35433) Move CSV data source options from Python and Scala into a single page.
Haejoon Lee created SPARK-35433: --- Summary: Move CSV data source options from Python and Scala into a single page. Key: SPARK-35433 URL: https://issues.apache.org/jira/browse/SPARK-35433 Project: Spark Issue Type: Sub-task Components: docs Affects Versions: 3.2.0 Reporter: Haejoon Lee -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18683) REST APIs for standalone Master、Workers and Applications
[ https://issues.apache.org/jira/browse/SPARK-18683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346739#comment-17346739 ] Mayank Asthana commented on SPARK-18683: Can we reopen this? My use case is to have Spark Streaming applications running all the time, and we want to check through the REST API whether these applications are still running and restart them if they are not. > REST APIs for standalone Master、Workers and Applications > > > Key: SPARK-18683 > URL: https://issues.apache.org/jira/browse/SPARK-18683 > Project: Spark > Issue Type: Improvement >Reporter: Shixiong Zhu >Priority: Major > Labels: bulk-closed > > It would be great to have some REST APIs to access Master、Workers and > Applications information. Right now the only way to get them is using the Web > UI. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
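The standalone Master web UI does already expose a JSON rendering of its state at the `/json` path (the same data the HTML page shows), which can serve this kind of liveness check. A minimal sketch, where the host/port and the application name are placeholders and the crude substring match stands in for real JSON parsing:

```scala
import scala.io.Source

// Poll the standalone Master's /json endpoint and check whether a named
// application still appears among the listed applications.
val json = Source.fromURL("http://spark-master:8080/json").mkString
val appRunning = json.contains("my-streaming-app")
if (!appRunning) {
  // Trigger a restart here, e.g. by re-running spark-submit.
  println("application not found on the master; restarting")
}
```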
[jira] [Commented] (SPARK-35425) Add note about Jinja2 as a required dependency for document build.
[ https://issues.apache.org/jira/browse/SPARK-35425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346717#comment-17346717 ] Apache Spark commented on SPARK-35425: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/32580 > Add note about Jinja2 as a required dependency for document build. > -- > > Key: SPARK-35425 > URL: https://issues.apache.org/jira/browse/SPARK-35425 > Project: Spark > Issue Type: Improvement > Components: docs >Affects Versions: 3.2.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > > SPARK-35375 confined the version of Jinja to <3.0.0. > So it's good to note about it in docs/README.md -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-35422) Many test cases failed in Scala 2.13 CI
[ https://issues.apache.org/jira/browse/SPARK-35422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-35422. - Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 32577 [https://github.com/apache/spark/pull/32577] > Many test cases failed in Scala 2.13 CI > --- > > Key: SPARK-35422 > URL: https://issues.apache.org/jira/browse/SPARK-35422 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yang Jie >Assignee: Takeshi Yamamuro >Priority: Major > Fix For: 3.2.0 > > > [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-3.2-scala-2.13/lastCompletedBuild/testReport/] > > Failing tests listed in the report (per-test Jenkins links table omitted) include: > * org.apache.spark.sql.SQLQueryTestSuite — subquery/scalar-subquery/scalar-subquery-select.sql > * org.apache.spark.sql.TPCDSModifiedPlanStabilitySuite — check simplified (tpcds-modifiedQueries/q46, q53, q63, q68, q73) > * org.apache.spark.sql.TPCDSModifiedPlanStabilityWithStatsSuite — check simplified sf100 (tpcds-modifiedQueries/q46, q53, …)
[jira] [Assigned] (SPARK-35422) Many test cases failed in Scala 2.13 CI
[ https://issues.apache.org/jira/browse/SPARK-35422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-35422: --- Assignee: Takeshi Yamamuro > Many test cases failed in Scala 2.13 CI > --- > > Key: SPARK-35422 > URL: https://issues.apache.org/jira/browse/SPARK-35422 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yang Jie >Assignee: Takeshi Yamamuro >Priority: Major > > [https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-3.2-scala-2.13/lastCompletedBuild/testReport/] > > Failing tests listed in the report (per-test Jenkins links table omitted) include: > * org.apache.spark.sql.SQLQueryTestSuite — subquery/scalar-subquery/scalar-subquery-select.sql > * org.apache.spark.sql.TPCDSModifiedPlanStabilitySuite — check simplified (tpcds-modifiedQueries/q46, q53, q63, q68, q73) > * org.apache.spark.sql.TPCDSModifiedPlanStabilityWithStatsSuite — check simplified sf100 (tpcds-modifiedQueries/q46, q53, …)
[jira] [Assigned] (SPARK-35389) Analyzer should set propagateNull to false for magic function invocation
[ https://issues.apache.org/jira/browse/SPARK-35389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-35389: --- Assignee: Chao Sun > Analyzer should set propagateNull to false for magic function invocation > > > Key: SPARK-35389 > URL: https://issues.apache.org/jira/browse/SPARK-35389 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > > For both {{Invoke}} and {{StaticInvoke}} used by the magic method of > {{ScalarFunction}}, we should set {{propagateNull}} to false, so that null > values are passed to the UDF for evaluation, instead of bypassing evaluation > and returning null directly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-35389) Analyzer should set propagateNull to false for magic function invocation
[ https://issues.apache.org/jira/browse/SPARK-35389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-35389. - Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 32553 [https://github.com/apache/spark/pull/32553] > Analyzer should set propagateNull to false for magic function invocation > > > Key: SPARK-35389 > URL: https://issues.apache.org/jira/browse/SPARK-35389 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Fix For: 3.2.0 > > > For both {{Invoke}} and {{StaticInvoke}} used by the magic method of > {{ScalarFunction}}, we should set {{propagateNull}} to false, so that null > values are passed to the UDF for evaluation, instead of bypassing evaluation > and returning null directly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
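For context, a "magic method" is one the analyzer binds directly via `Invoke`/`StaticInvoke` instead of going through `produceResult`. A hedged sketch based on the DataSource V2 `ScalarFunction` API (not the code from the PR; `StrLen` and its null handling are illustrative):

```scala
import org.apache.spark.sql.connector.catalog.functions.ScalarFunction
import org.apache.spark.sql.types.{DataType, DataTypes}
import org.apache.spark.unsafe.types.UTF8String

// With propagateNull = true, the generated Invoke would short-circuit a null
// input straight to a null result; setting it to false lets the function
// itself decide what a null input means, as this one does.
object StrLen extends ScalarFunction[Integer] {
  override def inputTypes(): Array[DataType] = Array(DataTypes.StringType)
  override def resultType(): DataType = DataTypes.IntegerType
  override def name(): String = "strlen"

  // Magic method: matched by name and argument types at analysis time.
  def invoke(s: UTF8String): Integer =
    if (s == null) 0  // only reachable when propagateNull = false
    else s.numChars()
}
```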
[jira] [Updated] (SPARK-35089) non consistent results running count for same dataset after filter and lead window function
[ https://issues.apache.org/jira/browse/SPARK-35089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Domagoj updated SPARK-35089: Description: edit 2021-05-18 I have made it simpler to reproduce; I've put the already-generated data (24,000,000 records) in a publicly available S3 bucket. Now all you need to do is run this code: {code:java} import org.apache.spark.sql.expressions.Window import org.apache.spark.sql._ import org.apache.spark.sql.functions._ val w = Window.partitionBy("user").orderBy("start") val ts_lead = coalesce(lead("start", 1) .over(w), lit(3000)) spark.read.orc("s3://dtonzetic-spark-sample-data/sample-data.orc"). withColumn("end", ts_lead). withColumn("duration", col("end")-col("start")). where("type='TypeA' and duration>4").count() {code} These were my results: - run 1: 2547559 - run 2: 2547559 - run 3: 2547560 - run 4: 2547558 - run 5: 2547558 - run 6: 2547559 - run 7: 2547558 These results are from a new EMR cluster, version 6.3.0, so nothing changed. end edit 2021-05-18 I have found an inconsistency in count results after a lead window function and a filter. I have a dataframe (this is a simplified version, but it's enough to reproduce) with millions of records, with these columns: * df1: ** start(timestamp) ** user_id(int) ** type(string) I need to define the duration between two rows, and filter on that duration and type. I used the window lead function to get the next event time (which defines the end of the current event), so every row now gets start and stop times. If NULL (the last row, for example), I add the next midnight as the stop. Data is stored in an ORC file (I tried Parquet format too; no difference). This only happens with multiple cluster nodes, for example an AWS EMR cluster or a local Docker cluster setup. If I run it on a single instance (locally on a laptop), I get consistent results every time. Spark version is 3.0.1 in the AWS, local, and Docker setups. 
Here is some simple code that you can use to reproduce it, I've used jupyterLab notebook on AWS EMR. Spark version is 3.0.1. {code:java} import org.apache.spark.sql.expressions.Window // this dataframe generation code should be executed only once, and data have to be saved, and then opened from disk, so it's always same. val getRandomUser = udf(()=>{ val users = Seq("John","Eve","Anna","Martin","Joe","Steve","Katy") users(scala.util.Random.nextInt(7)) }) val getRandomType = udf(()=>{ val types = Seq("TypeA","TypeB","TypeC","TypeD","TypeE") types(scala.util.Random.nextInt(5)) }) val getRandomStart = udf((x:Int)=>{ x+scala.util.Random.nextInt(47) }) // for loop is used to avoid out of memory error during creation of dataframe for( a <- 0 to 23){ // use iterator a to continue with next million, repeat 1 mil times val x=Range(a*100,(a*100)+100).toDF("id"). withColumn("start",getRandomStart(col("id"))). withColumn("user",getRandomUser()). withColumn("type",getRandomType()). drop("id") x.write.mode("append").orc("hdfs:///random.orc") } // above code should be run only once, I used a cell in Jupyter // define window and lead val w = Window.partitionBy("user").orderBy("start") // if null, replace with 30.000.000 val ts_lead = coalesce(lead("start", 1) .over(w), lit(3000)) // read data to dataframe, create stop column and calculate duration val fox2 = spark.read.orc("hdfs:///random.orc"). withColumn("end", ts_lead). withColumn("duration", col("end")-col("start")) // repeated executions of this line returns different results for count // I have it in separate cell in JupyterLab fox2.where("type='TypeA' and duration>4").count() {code} My results for three consecutive runs of last line were: * run 1: 2551259 * run 2: 2550756 * run 3: 2551279 It's very important to say that if I use filter: fox2.where("type='TypeA' ") or fox2.where("duration>4"), each of them can be executed repeatedly and I get consistent result every time. 
I can save dataframe after crating stop and duration columns, and after that, I get consistent results every time. It is not very practical workaround, as I need a lot of space and time to implement it. This dataset is really big (in my eyes at least, aprox 100.000.000 new records per day). If I run this same example on my local machine using master = local[*], everything works as expected, it's just on cluster setup. I tried to create cluster using docker on my local machine, created 3.0.1 and 3.1.1 clusters with one master and two workers, and have successfully reproduced issue. was: edit 2021-05-18 I have make it simpler to reproduce; I've put already generated data on s3 bucket that is publicly available with 24.000.000 records Now all you need to do is run this code: {code:java} import org.apache.spark.sql.expressions.Window import org.a
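The counts above drifting by a handful of rows is consistent with a well-known pitfall; flagging it here as a hypothesis, not the confirmed root cause of this ticket. Window.partitionBy("user").orderBy("start") does not define a total order when a user has duplicate start values, so lead("start", 1) can pair a tied row with a different successor on each run, and a filter on the derived duration then matches different rows. A minimal pure-Python sketch of how a single tie flips the count:

```python
def count_typea_gt4(ordered):
    # emulate lead("start", 1) over rows already ordered by "start",
    # with 3000 as the coalesce default for the last row, then apply
    # the filter type='TypeA' and duration > 4
    n = 0
    for i, (start, typ) in enumerate(ordered):
        end = ordered[i + 1][0] if i + 1 < len(ordered) else 3000
        if typ == "TypeA" and end - start > 4:
            n += 1
    return n

# rows for one user, as (start, type); note the tie at start=10.
# Both orderings below are valid sorts by "start"; only the tie order differs.
a = count_typea_gt4([(10, "TypeA"), (10, "TypeB"), (20, "TypeC")])  # TypeA first
b = count_typea_gt4([(10, "TypeB"), (10, "TypeA"), (20, "TypeC")])  # TypeA second
print(a, b)  # 0 1
```

When TypeA comes first in the tie, its lead is the other start=10 row (duration 0, filtered out); when it comes second, its lead is the start=20 row (duration 10, counted). On a cluster, tie order can vary with shuffle and split boundaries, which would explain why a single local instance is stable.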
[jira] [Commented] (SPARK-35089) non consistent results running count for same dataset after filter and lead window function
[ https://issues.apache.org/jira/browse/SPARK-35089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346704#comment-17346704 ] Domagoj commented on SPARK-35089: - I've made test data available via public s3 bucket, so it's easier to reproduce now.
[jira] [Created] (SPARK-35432) Expose TIMESTAMP_SECONDS, TIMESTAMP_MILLIS and TIMESTAMP_MICROS functions via Scala, Python and R APIs
Max Gekk created SPARK-35432: Summary: Expose TIMESTAMP_SECONDS, TIMESTAMP_MILLIS and TIMESTAMP_MICROS functions via Scala, Python and R APIs Key: SPARK-35432 URL: https://issues.apache.org/jira/browse/SPARK-35432 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.2.0 Reporter: Max Gekk The PR https://github.com/apache/spark/pull/28534 added the new functions TIMESTAMP_SECONDS, TIMESTAMP_MILLIS and TIMESTAMP_MICROS, but the functions are available to users only via SQL. To make the other APIs (Scala/Java, PySpark and R) as powerful as SQL, the functions need to be implemented in those APIs too. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
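The three functions above all interpret an integer as an offset from the Unix epoch, at second, millisecond and microsecond granularity respectively. A minimal pure-Python sketch of that semantics (an illustration only, not Spark's implementation or its API):

```python
from datetime import datetime, timezone

def timestamp_seconds(s):
    # interpret s as seconds since the Unix epoch, in UTC
    return datetime.fromtimestamp(s, tz=timezone.utc)

def timestamp_millis(ms):
    # interpret ms as milliseconds since the Unix epoch
    return datetime.fromtimestamp(ms / 1_000, tz=timezone.utc)

def timestamp_micros(us):
    # interpret us as microseconds since the Unix epoch
    return datetime.fromtimestamp(us / 1_000_000, tz=timezone.utc)

print(timestamp_seconds(1230219000))  # 2008-12-25 15:30:00+00:00
```

In SQL these are already reachable as `SELECT TIMESTAMP_SECONDS(1230219000)`; the ticket is about giving Scala/Java, PySpark and R users an equivalent function-level entry point.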
[jira] [Commented] (SPARK-35425) Add note about Jinja2 as a required dependency for document build.
[ https://issues.apache.org/jira/browse/SPARK-35425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17346682#comment-17346682 ] Apache Spark commented on SPARK-35425: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/32579 > Add note about Jinja2 as a required dependency for document build. > -- > > Key: SPARK-35425 > URL: https://issues.apache.org/jira/browse/SPARK-35425 > Project: Spark > Issue Type: Improvement > Components: docs >Affects Versions: 3.2.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > > SPARK-35375 confined the version of Jinja to <3.0.0. > So it's good to note about it in docs/README.md -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
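The constraint from SPARK-35375 can also be captured as a pip requirements fragment; the exact file name used by the Spark docs build is an assumption here, not taken from this ticket:

```
# hypothetical docs-build requirements constraint, per SPARK-35375
Jinja2<3.0.0
```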
[jira] [Resolved] (SPARK-35351) Add code-gen for left anti sort merge join
[ https://issues.apache.org/jira/browse/SPARK-35351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-35351. -- Fix Version/s: 3.2.0 Assignee: Cheng Su Resolution: Fixed Resolved by https://github.com/apache/spark/pull/32547 > Add code-gen for left anti sort merge join > -- > > Key: SPARK-35351 > URL: https://issues.apache.org/jira/browse/SPARK-35351 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Cheng Su >Assignee: Cheng Su >Priority: Minor > Fix For: 3.2.0 > > > This Jira is to track the progress to add code-gen support for left anti sort > merge join. See motivation in SPARK-34705. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
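For context on what the generated code has to implement: a left anti join keeps the left-side rows that have no match on the right. A minimal pure-Python sketch of the sort-merge variant (assuming both inputs are already sorted by the join key; an illustration of the merge scan, not Spark's generated code):

```python
def sort_merge_left_anti(left, right, key=lambda r: r):
    # both inputs must be sorted ascending by key;
    # emit each left row whose key never appears on the right
    out = []
    j = 0
    for l in left:
        k = key(l)
        # advance the right cursor past keys smaller than the current left key
        while j < len(right) and key(right[j]) < k:
            j += 1
        # keep the left row only if the right side has no equal key
        if j >= len(right) or key(right[j]) != k:
            out.append(l)
    return out

print(sort_merge_left_anti([1, 2, 3, 4], [2, 4, 5]))  # [1, 3]
```

Because the right cursor only moves forward, each side is scanned once, which is what makes the sort-merge form attractive to code-generate alongside the other join types from SPARK-34705.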
[jira] [Updated] (SPARK-35256) Subexpression elimination leading to a performance regression
[ https://issues.apache.org/jira/browse/SPARK-35256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ondrej Kokes updated SPARK-35256: - Summary: Subexpression elimination leading to a performance regression (was: str_to_map + split performance regression)

> Subexpression elimination leading to a performance regression
> -
>
> Key: SPARK-35256
> URL: https://issues.apache.org/jira/browse/SPARK-35256
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.1.1
> Reporter: Ondrej Kokes
> Priority: Minor
> Attachments: bisect_log.txt, bisect_timing.csv
>
> I'm seeing almost double the runtime between 3.0.1 and 3.1.1 in my pipeline
> that does mostly str_to_map, split and a few other operations - all
> projections, no joins or aggregations (it's here only to trigger the
> pipeline). I cut it down to the simplest reproducible example I could -
> anything I remove from this changes the runtime difference quite
> dramatically. (even moving those two expressions from f.when to standalone
> columns makes the difference disappear)
> {code:java}
> import time
> import os
>
> import pyspark
> from pyspark.sql import SparkSession
> import pyspark.sql.functions as f
>
> if __name__ == '__main__':
>     print(pyspark.__version__)
>     spark = SparkSession.builder.getOrCreate()
>
>     filename = 'regression.csv'
>     if not os.path.isfile(filename):
>         with open(filename, 'wt') as fw:
>             fw.write('foo\n')
>             for _ in range(10_000_000):
>                 fw.write('foo=bar&baz=bak&bar=f,o,1:2:3\n')
>
>     df = spark.read.option('header', True).csv(filename)
>     t = time.time()
>     dd = (df
>           .withColumn('my_map', f.expr('str_to_map(foo, "&", "=")'))
>           .withColumn('extracted',
>                       # without this top level split it is only 50% slower, with it
>                       # the runtime almost doubles
>                       f.split(f.split(f.col("my_map")["bar"], ",")[2], ":")[0]
>                       )
>           .select(
>               f.when(
>                   f.col("extracted").startswith("foo"), f.col("extracted")
>               ).otherwise(
>                   f.concat(f.lit("foo"), f.col("extracted"))
>               ).alias("foo")
>           )
>           )
>     # dd.explain(True)
>     _ = dd.groupby("foo").count().count()
>     print("elapsed", time.time() - t)
> {code}
> Running this in 3.0.1 and 3.1.1 respectively (both installed from PyPI, on my
> local macOS)
> {code:java}
> 3.0.1
> elapsed 21.262351036071777
> 3.1.1
> elapsed 40.26582884788513
> {code}
> (Meaning the transformation took 21 seconds in 3.0.1 and 40 seconds in 3.1.1)
> Feel free to make the CSV smaller to get a quicker feedback loop - it scales
> linearly (I developed this with 2M rows).
> It might be related to my previous issue - SPARK-32989 - there are similar
> operations, nesting etc. (splitting on the original column, not on a map,
> makes the difference disappear)
> I tried dissecting the queries in SparkUI and via explain, but both 3.0.1 and
> 3.1.1 produced identical plans.

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
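The renamed summary blames subexpression elimination, and the reporter notes that moving the two expressions out of f.when makes the regression disappear. One plausible mechanism (an assumption here, not a confirmed diagnosis of this ticket) is that hoisting a common subexpression out of a conditional branch forces it to be evaluated for every row, even rows where the branch would never have touched it. A minimal pure-Python illustration of that trade-off:

```python
calls = {"expensive": 0}

def expensive(x):
    # stand-in for a costly expression such as split(split(...)[2], ":")[0]
    calls["expensive"] += 1
    return x * x

def branch_lazy(cond, x):
    # the expression is evaluated only when the branch is taken
    return expensive(x) if cond else 0

def branch_hoisted(cond, x):
    # after hoisting (what subexpression elimination can do),
    # the expression runs for every row regardless of the condition
    s = expensive(x)
    return s if cond else 0

calls["expensive"] = 0
for i in range(100):
    branch_lazy(i % 10 == 0, i)      # the condition is rarely true
lazy_calls = calls["expensive"]

calls["expensive"] = 0
for i in range(100):
    branch_hoisted(i % 10 == 0, i)
hoisted_calls = calls["expensive"]

print(lazy_calls, hoisted_calls)  # 10 100
```

If this is the mechanism, comparing the query with Spark's internal flag `spark.sql.subexpressionElimination.enabled` toggled would be one way to confirm it.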
[jira] [Updated] (SPARK-35420) Replace the usage of toStringHelper with ToStringBuilder
[ https://issues.apache.org/jira/browse/SPARK-35420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-35420: --- Fix Version/s: 3.2.0 > Replace the usage of toStringHelper with ToStringBuilder > > > Key: SPARK-35420 > URL: https://issues.apache.org/jira/browse/SPARK-35420 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.2.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > Fix For: 3.2.0 > > > SPARK-30272 removed the usage of Guava APIs that break in Guava 27, but > toStringHelper has been introduced again. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-35425) Add note about Jinja2 as a required dependency for document build.
[ https://issues.apache.org/jira/browse/SPARK-35425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta resolved SPARK-35425. Target Version/s: 3.2.0 Resolution: Fixed This issue was resolved in https://github.com/apache/spark/pull/32573. > Add note about Jinja2 as a required dependency for document build. > -- > > Key: SPARK-35425 > URL: https://issues.apache.org/jira/browse/SPARK-35425 > Project: Spark > Issue Type: Improvement > Components: docs >Affects Versions: 3.2.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > > SPARK-35375 confined the version of Jinja to <3.0.0. > So it's good to note about it in docs/README.md -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org