[jira] [Resolved] (SPARK-48705) Explicitly use worker_main when it starts with pyspark
[ https://issues.apache.org/jira/browse/SPARK-48705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-48705. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 47077 [https://github.com/apache/spark/pull/47077] > Explicitly use worker_main when it starts with pyspark > -- > > Key: SPARK-48705 > URL: https://issues.apache.org/jira/browse/SPARK-48705 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 4.0.0 > > > After SPARK-44380, the Python worker must use the provided arguments as the > worker main. This might break other cases, such as > https://github.com/getsentry/sentry-python/blob/master/sentry_sdk/integrations/spark/spark_worker.py -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
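For context, the linked sentry integration relies on monkey-patching the worker entry point, which is the kind of case the change above can break. A minimal sketch of that pattern (the error-reporting hook is illustrative, not the actual sentry code):

{code:python}
import pyspark.daemon
import pyspark.worker

# pyspark/daemon.py imports pyspark.worker.main under the name `worker_main`;
# integrations intercept task execution by overriding that module attribute.
_original_main = pyspark.worker.main

def _patched_worker_main(*args, **kwargs):
    try:
        return _original_main(*args, **kwargs)
    except Exception:
        # A real integration would report the failure to an external
        # service here before letting the worker die.
        raise

pyspark.daemon.worker_main = _patched_worker_main
{code}

If the daemon resolves the worker main by name instead of reading this module attribute, the patch is silently bypassed, which is why the ticket asks for worker_main to be used explicitly.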
[jira] [Assigned] (SPARK-48705) Explicitly use worker_main when it starts with pyspark
[ https://issues.apache.org/jira/browse/SPARK-48705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-48705: Assignee: Hyukjin Kwon > Explicitly use worker_main when it starts with pyspark > -- > > Key: SPARK-48705 > URL: https://issues.apache.org/jira/browse/SPARK-48705 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > > After SPARK-44380, the Python worker must use the provided arguments as the > worker main. This might break other cases, such as > https://github.com/getsentry/sentry-python/blob/master/sentry_sdk/integrations/spark/spark_worker.py -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48706) Python UDF in higher order functions should not throw internal error
[ https://issues.apache.org/jira/browse/SPARK-48706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao reassigned SPARK-48706: Assignee: Hyukjin Kwon > Python UDF in higher order functions should not throw internal error > > > Key: SPARK-48706 > URL: https://issues.apache.org/jira/browse/SPARK-48706 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > {code} > from pyspark.sql.functions import transform, udf, col, array > spark.range(1).select(transform(array("id"), lambda x: udf(lambda y: > y)(x))).collect() > {code} > throws an internal error: > {code} > at > org.apache.spark.SparkException$.internalError(SparkException.scala:88) > at > org.apache.spark.SparkException$.internalError(SparkException.scala:92) > at > org.apache.spark.sql.errors.QueryExecutionErrors$.cannotEvaluateExpressionError(QueryExecutionErrors.scala:73) > at > org.apache.spark.sql.catalyst.expressions.Unevaluable.eval(Expression.scala:507) > at > org.apache.spark.sql.catalyst.expressions.Unevaluable.eval$(Expression.scala:506) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48706) Python UDF in higher order functions should not throw internal error
[ https://issues.apache.org/jira/browse/SPARK-48706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-48706. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 47079 [https://github.com/apache/spark/pull/47079] > Python UDF in higher order functions should not throw internal error > > > Key: SPARK-48706 > URL: https://issues.apache.org/jira/browse/SPARK-48706 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 4.0.0 > > > {code} > from pyspark.sql.functions import transform, udf, col, array > spark.range(1).select(transform(array("id"), lambda x: udf(lambda y: > y)(x))).collect() > {code} > throws an internal error: > {code} > at > org.apache.spark.SparkException$.internalError(SparkException.scala:88) > at > org.apache.spark.SparkException$.internalError(SparkException.scala:92) > at > org.apache.spark.sql.errors.QueryExecutionErrors$.cannotEvaluateExpressionError(QueryExecutionErrors.scala:73) > at > org.apache.spark.sql.catalyst.expressions.Unevaluable.eval(Expression.scala:507) > at > org.apache.spark.sql.catalyst.expressions.Unevaluable.eval$(Expression.scala:506) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48719) Wrong Result in regr_slope & regr_intercept Aggregate when Tuples Have NULL
[ https://issues.apache.org/jira/browse/SPARK-48719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathon Lee updated SPARK-48719: - Description: When calculating slope and intercept using the regr_slope & regr_intercept aggregates: (using the Java API) {code:java} spark.sql("drop table if exists tab"); spark.sql("CREATE TABLE tab(y int, x int) using parquet"); spark.sql("INSERT INTO tab VALUES (1, 1)"); spark.sql("INSERT INTO tab VALUES (2, 3)"); spark.sql("INSERT INTO tab VALUES (3, 5)"); spark.sql("INSERT INTO tab VALUES (NULL, 3)"); spark.sql("INSERT INTO tab VALUES (3, NULL)"); spark.sql("SELECT " + "regr_slope(x, y), " + "regr_intercept(x, y)" + "FROM tab").show(); {code} Spark result: {code:java}
+------------------+--------------------+
|  regr_slope(x, y)|regr_intercept(x, y)|
+------------------+--------------------+
|1.4545454545454546| 0.09090909090909083|
+------------------+--------------------+
{code} The correct answer should obviously be 2.0 and -1.0. Reason: In sql/catalyst/expressions/aggregate/linearRegression.scala, {code:java} case class RegrSlope(left: Expression, right: Expression) extends DeclarativeAggregate with ImplicitCastInputTypes with BinaryLike[Expression] { private val covarPop = new CovPopulation(right, left) private val varPop = new VariancePop(right) .. {code} CovPopulation will filter out tuples whose right *OR* left is NULL, but VariancePop will only filter out tuples whose right is NULL. This will cause a wrong result when some of the tuples' left is null (and right is not null). {*}The same reasoning applies to RegrIntercept{*}. A possible fix: {code:java} case class RegrSlope(left: Expression, right: Expression) extends DeclarativeAggregate with ImplicitCastInputTypes with BinaryLike[Expression] { private val covarPop = new CovPopulation(right, left) private val varPop = new VariancePop(If(And(IsNotNull(left), IsNotNull(right)), right, Literal.create(null, right.dataType))) .{code} *The same fix applies to RegrIntercept.* was: When calculating slope and intercept using the regr_slope & regr_intercept aggregates: (using the Java API) {code:java} spark.sql("drop table if exists tab"); spark.sql("CREATE TABLE tab(y int, x int) using parquet"); spark.sql("INSERT INTO tab VALUES (1, 1)"); spark.sql("INSERT INTO tab VALUES (2, 3)"); spark.sql("INSERT INTO tab VALUES (3, 5)"); spark.sql("INSERT INTO tab VALUES (NULL, 3)"); spark.sql("INSERT INTO tab VALUES (3, NULL)"); spark.sql("SELECT " + "regr_slope(x, y), " + "regr_intercept(x, y)" + "FROM tab").show(); {code} Spark result: {code:java}
+------------------+--------------------+
|  regr_slope(x, y)|regr_intercept(x, y)|
+------------------+--------------------+
|1.4545454545454546| 0.09090909090909083|
+------------------+--------------------+
{code} The correct answer should obviously be 2.0 and -1.0. Reason: In sql/catalyst/expressions/aggregate/linearRegression.scala, {code:java} case class RegrSlope(left: Expression, right: Expression) extends DeclarativeAggregate with ImplicitCastInputTypes with BinaryLike[Expression] { private val covarPop = new CovPopulation(right, left) private val varPop = new VariancePop(right) .. {code} CovPopulation will filter out tuples whose right *OR* left is NULL, but VariancePop will only filter out tuples whose right is NULL. This will cause a wrong result when some of the tuples' left is null (and right is not null). The same reasoning applies to RegrIntercept. A possible fix: {code:java} case class RegrSlope(left: Expression, right: Expression) extends DeclarativeAggregate with ImplicitCastInputTypes with BinaryLike[Expression] { private val covarPop = new CovPopulation(right, left) private val varPop = new VariancePop(If(And(IsNotNull(left), IsNotNull(right)), right, Literal.create(null, right.dataType))) .{code} > Wrong Result in regr_slope & regr_intercept Aggregate when Tuples Have NULL > > > Key: SPARK-48719 > URL: https://issues.apache.org/jira/browse/SPARK-48719 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 3.4.0 >Reporter: Jonathon Lee >Priority: Major > > When calculating slope and intercept using the regr_slope & regr_intercept > aggregates: > (using the Java API) > {code:java} > spark.sql("drop table if exists tab"); > spark.sql("CREATE TABLE tab(y int, x int) using parquet"); > spark.sql("INSERT INTO tab VALUES (1, 1)"); > spark.sql("INSERT INTO tab VALUES (2, 3)"); > spark.sql("INSERT INTO tab VALUES (3, 5)"); > spark.sql("INSERT INTO tab VALUES (NULL, 3)"); > spark.sql("INSERT INTO tab VALUES (3, NULL)"); > spark.sql("SELECT " + > "regr_slope(x, y), " + > "regr_intercept(x, y)" + > "FROM tab").show(); {code} > Spark result: > {code:java
[jira] [Created] (SPARK-48719) Wrong Result in regr_slope & regr_intercept Aggregate when Tuples Have NULL
Jonathon Lee created SPARK-48719: Summary: Wrong Result in regr_slope & regr_intercept Aggregate when Tuples Have NULL Key: SPARK-48719 URL: https://issues.apache.org/jira/browse/SPARK-48719 Project: Spark Issue Type: Bug Components: Spark Core, SQL Affects Versions: 3.4.0 Reporter: Jonathon Lee When calculating slope and intercept using the regr_slope & regr_intercept aggregates: (using the Java API) {code:java} spark.sql("drop table if exists tab"); spark.sql("CREATE TABLE tab(y int, x int) using parquet"); spark.sql("INSERT INTO tab VALUES (1, 1)"); spark.sql("INSERT INTO tab VALUES (2, 3)"); spark.sql("INSERT INTO tab VALUES (3, 5)"); spark.sql("INSERT INTO tab VALUES (NULL, 3)"); spark.sql("INSERT INTO tab VALUES (3, NULL)"); spark.sql("SELECT " + "regr_slope(x, y), " + "regr_intercept(x, y)" + "FROM tab").show(); {code} Spark result: {code:java}
+------------------+--------------------+
|  regr_slope(x, y)|regr_intercept(x, y)|
+------------------+--------------------+
|1.4545454545454546| 0.09090909090909083|
+------------------+--------------------+
{code} The correct answer should obviously be 2.0 and -1.0. Reason: In sql/catalyst/expressions/aggregate/linearRegression.scala, {code:java} case class RegrSlope(left: Expression, right: Expression) extends DeclarativeAggregate with ImplicitCastInputTypes with BinaryLike[Expression] { private val covarPop = new CovPopulation(right, left) private val varPop = new VariancePop(right) .. {code} CovPopulation will filter out tuples whose right *OR* left is NULL, but VariancePop will only filter out tuples whose right is NULL. This will cause a wrong result when some of the tuples' left is null (and right is not null). The same reasoning applies to RegrIntercept. A possible fix: {code:java} case class RegrSlope(left: Expression, right: Expression) extends DeclarativeAggregate with ImplicitCastInputTypes with BinaryLike[Expression] { private val covarPop = new CovPopulation(right, left) private val varPop = new VariancePop(If(And(IsNotNull(left), IsNotNull(right)), right, Literal.create(null, right.dataType))) .{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
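As a quick sanity check (not part of the ticket, shown in PySpark for brevity): excluding the incomplete pairs up front, which is effectively what the proposed VariancePop change does, yields the expected coefficients.

{code:python}
# Hypothetical verification against the same `tab` table as above: only the
# three complete pairs (y, x) = (1, 1), (2, 3), (3, 5) should contribute,
# and they lie exactly on the line x = 2*y - 1.
spark.sql("""
    SELECT regr_slope(x, y), regr_intercept(x, y)
    FROM tab
    WHERE x IS NOT NULL AND y IS NOT NULL
""").show()
# Expected: slope 2.0, intercept -1.0
{code}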
[jira] [Resolved] (SPARK-48573) Upgrade ICU version
[ https://issues.apache.org/jira/browse/SPARK-48573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-48573. -- Target Version/s: 4.0.0 Assignee: Mihailo Milosevic Resolution: Fixed > Upgrade ICU version > --- > > Key: SPARK-48573 > URL: https://issues.apache.org/jira/browse/SPARK-48573 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 >Reporter: Mihailo Milosevic >Assignee: Mihailo Milosevic >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48573) Upgrade ICU version
[ https://issues.apache.org/jira/browse/SPARK-48573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-48573: - Fix Version/s: 4.0.0 > Upgrade ICU version > --- > > Key: SPARK-48573 > URL: https://issues.apache.org/jira/browse/SPARK-48573 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 >Reporter: Mihailo Milosevic >Assignee: Mihailo Milosevic >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48718) ClassCastException when the deserializer in cogroup is resolved during application of the DeduplicateRelation rule
[ https://issues.apache.org/jira/browse/SPARK-48718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-48718. - Fix Version/s: 4.0.0 Assignee: Xinyi Yu Resolution: Fixed > ClassCastException when the deserializer in cogroup is resolved during > application of the DeduplicateRelation rule > > > Key: SPARK-48718 > URL: https://issues.apache.org/jira/browse/SPARK-48718 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 >Reporter: Xinyi Yu >Assignee: Xinyi Yu >Priority: Major > Fix For: 4.0.0 > > > When running the following commands: > > > {code:java} > val lhs = spark.createDataFrame( > List(Row(123L)).asJava, > StructType(Seq(StructField("GROUPING_KEY", LongType))) > ) > val rhs = spark.createDataFrame( > List(Row(0L, 123L)).asJava, > StructType(Seq(StructField("ID", LongType), StructField("GROUPING_KEY", > LongType))) > ) val lhsKV = lhs.groupByKey((r: Row) => r.getAs[Long]("GROUPING_KEY")) > val rhsKV = rhs.groupByKey((r: Row) => r.getAs[Long]("GROUPING_KEY")) > val cogrouped = lhsKV.cogroup(rhsKV)( > (a: Long, b: Iterator[Row], c: Iterator[Row]) => Iterator(0L) > ) > val joined = rhs.join(cogrouped, col("ID") === col("value"), "left") > {code} > > > it fails with: > {code:java} > java.lang.ClassCastException: > org.apache.spark.sql.catalyst.expressions.objects.AssertNotNull cannot be > cast to org.apache.spark.sql.catalyst.analysis.UnresolvedDeserializer {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43781) IllegalStateException when cogrouping two datasets derived from the same source
[ https://issues.apache.org/jira/browse/SPARK-43781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-43781: Fix Version/s: 4.0.0 (was: 3.5.0) > IllegalStateException when cogrouping two datasets derived from the same > source > --- > > Key: SPARK-43781 > URL: https://issues.apache.org/jira/browse/SPARK-43781 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.1, 3.4.0 > Environment: Reproduces in a unit test, using Spark 3.3.1, the Java > API, and a {{local[2]}} SparkSession. >Reporter: Derek Murray >Assignee: Jia Fan >Priority: Major > Fix For: 4.0.0 > > > Attempting to {{cogroup}} two datasets derived from the same source dataset > yields an {{IllegalStateException}} when the query is executed. > Minimal reproducer: > {code:java} > StructType inputType = DataTypes.createStructType( > new StructField[]{ > DataTypes.createStructField("id", DataTypes.LongType, false), > DataTypes.createStructField("type", DataTypes.StringType, false) > } > ); > StructType keyType = DataTypes.createStructType( > new StructField[]{ > DataTypes.createStructField("id", DataTypes.LongType, false) > } > ); > List<Row> inputRows = new ArrayList<>(); > inputRows.add(RowFactory.create(1L, "foo")); > inputRows.add(RowFactory.create(1L, "bar")); > inputRows.add(RowFactory.create(2L, "foo")); > Dataset<Row> input = sparkSession.createDataFrame(inputRows, inputType); > KeyValueGroupedDataset<Row, Row> fooGroups = input > .filter("type = 'foo'") > .groupBy("id") > .as(RowEncoder.apply(keyType), RowEncoder.apply(inputType)); > KeyValueGroupedDataset<Row, Row> barGroups = input > .filter("type = 'bar'") > .groupBy("id") > .as(RowEncoder.apply(keyType), RowEncoder.apply(inputType)); > Dataset<Row> result = fooGroups.cogroup( > barGroups, > (CoGroupFunction<Row, Row, Row, Row>) (row, iterator, iterator1) -> new > ArrayList<Row>().iterator(), > RowEncoder.apply(inputType)); > result.explain(); > result.show();{code} > Explain output (note the mismatch in column IDs between Sort/Exchange and > LocalTableScan on the first input to the CoGroup): > {code:java} > == Physical Plan == > AdaptiveSparkPlan isFinalPlan=false > +- SerializeFromObject > [validateexternaltype(getexternalrowfield(assertnotnull(input[0, > org.apache.spark.sql.Row, true]), 0, id), LongType, false) AS id#37L, > staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, > fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, > org.apache.spark.sql.Row, true]), 1, type), StringType, false), true, false, > true) AS type#38] > +- CoGroup > org.apache.spark.sql.KeyValueGroupedDataset$$Lambda$1478/1869116781@77856cc5, > createexternalrow(id#16L, StructField(id,LongType,false)), > createexternalrow(id#16L, type#17.toString, StructField(id,LongType,false), > StructField(type,StringType,false)), createexternalrow(id#16L, > type#17.toString, StructField(id,LongType,false), > StructField(type,StringType,false)), [id#39L], [id#39L], [id#39L, type#40], > [id#39L, type#40], obj#36: org.apache.spark.sql.Row > :- !Sort [id#39L ASC NULLS FIRST], false, 0 > : +- !Exchange hashpartitioning(id#39L, 2), ENSURE_REQUIREMENTS, > [plan_id=19] > : +- LocalTableScan [id#16L, type#17] > +- Sort [id#39L ASC NULLS FIRST], false, 0 > +- Exchange hashpartitioning(id#39L, 2), ENSURE_REQUIREMENTS, > [plan_id=20] > +- LocalTableScan [id#39L, type#40]{code} > Exception: > {code:java} > java.lang.IllegalStateException: Couldn't find id#39L in [id#16L,type#17] > at >
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:589) > at scala.collection.immutable.ArraySeq.map(ArraySeq.scala:75) > at scala.collection.immutable.ArraySeq.map(ArraySeq.scala:35) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:698) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:589) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$tra
[jira] [Created] (SPARK-48718) ClassCastException when the deserializer in cogroup is resolved during application of the DeduplicateRelation rule
Xinyi Yu created SPARK-48718: Summary: ClassCastException when the deserializer in cogroup is resolved during application of the DeduplicateRelation rule Key: SPARK-48718 URL: https://issues.apache.org/jira/browse/SPARK-48718 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 4.0.0 Reporter: Xinyi Yu When running the following commands: {code:java} val lhs = spark.createDataFrame( List(Row(123L)).asJava, StructType(Seq(StructField("GROUPING_KEY", LongType))) ) val rhs = spark.createDataFrame( List(Row(0L, 123L)).asJava, StructType(Seq(StructField("ID", LongType), StructField("GROUPING_KEY", LongType))) ) val lhsKV = lhs.groupByKey((r: Row) => r.getAs[Long]("GROUPING_KEY")) val rhsKV = rhs.groupByKey((r: Row) => r.getAs[Long]("GROUPING_KEY")) val cogrouped = lhsKV.cogroup(rhsKV)( (a: Long, b: Iterator[Row], c: Iterator[Row]) => Iterator(0L) ) val joined = rhs.join(cogrouped, col("ID") === col("value"), "left") {code} It fails with: {code:java} java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.objects.AssertNotNull cannot be cast to org.apache.spark.sql.catalyst.analysis.UnresolvedDeserializer {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48717) Python foreachBatch streaming query cannot be stopped gracefully after pin thread mode is enabled and is running spark queries
Wei Liu created SPARK-48717: --- Summary: Python foreachBatch streaming query cannot be stopped gracefully after pin thread mode is enabled and is running spark queries Key: SPARK-48717 URL: https://issues.apache.org/jira/browse/SPARK-48717 Project: Spark Issue Type: New Feature Components: PySpark, SS Affects Versions: 4.0.0 Reporter: Wei Liu Follow-up of https://issues.apache.org/jira/browse/SPARK-39218. That fix only considered the case where InterruptedException is thrown while time.sleep(10) is interrupted. But when a Spark query is executing: {code:java} def func(batch_df, batch_id): batch_df.sparkSession.range(1000).write.saveAsTable("oops") print(batch_df.count()) {code} the actual error would be: {code:java} py4j.protocol.Py4JJavaError: An error occurred while calling o2141502.saveAsTable. : org.apache.spark.SparkException: Job aborted. ... at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.$anonfun$run$2(StreamExecution.scala:262) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:262) *Caused by: java.lang.InterruptedException at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1000)* at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1308) {code} We should also handle this scenario. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
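A minimal sketch of the kind of handling the ticket asks for, assuming the fix lives in the Python-side wrapper around the user function (names here are illustrative, not the actual PySpark internals):

{code:python}
from py4j.protocol import Py4JJavaError

def _call_foreach_batch(user_func, batch_df, batch_id):
    try:
        user_func(batch_df, batch_id)
    except Py4JJavaError as e:
        # When the query is stopped, the JVM interrupts the foreachBatch
        # thread. An in-flight Spark action then surfaces not as a bare
        # InterruptedException but as a Py4JJavaError whose root cause is
        # java.lang.InterruptedException (see the trace above). Swallow
        # that case so stop() stays graceful; re-raise real user errors.
        if "java.lang.InterruptedException" in str(e):
            return
        raise
{code}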
[jira] [Resolved] (SPARK-48638) Native QueryExecution information for the dataframe
[ https://issues.apache.org/jira/browse/SPARK-48638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-48638. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46996 [https://github.com/apache/spark/pull/46996] > Native QueryExecution information for the dataframe > --- > > Key: SPARK-48638 > URL: https://issues.apache.org/jira/browse/SPARK-48638 > Project: Spark > Issue Type: Improvement > Components: Connect, PySpark >Affects Versions: 4.0.0 >Reporter: Martin Grund >Assignee: Martin Grund >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Adding a new property to `DataFrame` called `queryExecution` that returns a > class that contains information about the query execution and its metrics. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48638) Native QueryExecution information for the dataframe
[ https://issues.apache.org/jira/browse/SPARK-48638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-48638: Assignee: Martin Grund > Native QueryExecution information for the dataframe > --- > > Key: SPARK-48638 > URL: https://issues.apache.org/jira/browse/SPARK-48638 > Project: Spark > Issue Type: Improvement > Components: Connect, PySpark >Affects Versions: 4.0.0 >Reporter: Martin Grund >Assignee: Martin Grund >Priority: Major > Labels: pull-request-available > > Adding a new property to `DataFrame` called `queryExecution` that returns a > class that contains information about the query execution and its metrics. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48716) Add JobGroupId to onSqlStart
Lingkai Kong created SPARK-48716: Summary: Add JobGroupId to onSqlStart Key: SPARK-48716 URL: https://issues.apache.org/jira/browse/SPARK-48716 Project: Spark Issue Type: Task Components: SQL Affects Versions: 3.5.1 Reporter: Lingkai Kong -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] (SPARK-39901) Reconsider design of ignoreCorruptFiles feature
[ https://issues.apache.org/jira/browse/SPARK-39901 ] Wei Guo deleted comment on SPARK-39901: - was (Author: wayne guo): The `ignoreCorruptFiles` features in the SQL (spark.sql.files.ignoreCorruptFiles) and RDD (spark.files.ignoreCorruptFiles) scenarios both need to be included. > Reconsider design of ignoreCorruptFiles feature > --- > > Key: SPARK-39901 > URL: https://issues.apache.org/jira/browse/SPARK-39901 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Josh Rosen >Priority: Major > > I'm filing this ticket as a followup to the discussion at > [https://github.com/apache/spark/pull/36775#issuecomment-1148136217] > regarding the `ignoreCorruptFiles` feature: the current implementation is > biased towards considering a broad range of IOExceptions to be corruption, but > this is likely overly broad and might mis-identify transient errors as > corruption (causing non-corrupt data to be erroneously discarded). > SPARK-39389 fixes one instance of that problem, but we are still vulnerable > to similar issues because of the overall design of this feature. > I think we should reconsider the design of this feature: maybe we should > switch the default behavior so that only an explicit allowlist of known > corruption exceptions can cause files to be skipped. This could be done > through involvement of other parts of the code, e.g. rewrapping exceptions > into a `CorruptFileException` so higher layers can positively identify > corruption. > Any changes to behavior here could potentially impact users' jobs, so we'd > need to think carefully about when we want to change (in a 3.x release? 4.x?) > and how we want to provide escape hatches (e.g. configs to revert back to old > behavior). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
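For reference, the two flags named in the deleted comment are independent settings, which is why both scenarios would need to be covered by any redesign:

{code:python}
from pyspark.sql import SparkSession

# The SQL flag governs file-based SQL scans; the core flag governs RDD
# inputs such as sc.textFile. Enabling one does not enable the other.
spark = (SparkSession.builder
         .config("spark.sql.files.ignoreCorruptFiles", "true")
         .config("spark.files.ignoreCorruptFiles", "true")
         .getOrCreate())
{code}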
[jira] [Created] (SPARK-48715) UTF8String - Java String conversions should use Unicode replacement logic
Uroš Bojanić created SPARK-48715: Summary: UTF8String - Java String conversions should use Unicode replacement logic Key: SPARK-48715 URL: https://issues.apache.org/jira/browse/SPARK-48715 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 4.0.0 Reporter: Uroš Bojanić -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
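The ticket has no description yet; by convention, "Unicode replacement logic" means mapping invalid byte sequences to the replacement character U+FFFD rather than throwing or silently corrupting data. A one-liner demonstrating that convention (Python used only to illustrate the behavior):

{code:python}
# Invalid UTF-8 bytes decode to the replacement character U+FFFD.
print(b"ab\xffcd".decode("utf-8", errors="replace"))  # -> 'ab\ufffdcd'
{code}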
[jira] [Resolved] (SPARK-48578) Add new expressions for UTF8 string validation
[ https://issues.apache.org/jira/browse/SPARK-48578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-48578. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46845 [https://github.com/apache/spark/pull/46845] > Add new expressions for UTF8 string validation > -- > > Key: SPARK-48578 > URL: https://issues.apache.org/jira/browse/SPARK-48578 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Assignee: Uroš Bojanić >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48578) Add new expressions for UTF8 string validation
[ https://issues.apache.org/jira/browse/SPARK-48578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-48578: --- Assignee: Uroš Bojanić > Add new expressions for UTF8 string validation > -- > > Key: SPARK-48578 > URL: https://issues.apache.org/jira/browse/SPARK-48578 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Assignee: Uroš Bojanić >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-47927) Nullability after join not respected in UDF
[ https://issues.apache.org/jira/browse/SPARK-47927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17859940#comment-17859940 ] GridGain Integration commented on SPARK-47927: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/47081 > Nullability after join not respected in UDF > --- > > Key: SPARK-47927 > URL: https://issues.apache.org/jira/browse/SPARK-47927 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0, 3.5.1, 3.4.3 >Reporter: Emil Ejbyfeldt >Assignee: Emil Ejbyfeldt >Priority: Major > Labels: correctness, pull-request-available > Fix For: 4.0.0, 3.5.2, 3.4.4 > > > {code} > val ds1 = Seq(1).toDS() > val ds2 = Seq[Int]().toDS() > val f = udf[(Int, Option[Int]), (Int, Option[Int])](identity) > ds1.join(ds2, ds1("value") === ds2("value"), > "outer").select(f(struct(ds1("value"), ds2("value")))).show() > ds1.join(ds2, ds1("value") === ds2("value"), > "outer").select(struct(ds1("value"), ds2("value"))).show() {code} > outputs > {code}
> +---------------------------------------+
> |UDF(struct(value, value, value, value))|
> +---------------------------------------+
> |                                 {1, 0}|
> +---------------------------------------+
> +--------------------+
> |struct(value, value)|
> +--------------------+
> |           {1, NULL}|
> +--------------------+
> {code} > So when the result is passed to the UDF, the nullability after the join is > not respected and we incorrectly end up with a 0 value instead of a null/None > value. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48714) Implement df.mergeInto in PySpark
Pengfei Xu created SPARK-48714: -- Summary: Implement df.mergeInto in PySpark Key: SPARK-48714 URL: https://issues.apache.org/jira/browse/SPARK-48714 Project: Spark Issue Type: New Feature Components: PySpark Affects Versions: 4.0.0 Reporter: Pengfei Xu -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
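The ticket has no description yet. A hypothetical sketch of the PySpark surface, mirroring the existing Scala DataFrame.mergeInto API (the method names, chaining, and table/column names below are assumptions, not the final API):

{code:python}
from pyspark.sql.functions import col

# Hypothetical upsert of `source` (a DataFrame read from a table named
# "source") into the table "target", keyed on `id`.
(source.mergeInto("target", col("source.id") == col("target.id"))
       .whenMatched().updateAll()
       .whenNotMatched().insertAll()
       .merge())
{code}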
[jira] [Created] (SPARK-48713) Add index range check for UnsafeRow.pointTo when baseObject is byte array
wuyi created SPARK-48713: Summary: Add index range check for UnsafeRow.pointTo when baseObject is byte array Key: SPARK-48713 URL: https://issues.apache.org/jira/browse/SPARK-48713 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: wuyi -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48712) Perf Improvement for Encode with empty string and UTF-8 charset
[ https://issues.apache.org/jira/browse/SPARK-48712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-48712: - Description:
Apple M2 Max
encode:   Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
-UTF-8             3672           3697          22         5.4         183.6       1.0X
+UTF-8            79270          79698         448         0.3        3963.5       1.0X
> Perf Improvement for Encode with empty string and UTF-8 charset > --- > > Key: SPARK-48712 > URL: https://issues.apache.org/jira/browse/SPARK-48712 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Kent Yao >Priority: Major > > Apple M2 Max
> encode:   Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
> -UTF-8             3672           3697          22         5.4         183.6       1.0X
> +UTF-8            79270          79698         448         0.3        3963.5       1.0X
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
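A sketch of one way to exercise the benchmarked expression (the row count is arbitrary, and the built-in noop sink is used to discard results):

{code:python}
# Drive the Encode expression with an empty string and the UTF-8 charset.
df = spark.range(10_000_000).selectExpr("encode('', 'UTF-8') AS empty_utf8")
df.write.format("noop").mode("overwrite").save()
{code}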
[jira] [Created] (SPARK-48712) Perf Improvement for Encode with empty string and UTF-8 charset
Kent Yao created SPARK-48712: Summary: Perf Improvement for Encode with empty string and UTF-8 charset Key: SPARK-48712 URL: https://issues.apache.org/jira/browse/SPARK-48712 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Kent Yao -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48693) simplify and unify toString of Invoke and StaticInvoke
[ https://issues.apache.org/jira/browse/SPARK-48693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-48693. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 47066 [https://github.com/apache/spark/pull/47066] > simplify and unify toString of Invoke and StaticInvoke > -- > > Key: SPARK-48693 > URL: https://issues.apache.org/jira/browse/SPARK-48693 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48711) OOM killer may leave SparkContext in broken state causing ConnectionRefusedError
Rafal Wojdyla created SPARK-48711: - Summary: OOM killer may leave SparkContext in broken state causing ConnectionRefusedError Key: SPARK-48711 URL: https://issues.apache.org/jira/browse/SPARK-48711 Project: Spark Issue Type: Bug Components: PySpark, Spark Core Affects Versions: 3.5.0 Reporter: Rafal Wojdyla Related to https://issues.apache.org/jira/browse/SPARK-18523 and https://github.com/apache/spark/pull/15961. I'm currently on: {code} pyspark 3.5.0 pyhd8ed1ab_0 conda-forge py4j 0.10.9.7 pyhd8ed1ab_0 conda-forge {code} When the Spark JVM process gets OOM-killed, `SparkContext.stop` fails with `ConnectionRefusedError`, which leaves the `SparkSession/Context` in a "dirty" state. https://issues.apache.org/jira/browse/SPARK-18523 addressed this by catching the {{Py4JError}}, but it looks like the code now raises {{ConnectionRefusedError}}: {code} Traceback (most recent call last): ... File "/lib/python3.11/site-packages/pyspark/sql/session.py", line 1796, in stop self._sc.stop() File "/lib/python3.11/site-packages/pyspark/context.py", line 654, in stop self._jsc.stop() File "/lib/python3.11/site-packages/py4j/java_gateway.py", line 1321, in __call__ answer = self.gateway_client.send_command(command) ^ File "/lib/python3.11/site-packages/py4j/java_gateway.py", line 1036, in send_command connection = self._get_connection() ^^ File "/lib/python3.11/site-packages/py4j/clientserver.py", line 284, in _get_connection connection = self._create_new_connection() ^ File "/lib/python3.11/site-packages/py4j/clientserver.py", line 291, in _create_new_connection connection.connect_to_java_server() File "/lib/python3.11/site-packages/py4j/clientserver.py", line 438, in connect_to_java_server self.socket.connect((self.java_address, self.java_port)) ConnectionRefusedError: [Errno 111] Connection refused {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
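Until teardown is resilient to a dead JVM, a user-side workaround sketch (the exception set is based on the traceback above; whether further cleanup of the broken session singleton is needed is left open by the ticket):

{code:python}
from py4j.protocol import Py4JError

def stop_quietly(spark):
    # If the JVM was OOM-killed, stop() cannot reach the Py4J gateway;
    # treat the connection failure as "already stopped".
    try:
        spark.stop()
    except (ConnectionRefusedError, Py4JError):
        pass
{code}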
[jira] [Created] (SPARK-48710) Incompatibilities with NumPy 2.0
Patrick Marx created SPARK-48710: Summary: Incompatibilities with NumPy 2.0 Key: SPARK-48710 URL: https://issues.apache.org/jira/browse/SPARK-48710 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 4.0.0 Reporter: Patrick Marx PySpark references some code that was removed in NumPy 2.0. {{python/pyspark/pandas/strings.py}}: * {{np.NaN}} was removed and should be replaced with {{np.nan}} {{python/pyspark/pandas/typedef/typehints.py}} * {{np.string_}} was removed, [is an alias for|https://github.com/numpy/numpy/blob/v1.26.5/numpy/__init__.pyi#L3134] {{np.bytes_}} * {{np.float_}} was removed, [is defined the same as|https://github.com/numpy/numpy/blob/v1.26.5/numpy/__init__.pyi#L3042-3043] {{np.double}} * {{np.unicode_}} was removed, [is an alias for|https://github.com/numpy/numpy/blob/v1.26.5/numpy/__init__.pyi#L3148] {{np.str_}} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
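A compact illustration of the replacements listed above (the left-hand aliases were removed in NumPy 2.0; the assignments exist only to document the mapping):

{code:python}
import numpy as np

nan_value = np.nan      # replaces the removed np.NaN
bytes_type = np.bytes_  # replaces the removed np.string_
float_type = np.double  # replaces the removed np.float_
str_type = np.str_      # replaces the removed np.unicode_
{code}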
[jira] [Assigned] (SPARK-48698) Support analyze column stats for tables with collated columns
[ https://issues.apache.org/jira/browse/SPARK-48698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-48698: -- Assignee: (was: Apache Spark) > Support analyze column stats for tables with collated columns > - > > Key: SPARK-48698 > URL: https://issues.apache.org/jira/browse/SPARK-48698 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 >Reporter: Nikola Mandic >Priority: Major > > The following sequence fails: > {code:java} > > create table t(s string collate utf8_lcase) using parquet; > > insert into t values ('A'); > > analyze table t compute statistics for all columns; > [UNSUPPORTED_FEATURE.ANALYZE_UNSUPPORTED_COLUMN_TYPE] The feature is not > supported: The ANALYZE TABLE FOR COLUMNS command does not support the type > "STRING COLLATE UTF8_LCASE" of the column `s` in the table > `spark_catalog`.`default`.`t`. SQLSTATE: 0A000 > {code} > Users should be able to run ANALYZE commands on tables that have columns > with collated types. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48177) Upgrade `Parquet` to 1.14.1
[ https://issues.apache.org/jira/browse/SPARK-48177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-48177: -- Assignee: Apache Spark (was: Fokko Driesprong) > Upgrade `Parquet` to 1.14.1 > --- > > Key: SPARK-48177 > URL: https://issues.apache.org/jira/browse/SPARK-48177 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 4.0.0 >Reporter: Fokko Driesprong >Assignee: Apache Spark >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48709) varchar resolution mismatch for DataSourceV2 CTAS
Yuming Wang created SPARK-48709: --- Summary: varchar resolution mismatch for DataSourceV2 CTAS Key: SPARK-48709 URL: https://issues.apache.org/jira/browse/SPARK-48709 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.5.1, 3.5.0, 4.0.0 Reporter: Yuming Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org