[jira] [Resolved] (SPARK-48705) Explicitly use worker_main when it starts with pyspark
[ https://issues.apache.org/jira/browse/SPARK-48705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-48705. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 47077 [https://github.com/apache/spark/pull/47077] > Explicitly use worker_main when it starts with pyspark > -- > > Key: SPARK-48705 > URL: https://issues.apache.org/jira/browse/SPARK-48705 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 4.0.0 > > > After SPARK-44380, the Python worker must use the provided arguments as the > worker main. This might break other cases, such as > https://github.com/getsentry/sentry-python/blob/master/sentry_sdk/integrations/spark/spark_worker.py -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
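For context, the linked sentry integration relies on monkey-patching the worker entry point, which is the kind of case the change above can break. A minimal sketch of that pattern (the error-reporting hook is illustrative, not the actual sentry code):

{code:python}
import pyspark.daemon
import pyspark.worker

# pyspark/daemon.py imports pyspark.worker.main under the name `worker_main`;
# integrations intercept task execution by overriding that module attribute.
_original_main = pyspark.worker.main

def _patched_worker_main(*args, **kwargs):
    try:
        return _original_main(*args, **kwargs)
    except Exception:
        # A real integration would report the failure to an external
        # service here before letting the worker die.
        raise

pyspark.daemon.worker_main = _patched_worker_main
{code}

If the daemon resolves the worker main by name instead of reading this module attribute, the patch is silently bypassed, which is why the ticket asks for worker_main to be used explicitly.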
[jira] [Assigned] (SPARK-48705) Explicitly use worker_main when it starts with pyspark
[ https://issues.apache.org/jira/browse/SPARK-48705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-48705: Assignee: Hyukjin Kwon > Explicitly use worker_main when it starts with pyspark > -- > > Key: SPARK-48705 > URL: https://issues.apache.org/jira/browse/SPARK-48705 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > > After SPARK-44380, the Python worker must use the provided arguments as the > worker main. This might break other cases, such as > https://github.com/getsentry/sentry-python/blob/master/sentry_sdk/integrations/spark/spark_worker.py -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48706) Python UDF in higher order functions should not throw internal error
[ https://issues.apache.org/jira/browse/SPARK-48706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao reassigned SPARK-48706: Assignee: Hyukjin Kwon > Python UDF in higher order functions should not throw internal error > > > Key: SPARK-48706 > URL: https://issues.apache.org/jira/browse/SPARK-48706 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > {code} > from pyspark.sql.functions import transform, udf, col, array > spark.range(1).select(transform(array("id"), lambda x: udf(lambda y: > y)(x))).collect() > {code} > throws an internal error: > {code} > at > org.apache.spark.SparkException$.internalError(SparkException.scala:88) > at > org.apache.spark.SparkException$.internalError(SparkException.scala:92) > at > org.apache.spark.sql.errors.QueryExecutionErrors$.cannotEvaluateExpressionError(QueryExecutionErrors.scala:73) > at > org.apache.spark.sql.catalyst.expressions.Unevaluable.eval(Expression.scala:507) > at > org.apache.spark.sql.catalyst.expressions.Unevaluable.eval$(Expression.scala:506) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48706) Python UDF in higher order functions should not throw internal error
[ https://issues.apache.org/jira/browse/SPARK-48706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-48706. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 47079 [https://github.com/apache/spark/pull/47079] > Python UDF in higher order functions should not throw internal error > > > Key: SPARK-48706 > URL: https://issues.apache.org/jira/browse/SPARK-48706 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 4.0.0 > > > {code} > from pyspark.sql.functions import transform, udf, col, array > spark.range(1).select(transform(array("id"), lambda x: udf(lambda y: > y)(x))).collect() > {code} > throws an internal error: > {code} > at > org.apache.spark.SparkException$.internalError(SparkException.scala:88) > at > org.apache.spark.SparkException$.internalError(SparkException.scala:92) > at > org.apache.spark.sql.errors.QueryExecutionErrors$.cannotEvaluateExpressionError(QueryExecutionErrors.scala:73) > at > org.apache.spark.sql.catalyst.expressions.Unevaluable.eval(Expression.scala:507) > at > org.apache.spark.sql.catalyst.expressions.Unevaluable.eval$(Expression.scala:506) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48719) Wrong Result in regr_slope & regr_intercept Aggregate when Tuples Have NULL
[ https://issues.apache.org/jira/browse/SPARK-48719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jonathon Lee updated SPARK-48719: - Description: When calculating slope and intercept using the regr_slope & regr_intercept aggregates: (using the Java API) {code:java} spark.sql("drop table if exists tab"); spark.sql("CREATE TABLE tab(y int, x int) using parquet"); spark.sql("INSERT INTO tab VALUES (1, 1)"); spark.sql("INSERT INTO tab VALUES (2, 3)"); spark.sql("INSERT INTO tab VALUES (3, 5)"); spark.sql("INSERT INTO tab VALUES (NULL, 3)"); spark.sql("INSERT INTO tab VALUES (3, NULL)"); spark.sql("SELECT " + "regr_slope(x, y), " + "regr_intercept(x, y)" + "FROM tab").show(); {code} Spark result: {code:java}
+------------------+--------------------+
|  regr_slope(x, y)|regr_intercept(x, y)|
+------------------+--------------------+
|1.4545454545454546| 0.09090909090909083|
+------------------+--------------------+
{code} The correct answer should obviously be 2.0 and -1.0. Reason: In sql/catalyst/expressions/aggregate/linearRegression.scala, {code:java} case class RegrSlope(left: Expression, right: Expression) extends DeclarativeAggregate with ImplicitCastInputTypes with BinaryLike[Expression] { private val covarPop = new CovPopulation(right, left) private val varPop = new VariancePop(right) .. {code} CovPopulation will filter out tuples whose right *OR* left is NULL, but VariancePop will only filter out tuples whose right is NULL. This will cause a wrong result when some of the tuples' left is null (and right is not null). {*}The same reasoning applies to RegrIntercept{*}. A possible fix: {code:java} case class RegrSlope(left: Expression, right: Expression) extends DeclarativeAggregate with ImplicitCastInputTypes with BinaryLike[Expression] { private val covarPop = new CovPopulation(right, left) private val varPop = new VariancePop(If(And(IsNotNull(left), IsNotNull(right)), right, Literal.create(null, right.dataType))) .{code} *The same fix applies to RegrIntercept.* was: When calculating slope and intercept using the regr_slope & regr_intercept aggregates: (using the Java API) {code:java} spark.sql("drop table if exists tab"); spark.sql("CREATE TABLE tab(y int, x int) using parquet"); spark.sql("INSERT INTO tab VALUES (1, 1)"); spark.sql("INSERT INTO tab VALUES (2, 3)"); spark.sql("INSERT INTO tab VALUES (3, 5)"); spark.sql("INSERT INTO tab VALUES (NULL, 3)"); spark.sql("INSERT INTO tab VALUES (3, NULL)"); spark.sql("SELECT " + "regr_slope(x, y), " + "regr_intercept(x, y)" + "FROM tab").show(); {code} Spark result: {code:java}
+------------------+--------------------+
|  regr_slope(x, y)|regr_intercept(x, y)|
+------------------+--------------------+
|1.4545454545454546| 0.09090909090909083|
+------------------+--------------------+
{code} The correct answer should obviously be 2.0 and -1.0. Reason: In sql/catalyst/expressions/aggregate/linearRegression.scala, {code:java} case class RegrSlope(left: Expression, right: Expression) extends DeclarativeAggregate with ImplicitCastInputTypes with BinaryLike[Expression] { private val covarPop = new CovPopulation(right, left) private val varPop = new VariancePop(right) .. {code} CovPopulation will filter out tuples whose right *OR* left is NULL, but VariancePop will only filter out tuples whose right is NULL. This will cause a wrong result when some of the tuples' left is null (and right is not null). The same reasoning applies to RegrIntercept. A possible fix: {code:java} case class RegrSlope(left: Expression, right: Expression) extends DeclarativeAggregate with ImplicitCastInputTypes with BinaryLike[Expression] { private val covarPop = new CovPopulation(right, left) private val varPop = new VariancePop(If(And(IsNotNull(left), IsNotNull(right)), right, Literal.create(null, right.dataType))) .{code} > Wrong Result in regr_slope & regr_intercept Aggregate when Tuples Have NULL > > > Key: SPARK-48719 > URL: https://issues.apache.org/jira/browse/SPARK-48719 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 3.4.0 >Reporter: Jonathon Lee >Priority: Major > > When calculating slope and intercept using the regr_slope & regr_intercept > aggregates: > (using the Java API) > {code:java} > spark.sql("drop table if exists tab"); > spark.sql("CREATE TABLE tab(y int, x int) using parquet"); > spark.sql("INSERT INTO tab VALUES (1, 1)"); > spark.sql("INSERT INTO tab VALUES (2, 3)"); > spark.sql("INSERT INTO tab VALUES (3, 5)"); > spark.sql("INSERT INTO tab VALUES (NULL, 3)"); > spark.sql("INSERT INTO tab VALUES (3, NULL)"); > spark.sql("SELECT " + > "regr_slope(x, y), " + > "regr_intercept(x, y)" + > "FROM tab").show(); {code} > Spark result: > {code:java
[jira] [Created] (SPARK-48719) Wrong Result in regr_slope & regr_intercept Aggregate when Tuples Have NULL
Jonathon Lee created SPARK-48719: Summary: Wrong Result in regr_slope & regr_intercept Aggregate when Tuples Have NULL Key: SPARK-48719 URL: https://issues.apache.org/jira/browse/SPARK-48719 Project: Spark Issue Type: Bug Components: Spark Core, SQL Affects Versions: 3.4.0 Reporter: Jonathon Lee When calculating slope and intercept using the regr_slope & regr_intercept aggregates: (using the Java API) {code:java} spark.sql("drop table if exists tab"); spark.sql("CREATE TABLE tab(y int, x int) using parquet"); spark.sql("INSERT INTO tab VALUES (1, 1)"); spark.sql("INSERT INTO tab VALUES (2, 3)"); spark.sql("INSERT INTO tab VALUES (3, 5)"); spark.sql("INSERT INTO tab VALUES (NULL, 3)"); spark.sql("INSERT INTO tab VALUES (3, NULL)"); spark.sql("SELECT " + "regr_slope(x, y), " + "regr_intercept(x, y)" + "FROM tab").show(); {code} Spark result: {code:java}
+------------------+--------------------+
|  regr_slope(x, y)|regr_intercept(x, y)|
+------------------+--------------------+
|1.4545454545454546| 0.09090909090909083|
+------------------+--------------------+
{code} The correct answer should obviously be 2.0 and -1.0. Reason: In sql/catalyst/expressions/aggregate/linearRegression.scala, {code:java} case class RegrSlope(left: Expression, right: Expression) extends DeclarativeAggregate with ImplicitCastInputTypes with BinaryLike[Expression] { private val covarPop = new CovPopulation(right, left) private val varPop = new VariancePop(right) .. {code} CovPopulation will filter out tuples whose right *OR* left is NULL, but VariancePop will only filter out tuples whose right is NULL. This will cause a wrong result when some of the tuples' left is null (and right is not null). The same reasoning applies to RegrIntercept. A possible fix: {code:java} case class RegrSlope(left: Expression, right: Expression) extends DeclarativeAggregate with ImplicitCastInputTypes with BinaryLike[Expression] { private val covarPop = new CovPopulation(right, left) private val varPop = new VariancePop(If(And(IsNotNull(left), IsNotNull(right)), right, Literal.create(null, right.dataType))) .{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
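As a quick sanity check (not part of the ticket, shown in PySpark for brevity): excluding the incomplete pairs up front, which is effectively what the proposed VariancePop change does, yields the expected coefficients.

{code:python}
# Hypothetical verification against the same `tab` table as above: only the
# three complete pairs (y, x) = (1, 1), (2, 3), (3, 5) should contribute,
# and they lie exactly on the line x = 2*y - 1.
spark.sql("""
    SELECT regr_slope(x, y), regr_intercept(x, y)
    FROM tab
    WHERE x IS NOT NULL AND y IS NOT NULL
""").show()
# Expected: slope 2.0, intercept -1.0
{code}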
[jira] [Resolved] (SPARK-48573) Upgrade ICU version
[ https://issues.apache.org/jira/browse/SPARK-48573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-48573. -- Target Version/s: 4.0.0 Assignee: Mihailo Milosevic Resolution: Fixed > Upgrade ICU version > --- > > Key: SPARK-48573 > URL: https://issues.apache.org/jira/browse/SPARK-48573 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 >Reporter: Mihailo Milosevic >Assignee: Mihailo Milosevic >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48573) Upgrade ICU version
[ https://issues.apache.org/jira/browse/SPARK-48573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-48573: - Fix Version/s: 4.0.0 > Upgrade ICU version > --- > > Key: SPARK-48573 > URL: https://issues.apache.org/jira/browse/SPARK-48573 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 >Reporter: Mihailo Milosevic >Assignee: Mihailo Milosevic >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48718) ClassCastException when the deserializer in cogroup is resolved during application of the DeduplicateRelation rule
[ https://issues.apache.org/jira/browse/SPARK-48718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-48718. - Fix Version/s: 4.0.0 Assignee: Xinyi Yu Resolution: Fixed > ClassCastException when the deserializer in cogroup is resolved during > application of the DeduplicateRelation rule > > > Key: SPARK-48718 > URL: https://issues.apache.org/jira/browse/SPARK-48718 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 >Reporter: Xinyi Yu >Assignee: Xinyi Yu >Priority: Major > Fix For: 4.0.0 > > > When running the following commands: > > > {code:java} > val lhs = spark.createDataFrame( > List(Row(123L)).asJava, > StructType(Seq(StructField("GROUPING_KEY", LongType))) > ) > val rhs = spark.createDataFrame( > List(Row(0L, 123L)).asJava, > StructType(Seq(StructField("ID", LongType), StructField("GROUPING_KEY", > LongType))) > ) val lhsKV = lhs.groupByKey((r: Row) => r.getAs[Long]("GROUPING_KEY")) > val rhsKV = rhs.groupByKey((r: Row) => r.getAs[Long]("GROUPING_KEY")) > val cogrouped = lhsKV.cogroup(rhsKV)( > (a: Long, b: Iterator[Row], c: Iterator[Row]) => Iterator(0L) > ) > val joined = rhs.join(cogrouped, col("ID") === col("value"), "left") > {code} > > > it fails with: > {code:java} > java.lang.ClassCastException: > org.apache.spark.sql.catalyst.expressions.objects.AssertNotNull cannot be > cast to org.apache.spark.sql.catalyst.analysis.UnresolvedDeserializer {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43781) IllegalStateException when cogrouping two datasets derived from the same source
[ https://issues.apache.org/jira/browse/SPARK-43781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-43781: Fix Version/s: 4.0.0 (was: 3.5.0) > IllegalStateException when cogrouping two datasets derived from the same > source > --- > > Key: SPARK-43781 > URL: https://issues.apache.org/jira/browse/SPARK-43781 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.1, 3.4.0 > Environment: Reproduces in a unit test, using Spark 3.3.1, the Java > API, and a {{local[2]}} SparkSession. >Reporter: Derek Murray >Assignee: Jia Fan >Priority: Major > Fix For: 4.0.0 > > > Attempting to {{cogroup}} two datasets derived from the same source dataset > yields an {{IllegalStateException}} when the query is executed. > Minimal reproducer: > {code:java} > StructType inputType = DataTypes.createStructType( > new StructField[]{ > DataTypes.createStructField("id", DataTypes.LongType, false), > DataTypes.createStructField("type", DataTypes.StringType, false) > } > ); > StructType keyType = DataTypes.createStructType( > new StructField[]{ > DataTypes.createStructField("id", DataTypes.LongType, false) > } > ); > List<Row> inputRows = new ArrayList<>(); > inputRows.add(RowFactory.create(1L, "foo")); > inputRows.add(RowFactory.create(1L, "bar")); > inputRows.add(RowFactory.create(2L, "foo")); > Dataset<Row> input = sparkSession.createDataFrame(inputRows, inputType); > KeyValueGroupedDataset<Row, Row> fooGroups = input > .filter("type = 'foo'") > .groupBy("id") > .as(RowEncoder.apply(keyType), RowEncoder.apply(inputType)); > KeyValueGroupedDataset<Row, Row> barGroups = input > .filter("type = 'bar'") > .groupBy("id") > .as(RowEncoder.apply(keyType), RowEncoder.apply(inputType)); > Dataset<Row> result = fooGroups.cogroup( > barGroups, > (CoGroupFunction<Row, Row, Row, Row>) (row, iterator, iterator1) -> new > ArrayList<Row>().iterator(), > RowEncoder.apply(inputType)); > result.explain(); > result.show();{code} > Explain output (note the mismatch in column IDs between Sort/Exchange and > LocalTableScan on the first input to the CoGroup): > {code:java} > == Physical Plan == > AdaptiveSparkPlan isFinalPlan=false > +- SerializeFromObject > [validateexternaltype(getexternalrowfield(assertnotnull(input[0, > org.apache.spark.sql.Row, true]), 0, id), LongType, false) AS id#37L, > staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, > fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, > org.apache.spark.sql.Row, true]), 1, type), StringType, false), true, false, > true) AS type#38] > +- CoGroup > org.apache.spark.sql.KeyValueGroupedDataset$$Lambda$1478/1869116781@77856cc5, > createexternalrow(id#16L, StructField(id,LongType,false)), > createexternalrow(id#16L, type#17.toString, StructField(id,LongType,false), > StructField(type,StringType,false)), createexternalrow(id#16L, > type#17.toString, StructField(id,LongType,false), > StructField(type,StringType,false)), [id#39L], [id#39L], [id#39L, type#40], > [id#39L, type#40], obj#36: org.apache.spark.sql.Row > :- !Sort [id#39L ASC NULLS FIRST], false, 0 > : +- !Exchange hashpartitioning(id#39L, 2), ENSURE_REQUIREMENTS, > [plan_id=19] > : +- LocalTableScan [id#16L, type#17] > +- Sort [id#39L ASC NULLS FIRST], false, 0 > +- Exchange hashpartitioning(id#39L, 2), ENSURE_REQUIREMENTS, > [plan_id=20] > +- LocalTableScan [id#39L, type#40]{code} > Exception: > {code:java} > java.lang.IllegalStateException: Couldn't find id#39L in [id#16L,type#17] > at >
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:80) > at > org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:73) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:589) > at scala.collection.immutable.ArraySeq.map(ArraySeq.scala:75) > at scala.collection.immutable.ArraySeq.map(ArraySeq.scala:35) > at > org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:698) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:589) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$tra
[jira] [Created] (SPARK-48718) ClassCastException when the deserializer in cogroup is resolved during application of the DeduplicateRelation rule
Xinyi Yu created SPARK-48718: Summary: ClassCastException when the deserializer in cogroup is resolved during application of the DeduplicateRelation rule Key: SPARK-48718 URL: https://issues.apache.org/jira/browse/SPARK-48718 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 4.0.0 Reporter: Xinyi Yu When running the following commands: {code:java} val lhs = spark.createDataFrame( List(Row(123L)).asJava, StructType(Seq(StructField("GROUPING_KEY", LongType))) ) val rhs = spark.createDataFrame( List(Row(0L, 123L)).asJava, StructType(Seq(StructField("ID", LongType), StructField("GROUPING_KEY", LongType))) ) val lhsKV = lhs.groupByKey((r: Row) => r.getAs[Long]("GROUPING_KEY")) val rhsKV = rhs.groupByKey((r: Row) => r.getAs[Long]("GROUPING_KEY")) val cogrouped = lhsKV.cogroup(rhsKV)( (a: Long, b: Iterator[Row], c: Iterator[Row]) => Iterator(0L) ) val joined = rhs.join(cogrouped, col("ID") === col("value"), "left") {code} It fails with: {code:java} java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.objects.AssertNotNull cannot be cast to org.apache.spark.sql.catalyst.analysis.UnresolvedDeserializer {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48717) Python foreachBatch streaming query cannot be stopped gracefully after pin thread mode is enabled and is running spark queries
Wei Liu created SPARK-48717: --- Summary: Python foreachBatch streaming query cannot be stopped gracefully after pin thread mode is enabled and is running spark queries Key: SPARK-48717 URL: https://issues.apache.org/jira/browse/SPARK-48717 Project: Spark Issue Type: New Feature Components: PySpark, SS Affects Versions: 4.0.0 Reporter: Wei Liu Follow-up of https://issues.apache.org/jira/browse/SPARK-39218. That fix only considered the case where InterruptedException is thrown while time.sleep(10) is interrupted. But when a Spark query is executing: {code:java} def func(batch_df, batch_id): batch_df.sparkSession.range(1000).write.saveAsTable("oops") print(batch_df.count()) {code} the actual error would be: {code:java} py4j.protocol.Py4JJavaError: An error occurred while calling o2141502.saveAsTable. : org.apache.spark.SparkException: Job aborted. ... at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.$anonfun$run$2(StreamExecution.scala:262) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:262) *Caused by: java.lang.InterruptedException at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1000)* at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1308) {code} We should also handle this scenario. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
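A minimal sketch of the kind of handling the ticket asks for, assuming the fix lives in the Python-side wrapper around the user function (names here are illustrative, not the actual PySpark internals):

{code:python}
from py4j.protocol import Py4JJavaError

def _call_foreach_batch(user_func, batch_df, batch_id):
    try:
        user_func(batch_df, batch_id)
    except Py4JJavaError as e:
        # When the query is stopped, the JVM interrupts the foreachBatch
        # thread. An in-flight Spark action then surfaces not as a bare
        # InterruptedException but as a Py4JJavaError whose root cause is
        # java.lang.InterruptedException (see the trace above). Swallow
        # that case so stop() stays graceful; re-raise real user errors.
        if "java.lang.InterruptedException" in str(e):
            return
        raise
{code}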
[jira] [Resolved] (SPARK-48638) Native QueryExecution information for the dataframe
[ https://issues.apache.org/jira/browse/SPARK-48638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-48638. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46996 [https://github.com/apache/spark/pull/46996] > Native QueryExecution information for the dataframe > --- > > Key: SPARK-48638 > URL: https://issues.apache.org/jira/browse/SPARK-48638 > Project: Spark > Issue Type: Improvement > Components: Connect, PySpark >Affects Versions: 4.0.0 >Reporter: Martin Grund >Assignee: Martin Grund >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Adding a new property to `DataFrame` called `queryExecution` that returns a > class that contains information about the query execution and its metrics. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48638) Native QueryExecution information for the dataframe
[ https://issues.apache.org/jira/browse/SPARK-48638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-48638: Assignee: Martin Grund > Native QueryExecution information for the dataframe > --- > > Key: SPARK-48638 > URL: https://issues.apache.org/jira/browse/SPARK-48638 > Project: Spark > Issue Type: Improvement > Components: Connect, PySpark >Affects Versions: 4.0.0 >Reporter: Martin Grund >Assignee: Martin Grund >Priority: Major > Labels: pull-request-available > > Adding a new property to `DataFrame` called `queryExecution` that returns a > class that contains information about the query execution and its metrics. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48716) Add JobGroupId to onSqlStart
Lingkai Kong created SPARK-48716: Summary: Add JobGroupId to onSqlStart Key: SPARK-48716 URL: https://issues.apache.org/jira/browse/SPARK-48716 Project: Spark Issue Type: Task Components: SQL Affects Versions: 3.5.1 Reporter: Lingkai Kong -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] (SPARK-39901) Reconsider design of ignoreCorruptFiles feature
[ https://issues.apache.org/jira/browse/SPARK-39901 ] Wei Guo deleted comment on SPARK-39901: - was (Author: wayne guo): The `ignoreCorruptFiles` features in the SQL (spark.sql.files.ignoreCorruptFiles) and RDD (spark.files.ignoreCorruptFiles) scenarios both need to be included. > Reconsider design of ignoreCorruptFiles feature > --- > > Key: SPARK-39901 > URL: https://issues.apache.org/jira/browse/SPARK-39901 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Josh Rosen >Priority: Major > > I'm filing this ticket as a followup to the discussion at > [https://github.com/apache/spark/pull/36775#issuecomment-1148136217] > regarding the `ignoreCorruptFiles` feature: the current implementation is > biased towards considering a broad range of IOExceptions to be corruption, but > this is likely overly broad and might mis-identify transient errors as > corruption (causing non-corrupt data to be erroneously discarded). > SPARK-39389 fixes one instance of that problem, but we are still vulnerable > to similar issues because of the overall design of this feature. > I think we should reconsider the design of this feature: maybe we should > switch the default behavior so that only an explicit allowlist of known > corruption exceptions can cause files to be skipped. This could be done > through involvement of other parts of the code, e.g. rewrapping exceptions > into a `CorruptFileException` so higher layers can positively identify > corruption. > Any changes to behavior here could potentially impact users' jobs, so we'd > need to think carefully about when we want to change (in a 3.x release? 4.x?) > and how we want to provide escape hatches (e.g. configs to revert back to old > behavior). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
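For reference, the two flags named in the deleted comment are independent settings, which is why both scenarios would need to be covered by any redesign:

{code:python}
from pyspark.sql import SparkSession

# The SQL flag governs file-based SQL scans; the core flag governs RDD
# inputs such as sc.textFile. Enabling one does not enable the other.
spark = (SparkSession.builder
         .config("spark.sql.files.ignoreCorruptFiles", "true")
         .config("spark.files.ignoreCorruptFiles", "true")
         .getOrCreate())
{code}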
[jira] [Created] (SPARK-48715) UTF8String - Java String conversions should use Unicode replacement logic
Uroš Bojanić created SPARK-48715: Summary: UTF8String - Java String conversions should use Unicode replacement logic Key: SPARK-48715 URL: https://issues.apache.org/jira/browse/SPARK-48715 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 4.0.0 Reporter: Uroš Bojanić -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
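The ticket has no description yet; by convention, "Unicode replacement logic" means mapping invalid byte sequences to the replacement character U+FFFD rather than throwing or silently corrupting data. A one-liner demonstrating that convention (Python used only to illustrate the behavior):

{code:python}
# Invalid UTF-8 bytes decode to the replacement character U+FFFD.
print(b"ab\xffcd".decode("utf-8", errors="replace"))  # -> 'ab\ufffdcd'
{code}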
[jira] [Resolved] (SPARK-48578) Add new expressions for UTF8 string validation
[ https://issues.apache.org/jira/browse/SPARK-48578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-48578. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 46845 [https://github.com/apache/spark/pull/46845] > Add new expressions for UTF8 string validation > -- > > Key: SPARK-48578 > URL: https://issues.apache.org/jira/browse/SPARK-48578 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Assignee: Uroš Bojanić >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48578) Add new expressions for UTF8 string validation
[ https://issues.apache.org/jira/browse/SPARK-48578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-48578: --- Assignee: Uroš Bojanić > Add new expressions for UTF8 string validation > -- > > Key: SPARK-48578 > URL: https://issues.apache.org/jira/browse/SPARK-48578 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 >Reporter: Uroš Bojanić >Assignee: Uroš Bojanić >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-47927) Nullability after join not respected in UDF
[ https://issues.apache.org/jira/browse/SPARK-47927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17859940#comment-17859940 ] GridGain Integration commented on SPARK-47927: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/47081 > Nullability after join not respected in UDF > --- > > Key: SPARK-47927 > URL: https://issues.apache.org/jira/browse/SPARK-47927 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0, 3.5.1, 3.4.3 >Reporter: Emil Ejbyfeldt >Assignee: Emil Ejbyfeldt >Priority: Major > Labels: correctness, pull-request-available > Fix For: 4.0.0, 3.5.2, 3.4.4 > > > {code} > val ds1 = Seq(1).toDS() > val ds2 = Seq[Int]().toDS() > val f = udf[(Int, Option[Int]), (Int, Option[Int])](identity) > ds1.join(ds2, ds1("value") === ds2("value"), > "outer").select(f(struct(ds1("value"), ds2("value")))).show() > ds1.join(ds2, ds1("value") === ds2("value"), > "outer").select(struct(ds1("value"), ds2("value"))).show() {code} > outputs > {code}
> +---------------------------------------+
> |UDF(struct(value, value, value, value))|
> +---------------------------------------+
> |                                 {1, 0}|
> +---------------------------------------+
> +--------------------+
> |struct(value, value)|
> +--------------------+
> |           {1, NULL}|
> +--------------------+
> {code} > So when the result is passed to the UDF, the nullability after the join is > not respected and we incorrectly end up with a 0 value instead of a null/None > value. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48714) Implement df.mergeInto in PySpark
Pengfei Xu created SPARK-48714: -- Summary: Implement df.mergeInto in PySpark Key: SPARK-48714 URL: https://issues.apache.org/jira/browse/SPARK-48714 Project: Spark Issue Type: New Feature Components: PySpark Affects Versions: 4.0.0 Reporter: Pengfei Xu -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
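The ticket has no description yet. A hypothetical sketch of the PySpark surface, mirroring the existing Scala DataFrame.mergeInto API (the method names, chaining, and table/column names below are assumptions, not the final API):

{code:python}
from pyspark.sql.functions import col

# Hypothetical upsert of `source` (a DataFrame read from a table named
# "source") into the table "target", keyed on `id`.
(source.mergeInto("target", col("source.id") == col("target.id"))
       .whenMatched().updateAll()
       .whenNotMatched().insertAll()
       .merge())
{code}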
[jira] [Created] (SPARK-48713) Add index range check for UnsafeRow.pointTo when baseObject is byte array
wuyi created SPARK-48713: Summary: Add index range check for UnsafeRow.pointTo when baseObject is byte array Key: SPARK-48713 URL: https://issues.apache.org/jira/browse/SPARK-48713 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: wuyi -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-48712) Perf Improvement for Encode with empty string and UTF-8 charset
[ https://issues.apache.org/jira/browse/SPARK-48712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-48712: - Description:
Apple M2 Max
encode:   Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
-UTF-8             3672           3697          22         5.4         183.6       1.0X
+UTF-8            79270          79698         448         0.3        3963.5       1.0X
> Perf Improvement for Encode with empty string and UTF-8 charset > --- > > Key: SPARK-48712 > URL: https://issues.apache.org/jira/browse/SPARK-48712 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Kent Yao >Priority: Major > > Apple M2 Max
> encode:   Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
> -UTF-8             3672           3697          22         5.4         183.6       1.0X
> +UTF-8            79270          79698         448         0.3        3963.5       1.0X
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
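A sketch of one way to exercise the benchmarked expression (the row count is arbitrary, and the built-in noop sink is used to discard results):

{code:python}
# Drive the Encode expression with an empty string and the UTF-8 charset.
df = spark.range(10_000_000).selectExpr("encode('', 'UTF-8') AS empty_utf8")
df.write.format("noop").mode("overwrite").save()
{code}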
[jira] [Created] (SPARK-48712) Perf Improvement for Encode with empty string and UTF-8 charset
Kent Yao created SPARK-48712: Summary: Perf Improvement for Encode with empty string and UTF-8 charset Key: SPARK-48712 URL: https://issues.apache.org/jira/browse/SPARK-48712 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 4.0.0 Reporter: Kent Yao -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-48693) simplify and unify toString of Invoke and StaticInvoke
[ https://issues.apache.org/jira/browse/SPARK-48693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-48693. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 47066 [https://github.com/apache/spark/pull/47066] > simplify and unify toString of Invoke and StaticInvoke > -- > > Key: SPARK-48693 > URL: https://issues.apache.org/jira/browse/SPARK-48693 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48711) OOM killer may leave SparkContext in broken state causing ConnectionRefusedError
Rafal Wojdyla created SPARK-48711: - Summary: OOM killer may leave SparkContext in broken state causing ConnectionRefusedError Key: SPARK-48711 URL: https://issues.apache.org/jira/browse/SPARK-48711 Project: Spark Issue Type: Bug Components: PySpark, Spark Core Affects Versions: 3.5.0 Reporter: Rafal Wojdyla Related to https://issues.apache.org/jira/browse/SPARK-18523 and https://github.com/apache/spark/pull/15961. I'm currently on: {code} pyspark 3.5.0 pyhd8ed1ab_0 conda-forge py4j 0.10.9.7 pyhd8ed1ab_0 conda-forge {code} When the Spark JVM process gets OOM-killed, `SparkContext.stop` fails with `ConnectionRefusedError`, which leaves the `SparkSession/Context` in a "dirty" state. https://issues.apache.org/jira/browse/SPARK-18523 addressed this by catching the {{Py4JError}}, but it looks like the code now raises {{ConnectionRefusedError}}: {code} Traceback (most recent call last): ... File "/lib/python3.11/site-packages/pyspark/sql/session.py", line 1796, in stop self._sc.stop() File "/lib/python3.11/site-packages/pyspark/context.py", line 654, in stop self._jsc.stop() File "/lib/python3.11/site-packages/py4j/java_gateway.py", line 1321, in __call__ answer = self.gateway_client.send_command(command) ^ File "/lib/python3.11/site-packages/py4j/java_gateway.py", line 1036, in send_command connection = self._get_connection() ^^ File "/lib/python3.11/site-packages/py4j/clientserver.py", line 284, in _get_connection connection = self._create_new_connection() ^ File "/lib/python3.11/site-packages/py4j/clientserver.py", line 291, in _create_new_connection connection.connect_to_java_server() File "/lib/python3.11/site-packages/py4j/clientserver.py", line 438, in connect_to_java_server self.socket.connect((self.java_address, self.java_port)) ConnectionRefusedError: [Errno 111] Connection refused {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
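Until teardown is resilient to a dead JVM, a user-side workaround sketch (the exception set is based on the traceback above; whether further cleanup of the broken session singleton is needed is left open by the ticket):

{code:python}
from py4j.protocol import Py4JError

def stop_quietly(spark):
    # If the JVM was OOM-killed, stop() cannot reach the Py4J gateway;
    # treat the connection failure as "already stopped".
    try:
        spark.stop()
    except (ConnectionRefusedError, Py4JError):
        pass
{code}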
[jira] [Created] (SPARK-48710) Incompatibilities with NumPy 2.0
Patrick Marx created SPARK-48710: Summary: Incompatibilities with NumPy 2.0 Key: SPARK-48710 URL: https://issues.apache.org/jira/browse/SPARK-48710 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 4.0.0 Reporter: Patrick Marx PySpark references some code that was removed in NumPy 2.0. {{python/pyspark/pandas/strings.py}}: * {{np.NaN}} was removed and should be replaced with {{np.nan}} {{python/pyspark/pandas/typedef/typehints.py}} * {{np.string_}} was removed, [is an alias for|https://github.com/numpy/numpy/blob/v1.26.5/numpy/__init__.pyi#L3134] {{np.bytes_}} * {{np.float_}} was removed, [is defined the same as|https://github.com/numpy/numpy/blob/v1.26.5/numpy/__init__.pyi#L3042-3043] {{np.double}} * {{np.unicode_}} was removed, [is an alias for|https://github.com/numpy/numpy/blob/v1.26.5/numpy/__init__.pyi#L3148] {{np.str_}} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
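A compact illustration of the replacements listed above (the left-hand aliases were removed in NumPy 2.0; the assignments exist only to document the mapping):

{code:python}
import numpy as np

nan_value = np.nan      # replaces the removed np.NaN
bytes_type = np.bytes_  # replaces the removed np.string_
float_type = np.double  # replaces the removed np.float_
str_type = np.str_      # replaces the removed np.unicode_
{code}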
[jira] [Assigned] (SPARK-48698) Support analyze column stats for tables with collated columns
[ https://issues.apache.org/jira/browse/SPARK-48698?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-48698: -- Assignee: (was: Apache Spark) > Support analyze column stats for tables with collated columns > - > > Key: SPARK-48698 > URL: https://issues.apache.org/jira/browse/SPARK-48698 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 >Reporter: Nikola Mandic >Priority: Major > > The following sequence fails: > {code:java} > > create table t(s string collate utf8_lcase) using parquet; > > insert into t values ('A'); > > analyze table t compute statistics for all columns; > [UNSUPPORTED_FEATURE.ANALYZE_UNSUPPORTED_COLUMN_TYPE] The feature is not > supported: The ANALYZE TABLE FOR COLUMNS command does not support the type > "STRING COLLATE UTF8_LCASE" of the column `s` in the table > `spark_catalog`.`default`.`t`. SQLSTATE: 0A000 > {code} > Users should be able to run ANALYZE commands on tables that have columns > with collated types. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-48177) Upgrade `Parquet` to 1.14.1
[ https://issues.apache.org/jira/browse/SPARK-48177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-48177: -- Assignee: Apache Spark (was: Fokko Driesprong) > Upgrade `Parquet` to 1.14.1 > --- > > Key: SPARK-48177 > URL: https://issues.apache.org/jira/browse/SPARK-48177 > Project: Spark > Issue Type: Sub-task > Components: Build >Affects Versions: 4.0.0 >Reporter: Fokko Driesprong >Assignee: Apache Spark >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-48709) varchar resolution mismatch for DataSourceV2 CTAS
Yuming Wang created SPARK-48709: --- Summary: varchar resolution mismatch for DataSourceV2 CTAS Key: SPARK-48709 URL: https://issues.apache.org/jira/browse/SPARK-48709 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.5.1, 3.5.0, 4.0.0 Reporter: Yuming Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org