[jira] [Updated] (SPARK-40439) DECIMAL value with more precision than what is defined in the schema raises exception in SparkSQL but evaluates to NULL for DataFrame
[ https://issues.apache.org/jira/browse/SPARK-40439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

xsys updated SPARK-40439:
-------------------------
    Description:

h3. Describe the bug

We are trying to store a DECIMAL value, {{333.22}}, with more precision than what is defined in the schema: {{DECIMAL(20,10)}}. This leads to a {{NULL}} value being stored if the table is created using DataFrames via {{spark-shell}}. However, it leads to the following exception if the table is created via {{spark-sql}}:

{code:java}
Failed in [insert into decimal_extra_precision select 333.22]
java.lang.ArithmeticException: Decimal(expanded,333.22,21,10}) cannot be represented as Decimal(20, 10)
{code}

h3. Steps to reproduce

On Spark 3.2.1 (commit {{4f25b3f712}}), using {{spark-sql}}, execute the following:

{code:java}
create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC;
insert into decimal_extra_precision select 333.22;
{code}

h3. Expected behavior

We expect the two Spark interfaces ({{spark-sql}} & {{spark-shell}}) to behave consistently for the same data type & input combination ({{DECIMAL(20,10)}} and {{333.22}}).
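For context, the overflow check at the heart of this report can be sketched in plain Python: round the value to the target scale, then compare the total digit count against the target precision. This is an illustrative approximation using the standard {{decimal}} module, not Spark's actual implementation, and the helper name {{fits_decimal}} is hypothetical. Under this rule, {{333.22}} needs only 13 significant digits at scale 10, so it should fit {{DECIMAL(20,10)}}:

```python
from decimal import Decimal, ROUND_HALF_UP

def fits_decimal(value: str, precision: int, scale: int) -> bool:
    """Approximate fit test for DECIMAL(precision, scale):
    round the value to `scale` fractional digits (half-up),
    then check that the total number of significant digits
    does not exceed `precision`."""
    rounded = Decimal(value).quantize(
        Decimal(1).scaleb(-scale), rounding=ROUND_HALF_UP)
    return len(rounded.as_tuple().digits) <= precision

# 333.22 at scale 10 is 333.2200000000 -> 13 digits, within precision 20
print(fits_decimal("333.22", 20, 10))         # True
# A value needing more than 10 integral digits overflows DECIMAL(20,10)
print(fits_decimal("12345678901.5", 20, 10))  # False
```

By this accounting the value fits, which is consistent with the report's point that the {{spark-sql}} exception (and the {{NULL}} from DataFrames) is surprising for this input.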
Here is a simplified example in {{spark-shell}}, where insertion of the aforementioned decimal value evaluates to {{NULL}}:

{code:java}
scala> import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.{Row, SparkSession}

scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._

scala> val rdd = sc.parallelize(Seq(Row(BigDecimal("333.22"))))
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = ParallelCollectionRDD[0] at parallelize at <console>:27

scala> val schema = new StructType().add(StructField("c1", DecimalType(20, 10), true))
schema: org.apache.spark.sql.types.StructType = StructType(StructField(c1,DecimalType(20,10),true))

scala> val df = spark.createDataFrame(rdd, schema)
df: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]

scala> df.show()
+----+
|  c1|
+----+
|null|
+----+

scala> df.write.mode("overwrite").format("orc").saveAsTable("decimal_extra_precision")
22/08/29 10:33:47 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.

scala> spark.sql("select * from decimal_extra_precision;")
res2: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
{code}

h3. Root Cause

The exception is raised from [Decimal|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala#L358-L373] ({{nullOnOverflow}} is controlled by {{spark.sql.ansi.enabled}} in [SQLConf|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L2542-L2551]):

{code:java}
private[sql] def toPrecision(
    precision: Int,
    scale: Int,
    roundMode: BigDecimal.RoundingMode.Value = ROUND_HALF_UP,
    nullOnOverflow: Boolean = true,
    context: SQLQueryContext = null): Decimal = {
  val copy = clone()
  if (copy.changePrecision(precision, scale, roundMode)) {
    copy
  } else {
    if (nullOnOverflow) {
      null
    } else {
      throw QueryExecutionErrors.cannotChangeDecimalPrecisionError(
        this, precision, scale, context)
    }
  }
}
{code}

The above function is invoked from [Cast.scala|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L754-L756]. However, our attempt to insert {{333.22}} after setting {{spark.sql.ansi.enabled}} to {{false}} failed as well (which may be an independent issue).
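The ANSI-mode attempt mentioned above can be written out as a {{spark-sql}} session (a sketch of what we tried; per the report, the insert still failed even with ANSI mode disabled, which may be an independent issue):

```sql
-- With ANSI mode off, nullOnOverflow should be true and the
-- overflowing cast should yield NULL rather than an exception.
SET spark.sql.ansi.enabled=false;
insert into decimal_extra_precision select 333.22;
```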
[jira] [Created] (SPARK-40439) DECIMAL value with more precision than what is defined in the schema raises exception in SparkSQL but evaluates to NULL for DataFrame
xsys created SPARK-40439:
-------------------------

             Summary: DECIMAL value with more precision than what is defined in the schema raises exception in SparkSQL but evaluates to NULL for DataFrame
                 Key: SPARK-40439
                 URL: https://issues.apache.org/jira/browse/SPARK-40439
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.2.1
            Reporter: xsys
--
This message was sent by Atlassian Jira
(v8.20.10#820010)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40435) Add test suites for applyInPandasWithState in PySpark
[ https://issues.apache.org/jira/browse/SPARK-40435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605101#comment-17605101 ]

Apache Spark commented on SPARK-40435:
--------------------------------------

User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/37894

> Add test suites for applyInPandasWithState in PySpark
> -----------------------------------------------------
>
>                 Key: SPARK-40435
>                 URL: https://issues.apache.org/jira/browse/SPARK-40435
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Structured Streaming
>    Affects Versions: 3.4.0
>            Reporter: Jungtaek Lim
>            Priority: Major
>
> Basically port the test suite from the Scala/Java version of the API to the Python API.
> Have an e2e test suite purely implemented in Python.
[jira] [Assigned] (SPARK-40435) Add test suites for applyInPandasWithState in PySpark
[ https://issues.apache.org/jira/browse/SPARK-40435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-40435:
------------------------------------

    Assignee: (was: Apache Spark)

> Add test suites for applyInPandasWithState in PySpark
> -----------------------------------------------------
>
>                 Key: SPARK-40435
>                 URL: https://issues.apache.org/jira/browse/SPARK-40435
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Structured Streaming
>    Affects Versions: 3.4.0
>            Reporter: Jungtaek Lim
>            Priority: Major
>
> Basically port the test suite from the Scala/Java version of the API to the Python API.
> Have an e2e test suite purely implemented in Python.
[jira] [Assigned] (SPARK-40435) Add test suites for applyInPandasWithState in PySpark
[ https://issues.apache.org/jira/browse/SPARK-40435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-40435:
------------------------------------

    Assignee: Apache Spark

> Add test suites for applyInPandasWithState in PySpark
> -----------------------------------------------------
>
>                 Key: SPARK-40435
>                 URL: https://issues.apache.org/jira/browse/SPARK-40435
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Structured Streaming
>    Affects Versions: 3.4.0
>            Reporter: Jungtaek Lim
>            Assignee: Apache Spark
>            Priority: Major
>
> Basically port the test suite from the Scala/Java version of the API to the Python API.
> Have an e2e test suite purely implemented in Python.
[jira] [Assigned] (SPARK-40434) Implement applyInPandasWithState in PySpark
[ https://issues.apache.org/jira/browse/SPARK-40434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-40434:
------------------------------------

    Assignee: Apache Spark

> Implement applyInPandasWithState in PySpark
> -------------------------------------------
>
>                 Key: SPARK-40434
>                 URL: https://issues.apache.org/jira/browse/SPARK-40434
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Structured Streaming
>    Affects Versions: 3.4.0
>            Reporter: Jungtaek Lim
>            Assignee: Apache Spark
>            Priority: Major
>
> Provide the full implementation of the flatMapGroupsWithState-equivalent API in
> PySpark. We could optionally introduce test suites in a following JIRA ticket
> if the PR is too huge.
[jira] [Assigned] (SPARK-40434) Implement applyInPandasWithState in PySpark
[ https://issues.apache.org/jira/browse/SPARK-40434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40434: Assignee: (was: Apache Spark) > Implement applyInPandasWithState in PySpark > --- > > Key: SPARK-40434 > URL: https://issues.apache.org/jira/browse/SPARK-40434 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Jungtaek Lim >Priority: Major > > Provide the full implementation of flatMapGroupsWithState equivalent API in > PySpark. We could optionally introduce test suites in following JIRA ticket > if the PR is too huge.
[jira] [Commented] (SPARK-40434) Implement applyInPandasWithState in PySpark
[ https://issues.apache.org/jira/browse/SPARK-40434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605095#comment-17605095 ] Apache Spark commented on SPARK-40434: -- User 'HeartSaVioR' has created a pull request for this issue: https://github.com/apache/spark/pull/37893 > Implement applyInPandasWithState in PySpark > --- > > Key: SPARK-40434 > URL: https://issues.apache.org/jira/browse/SPARK-40434 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Jungtaek Lim >Priority: Major > > Provide the full implementation of flatMapGroupsWithState equivalent API in > PySpark. We could optionally introduce test suites in following JIRA ticket > if the PR is too huge.
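The flatMapGroupsWithState-style semantics this ticket ports to PySpark can be sketched without Spark at all: group incoming rows by key, thread a per-group state value through a user function, and emit its output. The function and variable names below are illustrative only, not the actual applyInPandasWithState API surface:

```python
# Toy model of arbitrary stateful processing over grouped rows.
# Illustrates the flatMapGroupsWithState semantics the ticket targets;
# it is NOT the real applyInPandasWithState API.
from collections import defaultdict

def apply_with_state(rows, key_fn, update_fn):
    """For each key, fold its rows through update_fn while carrying state."""
    state = {}
    out = []
    grouped = defaultdict(list)
    for row in rows:
        grouped[key_fn(row)].append(row)
    for key, group in grouped.items():
        new_state, emitted = update_fn(key, group, state.get(key))
        state[key] = new_state
        out.extend(emitted)
    return out, state

# Example: running count of events per user key.
def count_update(key, group, prev):
    total = (prev or 0) + len(group)
    return total, [(key, total)]

events = [("a", 1), ("b", 2), ("a", 3)]
out, state = apply_with_state(events, lambda r: r[0], count_update)
```

The real API additionally handles timeouts and state eviction per group, which this sketch omits.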
[jira] [Commented] (SPARK-40437) Support string representation of durationMs in GroupState.setTimeoutDuration
[ https://issues.apache.org/jira/browse/SPARK-40437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605089#comment-17605089 ] Hyukjin Kwon commented on SPARK-40437: -- [~kabhwan] I didn't add this to SPARK-40431 because I think this isn't our priority for the initial implementation but feel free to add it. I don't mind either way. > Support string representation of durationMs in GroupState.setTimeoutDuration > > > Key: SPARK-40437 > URL: https://issues.apache.org/jira/browse/SPARK-40437 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Minor > > GroupStateImpl.setTimeoutDuration should support string representation to > match with Scala's side support.
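The ticket asks that setTimeoutDuration accept strings such as "10 seconds" in addition to a raw millisecond count, matching the Scala side. A minimal sketch of such parsing follows; the function name, the regex, and the unit table are assumptions for illustration, not Spark's actual implementation:

```python
import re

# Hypothetical parser sketch: converts "10 seconds"-style strings into
# milliseconds, the form a timeout duration ultimately needs.
_UNITS_MS = {
    "millisecond": 1, "second": 1000, "minute": 60_000,
    "hour": 3_600_000, "day": 86_400_000,
}

def duration_to_ms(duration):
    if isinstance(duration, int):          # already a millisecond count
        return duration
    m = re.fullmatch(r"\s*(\d+)\s*([a-z]+?)s?\s*", duration.lower())
    if not m or m.group(2) not in _UNITS_MS:
        raise ValueError(f"cannot parse duration: {duration!r}")
    return int(m.group(1)) * _UNITS_MS[m.group(2)]
```

For example, `duration_to_ms("10 seconds")` yields 10000, while an integer argument passes through unchanged.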
[jira] [Commented] (SPARK-40438) Support additionalDuration parameter in GroupState.setTimeoutTimestamp
[ https://issues.apache.org/jira/browse/SPARK-40438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605088#comment-17605088 ] Hyukjin Kwon commented on SPARK-40438: -- [~kabhwan] I didn't add this to SPARK-40431 because I think this isn't our priority for the initial implementation but feel free to add it. I don't mind either way. > Support additionalDuration parameter in GroupState.setTimeoutTimestamp > -- > > Key: SPARK-40438 > URL: https://issues.apache.org/jira/browse/SPARK-40438 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Minor > > GroupState.setTimeoutTimestamp should support additionalDuration parameter to > match with Scala's side support.
[jira] [Updated] (SPARK-40438) Support additionalDuration parameter in GroupState.setTimeoutTimestamp
[ https://issues.apache.org/jira/browse/SPARK-40438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-40438: - Summary: Support additionalDuration parameter in GroupState.setTimeoutTimestamp (was: Support in GroupState.setTimeoutTimestamp) > Support additionalDuration parameter in GroupState.setTimeoutTimestamp > -- > > Key: SPARK-40438 > URL: https://issues.apache.org/jira/browse/SPARK-40438 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Minor > > GroupStateImpl.additionalDuration should support string representation to > match with Scala's side support.
[jira] [Updated] (SPARK-40438) Support additionalDuration parameter in GroupState.setTimeoutTimestamp
[ https://issues.apache.org/jira/browse/SPARK-40438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-40438: - Description: GroupState.setTimeoutTimestamp should support additionalDuration parameter to match with Scala's side support. (was: GroupState.setTimeoutTimestamp should support string representation to match with Scala's side support.) > Support additionalDuration parameter in GroupState.setTimeoutTimestamp > -- > > Key: SPARK-40438 > URL: https://issues.apache.org/jira/browse/SPARK-40438 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Minor > > GroupState.setTimeoutTimestamp should support additionalDuration parameter to > match with Scala's side support.
[jira] [Updated] (SPARK-40438) Support in GroupState.setTimeoutTimestamp
[ https://issues.apache.org/jira/browse/SPARK-40438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-40438: - Summary: Support in GroupState.setTimeoutTimestamp (was: Implement additionalDuration parameter in GroupState) > Support in GroupState.setTimeoutTimestamp > -- > > Key: SPARK-40438 > URL: https://issues.apache.org/jira/browse/SPARK-40438 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Minor > > GroupStateImpl.additionalDuration should support string representation to > match with Scala's side support.
[jira] [Updated] (SPARK-40437) Support string representation of durationMs in GroupState.setTimeoutDuration
[ https://issues.apache.org/jira/browse/SPARK-40437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-40437: - Summary: Support string representation of durationMs in GroupState.setTimeoutDuration (was: Support string representation of durationMs in GroupState) > Support string representation of durationMs in GroupState.setTimeoutDuration > > > Key: SPARK-40437 > URL: https://issues.apache.org/jira/browse/SPARK-40437 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Minor > > GroupStateImpl.setTimeoutDuration should support string representation to > match with Scala's side support.
[jira] [Updated] (SPARK-40438) Implement additionalDuration parameter in GroupState
[ https://issues.apache.org/jira/browse/SPARK-40438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-40438: - Description: GroupStateImpl.additionalDuration should support string representation to match with Scala's side support. (was: GroupStateImpl.setTimeoutDuration should support string representation to match with Scala's side support.) > Implement additionalDuration parameter in GroupState > > > Key: SPARK-40438 > URL: https://issues.apache.org/jira/browse/SPARK-40438 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Minor > > GroupStateImpl.additionalDuration should support string representation to > match with Scala's side support.
[jira] [Updated] (SPARK-40438) Support additionalDuration parameter in GroupState.setTimeoutTimestamp
[ https://issues.apache.org/jira/browse/SPARK-40438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-40438: - Description: GroupState.setTimeoutTimestamp should support string representation to match with Scala's side support. (was: GroupStateImpl.additionalDuration should support string representation to match with Scala's side support.) > Support additionalDuration parameter in GroupState.setTimeoutTimestamp > -- > > Key: SPARK-40438 > URL: https://issues.apache.org/jira/browse/SPARK-40438 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Minor > > GroupState.setTimeoutTimestamp should support string representation to match > with Scala's side support.
[jira] [Updated] (SPARK-40438) Implement additionalDuration parameter in GroupState
[ https://issues.apache.org/jira/browse/SPARK-40438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-40438: - Description: GroupStateImpl.setTimeoutDuration should support string representation to match with Scala's side support. > Implement additionalDuration parameter in GroupState > > > Key: SPARK-40438 > URL: https://issues.apache.org/jira/browse/SPARK-40438 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Minor > > GroupStateImpl.setTimeoutDuration should support string representation to > match with Scala's side support.
[jira] [Created] (SPARK-40438) Implement additionalDuration parameter in GroupState
Hyukjin Kwon created SPARK-40438: Summary: Implement additionalDuration parameter in GroupState Key: SPARK-40438 URL: https://issues.apache.org/jira/browse/SPARK-40438 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.4.0 Reporter: Hyukjin Kwon
[jira] [Created] (SPARK-40437) Support string representation of durationMs in GroupState
Hyukjin Kwon created SPARK-40437: Summary: Support string representation of durationMs in GroupState Key: SPARK-40437 URL: https://issues.apache.org/jira/browse/SPARK-40437 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.4.0 Reporter: Hyukjin Kwon GroupStateImpl.setTimeoutDuration should support string representation to match with Scala's side support.
[jira] [Updated] (SPARK-40437) Support string representation of durationMs in GroupState
[ https://issues.apache.org/jira/browse/SPARK-40437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-40437: - Priority: Minor (was: Major) > Support string representation of durationMs in GroupState > - > > Key: SPARK-40437 > URL: https://issues.apache.org/jira/browse/SPARK-40437 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Minor > > GroupStateImpl.setTimeoutDuration should support string representation to > match with Scala's side support.
[jira] [Commented] (SPARK-40436) Upgrade Scala to 2.12.17
[ https://issues.apache.org/jira/browse/SPARK-40436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605082#comment-17605082 ] Apache Spark commented on SPARK-40436: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/37892 > Upgrade Scala to 2.12.17 > > > Key: SPARK-40436 > URL: https://issues.apache.org/jira/browse/SPARK-40436 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Minor > > https://github.com/scala/scala/releases/tag/v2.12.17
[jira] [Assigned] (SPARK-40436) Upgrade Scala to 2.12.17
[ https://issues.apache.org/jira/browse/SPARK-40436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40436: Assignee: (was: Apache Spark) > Upgrade Scala to 2.12.17 > > > Key: SPARK-40436 > URL: https://issues.apache.org/jira/browse/SPARK-40436 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Minor > > https://github.com/scala/scala/releases/tag/v2.12.17
[jira] [Assigned] (SPARK-40436) Upgrade Scala to 2.12.17
[ https://issues.apache.org/jira/browse/SPARK-40436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40436: Assignee: Apache Spark > Upgrade Scala to 2.12.17 > > > Key: SPARK-40436 > URL: https://issues.apache.org/jira/browse/SPARK-40436 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Apache Spark >Priority: Minor > > https://github.com/scala/scala/releases/tag/v2.12.17
[jira] [Commented] (SPARK-40436) Upgrade Scala to 2.12.17
[ https://issues.apache.org/jira/browse/SPARK-40436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605081#comment-17605081 ] Apache Spark commented on SPARK-40436: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/37892 > Upgrade Scala to 2.12.17 > > > Key: SPARK-40436 > URL: https://issues.apache.org/jira/browse/SPARK-40436 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Minor > > https://github.com/scala/scala/releases/tag/v2.12.17
[jira] [Assigned] (SPARK-40433) Add toJVMRow in PythonSQLUtils to convert pickled PySpark Row to JVM Row
[ https://issues.apache.org/jira/browse/SPARK-40433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40433: Assignee: (was: Apache Spark) > Add toJVMRow in PythonSQLUtils to convert pickled PySpark Row to JVM Row > > > Key: SPARK-40433 > URL: https://issues.apache.org/jira/browse/SPARK-40433 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Jungtaek Lim >Priority: Major > > Adds toJVMRow in PythonSQLUtils to convert pickled PySpark Row to JVM Row.
[jira] [Assigned] (SPARK-40433) Add toJVMRow in PythonSQLUtils to convert pickled PySpark Row to JVM Row
[ https://issues.apache.org/jira/browse/SPARK-40433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40433: Assignee: Apache Spark > Add toJVMRow in PythonSQLUtils to convert pickled PySpark Row to JVM Row > > > Key: SPARK-40433 > URL: https://issues.apache.org/jira/browse/SPARK-40433 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Jungtaek Lim >Assignee: Apache Spark >Priority: Major > > Adds toJVMRow in PythonSQLUtils to convert pickled PySpark Row to JVM Row.
[jira] [Created] (SPARK-40436) Upgrade Scala to 2.12.17
Yang Jie created SPARK-40436: Summary: Upgrade Scala to 2.12.17 Key: SPARK-40436 URL: https://issues.apache.org/jira/browse/SPARK-40436 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.4.0 Reporter: Yang Jie https://github.com/scala/scala/releases/tag/v2.12.17
[jira] [Commented] (SPARK-40433) Add toJVMRow in PythonSQLUtils to convert pickled PySpark Row to JVM Row
[ https://issues.apache.org/jira/browse/SPARK-40433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605073#comment-17605073 ] Apache Spark commented on SPARK-40433: -- User 'HeartSaVioR' has created a pull request for this issue: https://github.com/apache/spark/pull/37891 > Add toJVMRow in PythonSQLUtils to convert pickled PySpark Row to JVM Row > > > Key: SPARK-40433 > URL: https://issues.apache.org/jira/browse/SPARK-40433 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Jungtaek Lim >Priority: Major > > Adds toJVMRow in PythonSQLUtils to convert pickled PySpark Row to JVM Row.
[jira] [Commented] (SPARK-40339) Implement `Expanding.quantile`.
[ https://issues.apache.org/jira/browse/SPARK-40339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605072#comment-17605072 ] Apache Spark commented on SPARK-40339: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/37890 > Implement `Expanding.quantile`. > --- > > Key: SPARK-40339 > URL: https://issues.apache.org/jira/browse/SPARK-40339 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Yikun Jiang >Priority: Major > Fix For: 3.4.0 > > > We should implement `Expanding.quantile` for increasing pandas API coverage. > pandas docs: > https://pandas.pydata.org/docs/reference/api/pandas.core.window.expanding.Expanding.quantile.html
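`Expanding.quantile` computes, at each position, a quantile over all rows seen so far. A plain-Python reference of those semantics, using linear interpolation between order statistics (pandas' default interpolation); this is a simplified sketch of the behavior, not the distributed pandas-on-Spark implementation:

```python
def _quantile(xs, q):
    """Linear-interpolation quantile over a finite sample."""
    s = sorted(xs)
    pos = (len(s) - 1) * q
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    frac = pos - lo
    return s[lo] + (s[hi] - s[lo]) * frac

def expanding_quantile(values, q, min_periods=1):
    """Quantile of the growing prefix values[:i]; None until min_periods."""
    out = []
    for i in range(1, len(values) + 1):
        out.append(_quantile(values[:i], q) if i >= min_periods else None)
    return out
```

For example, `expanding_quantile([1, 2, 3, 4], 0.5)` gives the running medians `[1.0, 1.5, 2.0, 2.5]`.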
[jira] [Commented] (SPARK-40342) Implement `Rolling.quantile`.
[ https://issues.apache.org/jira/browse/SPARK-40342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605071#comment-17605071 ] Apache Spark commented on SPARK-40342: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/37890 > Implement `Rolling.quantile`. > - > > Key: SPARK-40342 > URL: https://issues.apache.org/jira/browse/SPARK-40342 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Yikun Jiang >Priority: Major > Fix For: 3.4.0 > > > We should implement `Rolling.quantile` for increasing pandas API coverage. > pandas docs: > https://pandas.pydata.org/docs/reference/api/pandas.core.window.rolling.Rolling.quantile.html
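Unlike the expanding variant, `Rolling.quantile` evaluates each quantile over a fixed-size trailing window. The fixed-window mechanics can be sketched with the stdlib for the q=0.5 special case (the median); a general-q version would substitute an interpolating quantile function. This is an illustration of the semantics, not the pandas-on-Spark code:

```python
from collections import deque
from statistics import median

def rolling_median(values, window):
    """Median over a fixed trailing window; None until the window fills."""
    buf = deque(maxlen=window)   # deque drops the oldest value automatically
    out = []
    for v in values:
        buf.append(v)
        out.append(median(buf) if len(buf) == window else None)
    return out
```

For example, `rolling_median([3, 1, 4, 1, 5], 3)` yields `[None, None, 3, 1, 4]`.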
[jira] [Commented] (SPARK-40339) Implement `Expanding.quantile`.
[ https://issues.apache.org/jira/browse/SPARK-40339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605070#comment-17605070 ] Apache Spark commented on SPARK-40339: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/37890 > Implement `Expanding.quantile`. > --- > > Key: SPARK-40339 > URL: https://issues.apache.org/jira/browse/SPARK-40339 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Yikun Jiang >Priority: Major > Fix For: 3.4.0 > > > We should implement `Expanding.quantile` for increasing pandas API coverage. > pandas docs: > https://pandas.pydata.org/docs/reference/api/pandas.core.window.expanding.Expanding.quantile.html
[jira] [Commented] (SPARK-40432) Introduce GroupStateImpl and GroupStateTimeout in PySpark
[ https://issues.apache.org/jira/browse/SPARK-40432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605068#comment-17605068 ] Apache Spark commented on SPARK-40432: -- User 'HeartSaVioR' has created a pull request for this issue: https://github.com/apache/spark/pull/37889 > Introduce GroupStateImpl and GroupStateTimeout in PySpark > - > > Key: SPARK-40432 > URL: https://issues.apache.org/jira/browse/SPARK-40432 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Jungtaek Lim >Priority: Major > > Introduces GroupStateImpl and GroupStateTimeout in PySpark, and updates Scala > codebase to support convenient conversion between PySpark implementation and > Scala implementation.
[jira] [Commented] (SPARK-40432) Introduce GroupStateImpl and GroupStateTimeout in PySpark
[ https://issues.apache.org/jira/browse/SPARK-40432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605067#comment-17605067 ] Apache Spark commented on SPARK-40432: -- User 'HeartSaVioR' has created a pull request for this issue: https://github.com/apache/spark/pull/37889 > Introduce GroupStateImpl and GroupStateTimeout in PySpark > - > > Key: SPARK-40432 > URL: https://issues.apache.org/jira/browse/SPARK-40432 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Jungtaek Lim >Priority: Major > > Introduces GroupStateImpl and GroupStateTimeout in PySpark, and updates Scala > codebase to support convenient conversion between PySpark implementation and > Scala implementation.
[jira] [Assigned] (SPARK-40432) Introduce GroupStateImpl and GroupStateTimeout in PySpark
[ https://issues.apache.org/jira/browse/SPARK-40432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40432: Assignee: (was: Apache Spark) > Introduce GroupStateImpl and GroupStateTimeout in PySpark > - > > Key: SPARK-40432 > URL: https://issues.apache.org/jira/browse/SPARK-40432 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Jungtaek Lim >Priority: Major > > Introduces GroupStateImpl and GroupStateTimeout in PySpark, and updates Scala > codebase to support convenient conversion between PySpark implementation and > Scala implementation.
[jira] [Assigned] (SPARK-40432) Introduce GroupStateImpl and GroupStateTimeout in PySpark
[ https://issues.apache.org/jira/browse/SPARK-40432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40432: Assignee: Apache Spark > Introduce GroupStateImpl and GroupStateTimeout in PySpark > - > > Key: SPARK-40432 > URL: https://issues.apache.org/jira/browse/SPARK-40432 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Jungtaek Lim >Assignee: Apache Spark >Priority: Major > > Introduces GroupStateImpl and GroupStateTimeout in PySpark, and updates Scala > codebase to support convenient conversion between PySpark implementation and > Scala implementation.
[jira] [Created] (SPARK-40435) Add test suites for applyInPandasWithState in PySpark
Jungtaek Lim created SPARK-40435: Summary: Add test suites for applyInPandasWithState in PySpark Key: SPARK-40435 URL: https://issues.apache.org/jira/browse/SPARK-40435 Project: Spark Issue Type: Sub-task Components: Structured Streaming Affects Versions: 3.4.0 Reporter: Jungtaek Lim Basically port the test suite from Scala/Java version of API to Python API. Have e2e test suite purely implemented with python.
[jira] [Created] (SPARK-40434) Implement applyInPandasWithState in PySpark
Jungtaek Lim created SPARK-40434: Summary: Implement applyInPandasWithState in PySpark Key: SPARK-40434 URL: https://issues.apache.org/jira/browse/SPARK-40434 Project: Spark Issue Type: Sub-task Components: Structured Streaming Affects Versions: 3.4.0 Reporter: Jungtaek Lim Provide the full implementation of flatMapGroupsWithState equivalent API in PySpark. We could optionally introduce test suites in following JIRA ticket if the PR is too huge.
[jira] [Created] (SPARK-40433) Add toJVMRow in PythonSQLUtils to convert pickled PySpark Row to JVM Row
Jungtaek Lim created SPARK-40433: Summary: Add toJVMRow in PythonSQLUtils to convert pickled PySpark Row to JVM Row Key: SPARK-40433 URL: https://issues.apache.org/jira/browse/SPARK-40433 Project: Spark Issue Type: Sub-task Components: Structured Streaming Affects Versions: 3.4.0 Reporter: Jungtaek Lim Adds toJVMRow in PythonSQLUtils to convert pickled PySpark Row to JVM Row.
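On the Python side, the handoff this helper supports is ordinary pickle serialization: the Row is pickled to bytes, and the JVM-side utility reconstructs a Row from them. A minimal sketch of that round trip with a plain tuple standing in for a PySpark Row (no Spark involved; the variable names are illustrative):

```python
import pickle

# A plain tuple stands in for a PySpark Row here. The bytes below are the
# kind of payload that would cross the Python -> JVM boundary; the JVM-side
# helper reconstructs a Row from them.
row = ("alice", 42)
payload = pickle.dumps(row)       # serialized form sent across the boundary
restored = pickle.loads(payload)  # what a receiving side reconstructs
```

The actual SPARK-40433 change lives in Scala (`PythonSQLUtils`); this only illustrates the pickled-payload half of the contract.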
[jira] [Created] (SPARK-40432) Introduce GroupStateImpl and GroupStateTimeout in PySpark
Jungtaek Lim created SPARK-40432: Summary: Introduce GroupStateImpl and GroupStateTimeout in PySpark Key: SPARK-40432 URL: https://issues.apache.org/jira/browse/SPARK-40432 Project: Spark Issue Type: Sub-task Components: Structured Streaming Affects Versions: 3.4.0 Reporter: Jungtaek Lim Introduces GroupStateImpl and GroupStateTimeout in PySpark, and updates Scala codebase to support convenient conversion between PySpark implementation and Scala implementation.
[jira] [Commented] (SPARK-40431) Introduce "Arbitrary Stateful Processing" in Structured Streaming with Python
[ https://issues.apache.org/jira/browse/SPARK-40431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605063#comment-17605063 ] Jungtaek Lim commented on SPARK-40431: -- This is a joint effort between [~hyukjin.kwon] and me. I'll split the PR into multiple pieces and match each PR to the corresponding subtask. > Introduce "Arbitrary Stateful Processing" in Structured Streaming with Python > - > > Key: SPARK-40431 > URL: https://issues.apache.org/jira/browse/SPARK-40431 > Project: Spark > Issue Type: Umbrella > Components: Structured Streaming >Affects Versions: 3.4.0 >Reporter: Jungtaek Lim >Priority: Major > > This is a part of effort for SPARK-39590, python API parity in Structured > Streaming. > Most of public APIs are available in both Scala/Java Spark and PySpark, but > we have a huge gap on streaming workload in PySpark as we don't have matching > API for flatMapGroupsWithState in PySpark. > This ticket is to track the effort.
[jira] [Created] (SPARK-40431) Introduce "Arbitrary Stateful Processing" in Structured Streaming with Python
Jungtaek Lim created SPARK-40431: Summary: Introduce "Arbitrary Stateful Processing" in Structured Streaming with Python Key: SPARK-40431 URL: https://issues.apache.org/jira/browse/SPARK-40431 Project: Spark Issue Type: Umbrella Components: Structured Streaming Affects Versions: 3.4.0 Reporter: Jungtaek Lim This is a part of effort for SPARK-39590, python API parity in Structured Streaming. Most of public APIs are available in both Scala/Java Spark and PySpark, but we have a huge gap on streaming workload in PySpark as we don't have matching API for flatMapGroupsWithState in PySpark. This ticket is to track the effort.
[jira] [Resolved] (SPARK-40421) Make `spearman` correlation in `DataFrame.corr` support missing values and `min_periods`
[ https://issues.apache.org/jira/browse/SPARK-40421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-40421. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37874 [https://github.com/apache/spark/pull/37874] > Make `spearman` correlation in `DataFrame.corr` support missing values and > `min_periods` > > > Key: SPARK-40421 > URL: https://issues.apache.org/jira/browse/SPARK-40421 > Project: Spark > Issue Type: Sub-task > Components: ps >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Fix For: 3.4.0 > >
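The behavior this ticket targets: Spearman correlation should drop observation pairs where either value is missing, and return NaN when fewer than `min_periods` complete pairs remain. A compact plain-Python reference of those semantics (using `None` for missing values; a simplified sketch, not the distributed pandas-on-Spark implementation):

```python
import math

def _ranks(xs):
    """1-based ranks with averaging over ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1                       # extend over a run of tied values
        avg = (i + j) / 2 + 1            # average rank of the tie run
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y, min_periods=1):
    """Spearman rho; drops pairs with a missing side, NaN below min_periods."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    if len(pairs) < max(min_periods, 2):
        return float("nan")
    rx = _ranks([p[0] for p in pairs])
    ry = _ranks([p[1] for p in pairs])
    n = len(pairs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:               # constant series: rho undefined
        return float("nan")
    return cov / (sx * sy)
```

Spearman is simply Pearson correlation applied to ranks, which is why the body above reduces to a covariance over rank vectors.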
[jira] [Assigned] (SPARK-40421) Make `spearman` correlation in `DataFrame.corr` support missing values and `min_periods`
[ https://issues.apache.org/jira/browse/SPARK-40421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-40421: - Assignee: Ruifeng Zheng > Make `spearman` correlation in `DataFrame.corr` support missing values and > `min_periods` > > > Key: SPARK-40421 > URL: https://issues.apache.org/jira/browse/SPARK-40421 > Project: Spark > Issue Type: Sub-task > Components: ps >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > >
[jira] [Updated] (SPARK-40430) Spark session does not update number of files for partition
[ https://issues.apache.org/jira/browse/SPARK-40430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-40430: - Component/s: SQL (was: Spark Core) > Spark session does not update number of files for partition > --- > > Key: SPARK-40430 > URL: https://issues.apache.org/jira/browse/SPARK-40430 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2 > Environment: I'm using Spark 3.1.2 on AWS EMR and AWS Glue as the catalog. >Reporter: Filipe Souza >Priority: Minor > Attachments: session 1.png, session 2.png > > > When a Spark session has already queried data from a table and partition and > new files are inserted into the partition externally, the Spark session keeps > the outdated number of files and does not return the new records. > If the data is inserted into a new partition, the problem will not occur. > Steps to reproduce the behavior: > Open a Spark session > Query a count on a table > Open another Spark session > Insert data into an existing partition > Check the count again in the first session > I expect to see the inserted records.
[jira] [Resolved] (SPARK-40426) Return a map from SparkThrowable.getMessageParameters
[ https://issues.apache.org/jira/browse/SPARK-40426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-40426. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37871 [https://github.com/apache/spark/pull/37871] > Return a map from SparkThrowable.getMessageParameters > - > > Key: SPARK-40426 > URL: https://issues.apache.org/jira/browse/SPARK-40426 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.4.0 > > > Change the interface SparkThrowable to return a map from > getMessageParameters().
[jira] [Resolved] (SPARK-40342) Implement `Rolling.quantile`.
[ https://issues.apache.org/jira/browse/SPARK-40342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-40342. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37836 [https://github.com/apache/spark/pull/37836] > Implement `Rolling.quantile`. > - > > Key: SPARK-40342 > URL: https://issues.apache.org/jira/browse/SPARK-40342 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Yikun Jiang >Priority: Major > Fix For: 3.4.0 > > > We should implement `Rolling.quantile` to increase pandas API coverage. > pandas docs: > https://pandas.pydata.org/docs/reference/api/pandas.core.window.rolling.Rolling.quantile.html
[jira] [Resolved] (SPARK-40339) Implement `Expanding.quantile`.
[ https://issues.apache.org/jira/browse/SPARK-40339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-40339. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37836 [https://github.com/apache/spark/pull/37836] > Implement `Expanding.quantile`. > --- > > Key: SPARK-40339 > URL: https://issues.apache.org/jira/browse/SPARK-40339 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Yikun Jiang >Priority: Major > Fix For: 3.4.0 > > > We should implement `Expanding.quantile` to increase pandas API coverage. > pandas docs: > https://pandas.pydata.org/docs/reference/api/pandas.core.window.expanding.Expanding.quantile.html
[jira] [Resolved] (SPARK-40345) Implement `ExpandingGroupby.quantile`.
[ https://issues.apache.org/jira/browse/SPARK-40345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-40345. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37836 [https://github.com/apache/spark/pull/37836] > Implement `ExpandingGroupby.quantile`. > -- > > Key: SPARK-40345 > URL: https://issues.apache.org/jira/browse/SPARK-40345 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Yikun Jiang >Priority: Major > Fix For: 3.4.0 > > > We should implement `ExpandingGroupby.quantile` to increase pandas API > coverage.
[jira] [Assigned] (SPARK-40345) Implement `ExpandingGroupby.quantile`.
[ https://issues.apache.org/jira/browse/SPARK-40345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-40345: Assignee: Yikun Jiang > Implement `ExpandingGroupby.quantile`. > -- > > Key: SPARK-40345 > URL: https://issues.apache.org/jira/browse/SPARK-40345 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Yikun Jiang >Priority: Major > > We should implement `ExpandingGroupby.quantile` to increase pandas API > coverage.
[jira] [Assigned] (SPARK-40342) Implement `Rolling.quantile`.
[ https://issues.apache.org/jira/browse/SPARK-40342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-40342: Assignee: Yikun Jiang > Implement `Rolling.quantile`. > - > > Key: SPARK-40342 > URL: https://issues.apache.org/jira/browse/SPARK-40342 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Yikun Jiang >Priority: Major > > We should implement `Rolling.quantile` to increase pandas API coverage. > pandas docs: > https://pandas.pydata.org/docs/reference/api/pandas.core.window.rolling.Rolling.quantile.html
[jira] [Assigned] (SPARK-40348) Implement `RollingGroupby.quantile`.
[ https://issues.apache.org/jira/browse/SPARK-40348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-40348: Assignee: Yikun Jiang > Implement `RollingGroupby.quantile`. > > > Key: SPARK-40348 > URL: https://issues.apache.org/jira/browse/SPARK-40348 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Yikun Jiang >Priority: Major > > We should implement `RollingGroupby.quantile` to increase pandas API > coverage.
[jira] [Resolved] (SPARK-40348) Implement `RollingGroupby.quantile`.
[ https://issues.apache.org/jira/browse/SPARK-40348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-40348. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 37836 [https://github.com/apache/spark/pull/37836] > Implement `RollingGroupby.quantile`. > > > Key: SPARK-40348 > URL: https://issues.apache.org/jira/browse/SPARK-40348 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Yikun Jiang >Priority: Major > Fix For: 3.4.0 > > > We should implement `RollingGroupby.quantile` to increase pandas API > coverage.
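The `Rolling.quantile` family of tickets above all compute a quantile over a sliding window, emitting missing values until the window is full. A minimal plain-Python sketch of that behavior (names are mine; this uses the common 'linear' interpolation definition and is not the Spark implementation):

```python
def quantile(sorted_vals, q):
    # Linear interpolation between the two nearest order statistics.
    n = len(sorted_vals)
    pos = q * (n - 1)
    lo = int(pos)
    hi = min(lo + 1, n - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac

def rolling_quantile(values, window, q):
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)  # pandas emits NaN until the window is full
        else:
            out.append(quantile(sorted(values[i + 1 - window:i + 1]), q))
    return out

print(rolling_quantile([1, 3, 2, 4], 2, 0.5))  # [None, 2.0, 2.5, 3.0]
```

An "expanding" variant differs only in that the window grows from the start of the series instead of sliding at a fixed width.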
[jira] [Resolved] (SPARK-40397) Migrate selenium-java from 3.1 to 4.2 and upgrade org.scalatestplus:selenium to 3.2.13.0
[ https://issues.apache.org/jira/browse/SPARK-40397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta resolved SPARK-40397. Fix Version/s: 3.4.0 Assignee: Yang Jie Resolution: Fixed Issue resolved in https://github.com/apache/spark/pull/37868 > Migrate selenium-java from 3.1 to 4.2 and upgrade org.scalatestplus:selenium > to 3.2.13.0 > > > Key: SPARK-40397 > URL: https://issues.apache.org/jira/browse/SPARK-40397 > Project: Spark > Issue Type: Improvement > Components: Build, Tests >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.4.0 > >
[jira] [Assigned] (SPARK-40334) Implement `GroupBy.prod`.
[ https://issues.apache.org/jira/browse/SPARK-40334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-40334: - Assignee: Artsiom Yudovin (was: Haejoon Lee) > Implement `GroupBy.prod`. > - > > Key: SPARK-40334 > URL: https://issues.apache.org/jira/browse/SPARK-40334 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Artsiom Yudovin >Priority: Major > > We should implement `GroupBy.prod` to increase pandas API coverage. > pandas docs: > https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.prod.html
[jira] [Commented] (SPARK-40196) Consolidate `lit` function with NumPy scalar in sql and pandas module
[ https://issues.apache.org/jira/browse/SPARK-40196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605013#comment-17605013 ] Apache Spark commented on SPARK-40196: -- User 'xinrong-meng' has created a pull request for this issue: https://github.com/apache/spark/pull/37888 > Consolidate `lit` function with NumPy scalar in sql and pandas module > - > > Key: SPARK-40196 > URL: https://issues.apache.org/jira/browse/SPARK-40196 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > Per [https://github.com/apache/spark/pull/37560#discussion_r952882996,] the > `lit` function with a NumPy scalar has different implementations in the sql and > pandas modules; as a result, sql produces a less precise result than pandas. > We should make their results consistent; the more precise, the better.
[jira] [Assigned] (SPARK-40196) Consolidate `lit` function with NumPy scalar in sql and pandas module
[ https://issues.apache.org/jira/browse/SPARK-40196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40196: Assignee: (was: Apache Spark) > Consolidate `lit` function with NumPy scalar in sql and pandas module > - > > Key: SPARK-40196 > URL: https://issues.apache.org/jira/browse/SPARK-40196 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > Per [https://github.com/apache/spark/pull/37560#discussion_r952882996,] the > `lit` function with a NumPy scalar has different implementations in the sql and > pandas modules; as a result, sql produces a less precise result than pandas. > We should make their results consistent; the more precise, the better.
[jira] [Assigned] (SPARK-40196) Consolidate `lit` function with NumPy scalar in sql and pandas module
[ https://issues.apache.org/jira/browse/SPARK-40196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40196: Assignee: Apache Spark > Consolidate `lit` function with NumPy scalar in sql and pandas module > - > > Key: SPARK-40196 > URL: https://issues.apache.org/jira/browse/SPARK-40196 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Apache Spark >Priority: Major > > Per [https://github.com/apache/spark/pull/37560#discussion_r952882996,] the > `lit` function with a NumPy scalar has different implementations in the sql and > pandas modules; as a result, sql produces a less precise result than pandas. > We should make their results consistent; the more precise, the better.
[jira] [Updated] (SPARK-40196) Consolidate `lit` function with NumPy scalar in sql and pandas module
[ https://issues.apache.org/jira/browse/SPARK-40196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-40196: - Description: Per [https://github.com/apache/spark/pull/37560#discussion_r952882996,] the `lit` function with a NumPy scalar has different implementations in the sql and pandas modules; as a result, sql produces a less precise result than pandas. We should make their results consistent; the more precise, the better. was: Per [https://github.com/apache/spark/pull/37560#discussion_r952882996,] function `lit` with NumPy input in sql and pandas module have different implementations, thus, sql has a less precise result than pandas. We shall make their result consistent, the more precise, the better. > Consolidate `lit` function with NumPy scalar in sql and pandas module > - > > Key: SPARK-40196 > URL: https://issues.apache.org/jira/browse/SPARK-40196 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > Per [https://github.com/apache/spark/pull/37560#discussion_r952882996,] the > `lit` function with a NumPy scalar has different implementations in the sql and > pandas modules; as a result, sql produces a less precise result than pandas. > We should make their results consistent; the more precise, the better.
[jira] [Updated] (SPARK-40196) Consolidate `lit` function with NumPy scalar in sql and pandas module
[ https://issues.apache.org/jira/browse/SPARK-40196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-40196: - Summary: Consolidate `lit` function with NumPy scalar in sql and pandas module (was: Consolidate `lit` function with NumPy input in sql and pandas module) > Consolidate `lit` function with NumPy scalar in sql and pandas module > - > > Key: SPARK-40196 > URL: https://issues.apache.org/jira/browse/SPARK-40196 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > Per [https://github.com/apache/spark/pull/37560#discussion_r952882996,] > function `lit` with NumPy input in sql and pandas module have different > implementations, thus, sql has a less precise result than pandas. > We shall make their result consistent, the more precise, the better.
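The kind of precision gap SPARK-40196 describes can be illustrated without Spark or NumPy, assuming the loss comes from routing a value through a narrower floating-point type: round-tripping through IEEE-754 binary32 (what a NumPy float32 scalar stores) loses digits relative to Python's binary64 floats. The helper name below is my own, for illustration only:

```python
import struct

def to_float32(x):
    # Round-trip a Python float through IEEE-754 binary32 to emulate
    # the precision of a 32-bit scalar.
    return struct.unpack('f', struct.pack('f', x))[0]

print(repr(0.1))              # 0.1 (binary64, shortest round-trip repr)
print(repr(to_float32(0.1)))  # 0.10000000149011612 (binary32 is less precise)
```

The consistent, "more precise" choice the ticket asks for corresponds to keeping the wider representation end to end.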
[jira] [Commented] (SPARK-40360) Convert some DDL exception to new error framework
[ https://issues.apache.org/jira/browse/SPARK-40360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605008#comment-17605008 ] Apache Spark commented on SPARK-40360: -- User 'srielau' has created a pull request for this issue: https://github.com/apache/spark/pull/37887 > Convert some DDL exception to new error framework > - > > Key: SPARK-40360 > URL: https://issues.apache.org/jira/browse/SPARK-40360 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Serge Rielau >Assignee: Apache Spark >Priority: Major > > Tackling the following files: > sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/AlreadyExistException.scala > sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/NoSuchItemException.scala > sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CannotReplaceMissingTableException.scala > sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/NonEmptyException.scala > sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/unresolved.scala > Here is the doc with proposed text: > https://docs.google.com/document/d/1TpFx3AwcJZd3l7zB1ZDchvZ8j2dY6_uf5LHfW2gjE4A/edit?usp=sharing
[jira] [Updated] (SPARK-40429) Only set KeyGroupedPartitioning when the referenced column is in the output
[ https://issues.apache.org/jira/browse/SPARK-40429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Huaxin Gao updated SPARK-40429: --- Description: {code:java} sql(s"CREATE TABLE $tbl (id bigint, data string) PARTITIONED BY (id)") sql(s"INSERT INTO $tbl VALUES (1, 'a'), (2, 'b'), (3, 'c')") checkAnswer( spark.table(tbl).select("index", "_partition"), Seq(Row(0, "3"), Row(0, "2"), Row(0, "1")) ) {code} failed with ScalaTestFailureLocation: org.apache.spark.sql.QueryTest at (QueryTest.scala:226) org.scalatest.exceptions.TestFailedException: AttributeSet(id#994L) was not empty The optimized logical plan has missing inputs: RelationV2[index#998, _partition#999] testcat.t > Only set KeyGroupedPartitioning when the referenced column is in the output > --- > > Key: SPARK-40429 > URL: https://issues.apache.org/jira/browse/SPARK-40429 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0, 3.4.0 >Reporter: Huaxin Gao >Priority: Minor > > {code:java} > sql(s"CREATE TABLE $tbl (id bigint, data string) PARTITIONED BY (id)") > sql(s"INSERT INTO $tbl VALUES (1, 'a'), (2, 'b'), (3, 'c')") > checkAnswer( > spark.table(tbl).select("index", "_partition"), > Seq(Row(0, "3"), Row(0, "2"), Row(0, "1")) > ) > {code} > failed with > ScalaTestFailureLocation: org.apache.spark.sql.QueryTest at > (QueryTest.scala:226) > org.scalatest.exceptions.TestFailedException: AttributeSet(id#994L) was not > empty The optimized logical plan has missing inputs: > RelationV2[index#998, _partition#999] testcat.t
[jira] [Updated] (SPARK-40430) Spark session does not update number of files for partition
[ https://issues.apache.org/jira/browse/SPARK-40430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Filipe Souza updated SPARK-40430: - Attachment: session 2.png session 1.png > Spark session does not update number of files for partition > --- > > Key: SPARK-40430 > URL: https://issues.apache.org/jira/browse/SPARK-40430 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.2 > Environment: I'm using Spark 3.1.2 on AWS EMR and AWS Glue as the catalog. >Reporter: Filipe Souza >Priority: Minor > Attachments: session 1.png, session 2.png > > > When a Spark session has already queried data from a table and partition and > new files are inserted into the partition externally, the Spark session keeps > the outdated number of files and does not return the new records. > If the data is inserted into a new partition, the problem will not occur. > Steps to reproduce the behavior: > Open a Spark session > Query a count on a table > Open another Spark session > Insert data into an existing partition > Check the count again in the first session > I expect to see the inserted records.
[jira] [Created] (SPARK-40430) Spark session does not update number of files for partition
Filipe Souza created SPARK-40430: Summary: Spark session does not update number of files for partition Key: SPARK-40430 URL: https://issues.apache.org/jira/browse/SPARK-40430 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.1.2 Environment: I'm using Spark 3.1.2 on AWS EMR and AWS Glue as the catalog. Reporter: Filipe Souza When a Spark session has already queried data from a table and partition and new files are inserted into the partition externally, the Spark session keeps the outdated number of files and does not return the new records. If the data is inserted into a new partition, the problem will not occur. Steps to reproduce the behavior: open a Spark session and query a count on a table; open another Spark session and insert data into an existing partition; check the count again in the first session. I expect to see the inserted records.
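The repro steps above can be sketched as two `spark-sql` sessions; the table name and values are illustrative, not taken from the report. Note that Spark does provide `REFRESH TABLE` to invalidate the cached file listing, which is a known workaround rather than a fix for the reported behavior:

```sql
-- Session 1
SELECT count(*) FROM events WHERE dt = '2022-09-01';  -- caches the file listing for this partition

-- Session 2 (or an external writer)
INSERT INTO events PARTITION (dt = '2022-09-01') VALUES ('new-row');

-- Session 1 again
SELECT count(*) FROM events WHERE dt = '2022-09-01';  -- reported: stale count
REFRESH TABLE events;                                 -- workaround: drop cached metadata
SELECT count(*) FROM events WHERE dt = '2022-09-01';  -- now sees the new file
```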
[jira] [Assigned] (SPARK-40429) Only set KeyGroupedPartitioning when the referenced column is in the output
[ https://issues.apache.org/jira/browse/SPARK-40429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40429: Assignee: (was: Apache Spark) > Only set KeyGroupedPartitioning when the referenced column is in the output > --- > > Key: SPARK-40429 > URL: https://issues.apache.org/jira/browse/SPARK-40429 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0, 3.4.0 >Reporter: Huaxin Gao >Priority: Minor >
[jira] [Commented] (SPARK-40429) Only set KeyGroupedPartitioning when the referenced column is in the output
[ https://issues.apache.org/jira/browse/SPARK-40429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17604961#comment-17604961 ] Apache Spark commented on SPARK-40429: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/37886 > Only set KeyGroupedPartitioning when the referenced column is in the output > --- > > Key: SPARK-40429 > URL: https://issues.apache.org/jira/browse/SPARK-40429 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0, 3.4.0 >Reporter: Huaxin Gao >Priority: Minor >
[jira] [Assigned] (SPARK-40429) Only set KeyGroupedPartitioning when the referenced column is in the output
[ https://issues.apache.org/jira/browse/SPARK-40429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40429: Assignee: Apache Spark > Only set KeyGroupedPartitioning when the referenced column is in the output > --- > > Key: SPARK-40429 > URL: https://issues.apache.org/jira/browse/SPARK-40429 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0, 3.4.0 >Reporter: Huaxin Gao >Assignee: Apache Spark >Priority: Minor >
[jira] [Created] (SPARK-40429) Only set KeyGroupedPartitioning when the referenced column is in the output
Huaxin Gao created SPARK-40429: -- Summary: Only set KeyGroupedPartitioning when the referenced column is in the output Key: SPARK-40429 URL: https://issues.apache.org/jira/browse/SPARK-40429 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.3.0, 3.4.0 Reporter: Huaxin Gao
[jira] [Commented] (SPARK-40428) Add a shutdownhook to CoarseGrained scheduler to avoid dangling resources during abnormal shutdown
[ https://issues.apache.org/jira/browse/SPARK-40428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17604917#comment-17604917 ] Apache Spark commented on SPARK-40428: -- User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/37885 > Add a shutdownhook to CoarseGrained scheduler to avoid dangling resources > during abnormal shutdown > -- > > Key: SPARK-40428 > URL: https://issues.apache.org/jira/browse/SPARK-40428 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Spark Core >Affects Versions: 3.4.0 >Reporter: Holden Karau >Priority: Minor > > Add a shutdown hook in the CoarseGrainedSchedulerBackend to call stop, > since we've got zombie pods hanging around because the resource tie isn't > perfect.
[jira] [Assigned] (SPARK-40428) Add a shutdownhook to CoarseGrained scheduler to avoid dangling resources during abnormal shutdown
[ https://issues.apache.org/jira/browse/SPARK-40428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40428: Assignee: Apache Spark > Add a shutdownhook to CoarseGrained scheduler to avoid dangling resources > during abnormal shutdown > -- > > Key: SPARK-40428 > URL: https://issues.apache.org/jira/browse/SPARK-40428 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Spark Core >Affects Versions: 3.4.0 >Reporter: Holden Karau >Assignee: Apache Spark >Priority: Minor > > Add a shutdown hook in the CoarseGrainedSchedulerBackend to call stop, > since we've got zombie pods hanging around because the resource tie isn't > perfect.
[jira] [Assigned] (SPARK-40428) Add a shutdownhook to CoarseGrained scheduler to avoid dangling resources during abnormal shutdown
[ https://issues.apache.org/jira/browse/SPARK-40428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40428: Assignee: (was: Apache Spark) > Add a shutdownhook to CoarseGrained scheduler to avoid dangling resources > during abnormal shutdown > -- > > Key: SPARK-40428 > URL: https://issues.apache.org/jira/browse/SPARK-40428 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Spark Core >Affects Versions: 3.4.0 >Reporter: Holden Karau >Priority: Minor > > Add a shutdown hook in the CoarseGrainedSchedulerBackend to call stop, > since we've got zombie pods hanging around because the resource tie isn't > perfect.
[jira] [Created] (SPARK-40428) Add a shutdownhook to CoarseGrained scheduler to avoid dangling resources during abnormal shutdown
Holden Karau created SPARK-40428: Summary: Add a shutdownhook to CoarseGrained scheduler to avoid dangling resources during abnormal shutdown Key: SPARK-40428 URL: https://issues.apache.org/jira/browse/SPARK-40428 Project: Spark Issue Type: Improvement Components: Kubernetes, Spark Core Affects Versions: 3.4.0 Reporter: Holden Karau Add a shutdown hook in the CoarseGrainedSchedulerBackend to call stop, since we've got zombie pods hanging around because the resource tie isn't perfect.
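The shutdown-hook idea in SPARK-40428 can be sketched in a few lines of Python (the real change is in Scala's CoarseGrainedSchedulerBackend; the class below is a toy stand-in of my own). The two points that matter are registering cleanup to run at process exit and making it idempotent, since the hook may fire after an explicit stop:

```python
import atexit

class Backend:
    """Toy stand-in for a scheduler backend whose stop() must run even on abnormal shutdown."""

    def __init__(self):
        self.stopped = False
        self.stop_calls = 0
        # The "shutdown hook": run stop() at interpreter exit as well.
        atexit.register(self.stop)

    def stop(self):
        # Idempotent: a second call (e.g. the exit hook after an explicit stop) is a no-op.
        if self.stopped:
            return
        self.stopped = True
        self.stop_calls += 1
        # ...release executors / delete pods here...

backend = Backend()
backend.stop()   # explicit shutdown
backend.stop()   # later hook firing is a no-op
print(backend.stop_calls)  # 1
```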
[jira] [Assigned] (SPARK-40427) Add error classes for LIMIT/OFFSET CheckAnalysis failures
[ https://issues.apache.org/jira/browse/SPARK-40427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40427: Assignee: Apache Spark > Add error classes for LIMIT/OFFSET CheckAnalysis failures > - > > Key: SPARK-40427 > URL: https://issues.apache.org/jira/browse/SPARK-40427 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Daniel >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40427) Add error classes for LIMIT/OFFSET CheckAnalysis failures
[ https://issues.apache.org/jira/browse/SPARK-40427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40427: Assignee: (was: Apache Spark) > Add error classes for LIMIT/OFFSET CheckAnalysis failures > - > > Key: SPARK-40427 > URL: https://issues.apache.org/jira/browse/SPARK-40427 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Daniel >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40427) Add error classes for LIMIT/OFFSET CheckAnalysis failures
[ https://issues.apache.org/jira/browse/SPARK-40427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17604883#comment-17604883 ] Apache Spark commented on SPARK-40427: -- User 'dtenedor' has created a pull request for this issue: https://github.com/apache/spark/pull/37884 > Add error classes for LIMIT/OFFSET CheckAnalysis failures > - > > Key: SPARK-40427 > URL: https://issues.apache.org/jira/browse/SPARK-40427 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Daniel >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40427) Add error classes for LIMIT/OFFSET CheckAnalysis failures
Daniel created SPARK-40427: -- Summary: Add error classes for LIMIT/OFFSET CheckAnalysis failures Key: SPARK-40427 URL: https://issues.apache.org/jira/browse/SPARK-40427 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: Daniel -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38017) Fix the API doc for window to say it supports TimestampNTZType too as timeColumn
[ https://issues.apache.org/jira/browse/SPARK-38017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17604859#comment-17604859 ] Apache Spark commented on SPARK-38017: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/37883 > Fix the API doc for window to say it supports TimestampNTZType too as > timeColumn > > > Key: SPARK-38017 > URL: https://issues.apache.org/jira/browse/SPARK-38017 > Project: Spark > Issue Type: Bug > Components: Documentation, SQL >Affects Versions: 3.2.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > Fix For: 3.3.0, 3.2.2 > > > The window function supports not only TimestampType but also TimestampNTZType, but > the API docs don't mention TimestampNTZType. > This issue is similar to SPARK-38016, but this issue affects 3.2.0 too, so I > separated the tickets. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38017) Fix the API doc for window to say it supports TimestampNTZType too as timeColumn
[ https://issues.apache.org/jira/browse/SPARK-38017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17604858#comment-17604858 ] Apache Spark commented on SPARK-38017: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/37882 > Fix the API doc for window to say it supports TimestampNTZType too as > timeColumn > > > Key: SPARK-38017 > URL: https://issues.apache.org/jira/browse/SPARK-38017 > Project: Spark > Issue Type: Bug > Components: Documentation, SQL >Affects Versions: 3.2.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > Fix For: 3.3.0, 3.2.2 > > > The window function supports not only TimestampType but also TimestampNTZType, but > the API docs don't mention TimestampNTZType. > This issue is similar to SPARK-38016, but this issue affects 3.2.0 too, so I > separated the tickets. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38017) Fix the API doc for window to say it supports TimestampNTZType too as timeColumn
[ https://issues.apache.org/jira/browse/SPARK-38017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17604857#comment-17604857 ] Apache Spark commented on SPARK-38017: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/37882 > Fix the API doc for window to say it supports TimestampNTZType too as > timeColumn > > > Key: SPARK-38017 > URL: https://issues.apache.org/jira/browse/SPARK-38017 > Project: Spark > Issue Type: Bug > Components: Documentation, SQL >Affects Versions: 3.2.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > Fix For: 3.3.0, 3.2.2 > > > The window function supports not only TimestampType but also TimestampNTZType, but > the API docs don't mention TimestampNTZType. > This issue is similar to SPARK-38016, but this issue affects 3.2.0 too, so I > separated the tickets. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40169) Fix the issue with Parquet column index and predicate pushdown in Data source V1
[ https://issues.apache.org/jira/browse/SPARK-40169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40169: Assignee: Apache Spark > Fix the issue with Parquet column index and predicate pushdown in Data source > V1 > > > Key: SPARK-40169 > URL: https://issues.apache.org/jira/browse/SPARK-40169 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0, 3.3.1, 3.2.3 >Reporter: Ivan Sadikov >Assignee: Apache Spark >Priority: Major > > This is a follow-up to SPARK-39833. In > https://github.com/apache/spark/pull/37419, we disabled column index for > Parquet due to correctness issues that we found when filtering data on the > partition column overlapping with data schema. > > This ticket is for a permanent and thorough fix for the issue and re-enablement > of the column index. See more details in the PR linked above. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40169) Fix the issue with Parquet column index and predicate pushdown in Data source V1
[ https://issues.apache.org/jira/browse/SPARK-40169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17604855#comment-17604855 ] Apache Spark commented on SPARK-40169: -- User 'sunchao' has created a pull request for this issue: https://github.com/apache/spark/pull/37881 > Fix the issue with Parquet column index and predicate pushdown in Data source > V1 > > > Key: SPARK-40169 > URL: https://issues.apache.org/jira/browse/SPARK-40169 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0, 3.3.1, 3.2.3 >Reporter: Ivan Sadikov >Priority: Major > > This is a follow-up to SPARK-39833. In > https://github.com/apache/spark/pull/37419, we disabled column index for > Parquet due to correctness issues that we found when filtering data on the > partition column overlapping with data schema. > > This ticket is for a permanent and thorough fix for the issue and re-enablement > of the column index. See more details in the PR linked above. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40169) Fix the issue with Parquet column index and predicate pushdown in Data source V1
[ https://issues.apache.org/jira/browse/SPARK-40169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40169: Assignee: (was: Apache Spark) > Fix the issue with Parquet column index and predicate pushdown in Data source > V1 > > > Key: SPARK-40169 > URL: https://issues.apache.org/jira/browse/SPARK-40169 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0, 3.3.1, 3.2.3 >Reporter: Ivan Sadikov >Priority: Major > > This is a follow-up to SPARK-39833. In > https://github.com/apache/spark/pull/37419, we disabled column index for > Parquet due to correctness issues that we found when filtering data on the > partition column overlapping with data schema. > > This ticket is for a permanent and thorough fix for the issue and re-enablement > of the column index. See more details in the PR linked above. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40334) Implement `GroupBy.prod`.
[ https://issues.apache.org/jira/browse/SPARK-40334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17604841#comment-17604841 ] Artsiom Yudovin commented on SPARK-40334: - Got you, thank you so much! > Implement `GroupBy.prod`. > - > > Key: SPARK-40334 > URL: https://issues.apache.org/jira/browse/SPARK-40334 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > > We should implement `GroupBy.prod` to increase pandas API coverage. > pandas docs: > https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.prod.html -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
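[Editorial note] For context on SPARK-40334 above: pandas' `GroupBy.prod` computes the product of the values within each group. The expected semantics can be sketched outside pandas; the Java below is only an illustration of that behavior, not the pandas-on-Spark implementation being requested:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustration of GroupBy.prod semantics: group rows by key, then multiply
// the values within each group (identity 1.0 for empty groups).
class GroupByProdSketch {
    /** One (key, value) pair of the input "column". Hypothetical helper type. */
    record Row(String key, double value) {}

    /** Product of `value` within each `key` group, like pandas GroupBy.prod. */
    static Map<String, Double> prodByKey(List<Row> rows) {
        return rows.stream().collect(Collectors.groupingBy(
                Row::key,
                Collectors.reducing(1.0, Row::value, (a, b) -> a * b)));
    }
}
```

For example, grouping `[("a", 2.0), ("a", 3.0), ("b", 5.0)]` by key yields `{a=6.0, b=5.0}`.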
[jira] [Updated] (SPARK-40423) Add explicit YuniKorn queue submission test coverage
[ https://issues.apache.org/jira/browse/SPARK-40423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-40423: -- Fix Version/s: 3.3.2 (was: 3.3.1) > Add explicit YuniKorn queue submission test coverage > > > Key: SPARK-40423 > URL: https://issues.apache.org/jira/browse/SPARK-40423 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes, Tests >Affects Versions: 3.3.1 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 3.4.0, 3.3.2 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40423) Add explicit YuniKorn queue submission test coverage
[ https://issues.apache.org/jira/browse/SPARK-40423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-40423. --- Fix Version/s: 3.3.1 3.4.0 Resolution: Fixed Issue resolved by pull request 37877 [https://github.com/apache/spark/pull/37877] > Add explicit YuniKorn queue submission test coverage > > > Key: SPARK-40423 > URL: https://issues.apache.org/jira/browse/SPARK-40423 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes, Tests >Affects Versions: 3.3.1 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 3.3.1, 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40426) Return a map from SparkThrowable.getMessageParameters
[ https://issues.apache.org/jira/browse/SPARK-40426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40426: Assignee: Apache Spark (was: Max Gekk) > Return a map from SparkThrowable.getMessageParameters > - > > Key: SPARK-40426 > URL: https://issues.apache.org/jira/browse/SPARK-40426 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > > Change the interface SparkThrowable to return a map from > getMessageParameters(). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40426) Return a map from SparkThrowable.getMessageParameters
[ https://issues.apache.org/jira/browse/SPARK-40426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40426: Assignee: Max Gekk (was: Apache Spark) > Return a map from SparkThrowable.getMessageParameters > - > > Key: SPARK-40426 > URL: https://issues.apache.org/jira/browse/SPARK-40426 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > Change the interface SparkThrowable to return a map from > getMessageParameters(). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40426) Return a map from SparkThrowable.getMessageParameters
[ https://issues.apache.org/jira/browse/SPARK-40426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17604818#comment-17604818 ] Apache Spark commented on SPARK-40426: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/37871 > Return a map from SparkThrowable.getMessageParameters > - > > Key: SPARK-40426 > URL: https://issues.apache.org/jira/browse/SPARK-40426 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > Change the interface SparkThrowable to return a map from > getMessageParameters(). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40426) Return a map from SparkThrowable.getMessageParameters
[ https://issues.apache.org/jira/browse/SPARK-40426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17604817#comment-17604817 ] Apache Spark commented on SPARK-40426: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/37871 > Return a map from SparkThrowable.getMessageParameters > - > > Key: SPARK-40426 > URL: https://issues.apache.org/jira/browse/SPARK-40426 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > Change the interface SparkThrowable to return a map from > getMessageParameters(). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40426) Return a map from SparkThrowable.getMessageParameters
Max Gekk created SPARK-40426: Summary: Return a map from SparkThrowable.getMessageParameters Key: SPARK-40426 URL: https://issues.apache.org/jira/browse/SPARK-40426 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: Max Gekk Assignee: Max Gekk Change the interface SparkThrowable to return a map from getMessageParameters(). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
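[Editorial note] The interface change proposed in SPARK-40426 above — having `getMessageParameters()` return a map so error-message parameters are addressed by name rather than by position — might look roughly like the following. This is a sketch under assumed names (`SketchThrowable`, `SketchException`), not Spark's actual `SparkThrowable` source:

```java
import java.util.Map;

// Sketch of the proposed shape: the throwable exposes its error class plus a
// name -> value map of message parameters, instead of a positional array.
interface SketchThrowable {
    String getErrorClass();
    Map<String, String> getMessageParameters();
}

class SketchException extends RuntimeException implements SketchThrowable {
    private final String errorClass;
    private final Map<String, String> params;

    SketchException(String errorClass, Map<String, String> params) {
        super(errorClass + " " + params);
        this.errorClass = errorClass;
        this.params = Map.copyOf(params); // immutable defensive copy
    }

    @Override public String getErrorClass() { return errorClass; }
    @Override public Map<String, String> getMessageParameters() { return params; }
}
```

Named parameters let error-message templates reference `{value}` or `{type}` directly, so reordering placeholders in a template does not silently break callers the way positional arrays can.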
[jira] [Assigned] (SPARK-39399) proxy-user not working for Spark on k8s in cluster deploy mode
[ https://issues.apache.org/jira/browse/SPARK-39399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39399: Assignee: (was: Apache Spark) > proxy-user not working for Spark on k8s in cluster deploy mode > -- > > Key: SPARK-39399 > URL: https://issues.apache.org/jira/browse/SPARK-39399 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Spark Core >Affects Versions: 3.2.0 >Reporter: Shrikant Prasad >Priority: Major > > As part of https://issues.apache.org/jira/browse/SPARK-25355 Proxy user > support was added for Spark on K8s. But the PR only added proxy user argument > on the spark-submit command. The actual functionality of authentication using > the proxy user is not working in case of cluster deploy mode. > We get AccessControlException when trying to access the kerberized HDFS > through a proxy user. > Spark-Submit: > $SPARK_HOME/bin/spark-submit \ > --master \ > --deploy-mode cluster \ > --name with_proxy_user_di \ > --proxy-user \ > --class org.apache.spark.examples.SparkPi \ > --conf spark.kubernetes.container.image= \ > --conf spark.kubernetes.driver.limit.cores=1 \ > --conf spark.executor.instances=1 \ > --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \ > --conf spark.kubernetes.namespace= \ > --conf spark.kubernetes.kerberos.krb5.path=/etc/krb5.conf \ > --conf spark.eventLog.enabled=true \ > --conf spark.eventLog.dir=hdfs:///scaas/shs_logs \ > --conf spark.kubernetes.file.upload.path=hdfs:///tmp \ > --conf spark.kubernetes.container.image.pullPolicy=Always \ > $SPARK_HOME/examples/jars/spark-examples_2.12-3.2.0-1.jar > Driver Logs: > {code:java} > ++ id -u > + myuid=185 > ++ id -g > + mygid=0 > + set +e > ++ getent passwd 185 > + uidentry= > + set -e > + '[' -z '' ']' > + '[' -w /etc/passwd ']' > + echo '185:x:185:0:anonymous uid:/opt/spark:/bin/false' > + SPARK_CLASSPATH=':/opt/spark/jars/*' > + env > + grep SPARK_JAVA_OPT_ > + sort -t_ -k4 -n > + sed 's/[^=]*=\(.*\)/\1/g' > + 
readarray -t SPARK_EXECUTOR_JAVA_OPTS > + '[' -n '' ']' > + '[' -z ']' > + '[' -z ']' > + '[' -n '' ']' > + '[' -z x ']' > + SPARK_CLASSPATH='/opt/hadoop/conf::/opt/spark/jars/*' > + '[' -z x ']' > + SPARK_CLASSPATH='/opt/spark/conf:/opt/hadoop/conf::/opt/spark/jars/*' > + case "$1" in > + shift 1 > + CMD=("$SPARK_HOME/bin/spark-submit" --conf > "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client > "$@") > + exec /usr/bin/tini -s -- /opt/spark/bin/spark-submit --conf > spark.driver.bindAddress= --deploy-mode client --proxy-user proxy_user > --properties-file /opt/spark/conf/spark.properties --class > org.apache.spark.examples.SparkPi spark-internal > WARNING: An illegal reflective access operation has occurred > WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform > (file:/opt/spark/jars/spark-unsafe_2.12-3.2.0-1.jar) to constructor > java.nio.DirectByteBuffer(long,int) > WARNING: Please consider reporting this to the maintainers of > org.apache.spark.unsafe.Platform > WARNING: Use --illegal-access=warn to enable warnings of further illegal > reflective access operations > WARNING: All illegal access operations will be denied in a future release > 22/04/26 08:54:38 DEBUG MutableMetricsFactory: field > org.apache.hadoop.metrics2.lib.MutableRate > org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginSuccess with > annotation @org.apache.hadoop.metrics2.annotation.Metric(about="", > sampleName="Ops", always=false, type=DEFAULT, value={"Rate of successful > kerberos logins and latency (milliseconds)"}, valueName="Time") > 22/04/26 08:54:38 DEBUG MutableMetricsFactory: field > org.apache.hadoop.metrics2.lib.MutableRate > org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginFailure with > annotation @org.apache.hadoop.metrics2.annotation.Metric(about="", > sampleName="Ops", always=false, type=DEFAULT, value={"Rate of failed kerberos > logins and latency (milliseconds)"}, valueName="Time") > 22/04/26 08:54:38 
DEBUG MutableMetricsFactory: field > org.apache.hadoop.metrics2.lib.MutableRate > org.apache.hadoop.security.UserGroupInformation$UgiMetrics.getGroups with > annotation @org.apache.hadoop.metrics2.annotation.Metric(about="", > sampleName="Ops", always=false, type=DEFAULT, value={"GetGroups"}, > valueName="Time") > 22/04/26 08:54:38 DEBUG MutableMetricsFactory: field private > org.apache.hadoop.metrics2.lib.MutableGaugeLong > org.apache.hadoop.security.UserGroupInformation$UgiMetrics.renewalFailuresTotal > with annotation @org.apache.hadoop.metrics2.annotation.Metric(about="", > sampleName="Ops", always=false, type=DEFAULT, value={"Renewal failures since > startup"}, valueName="Time") > 22/04/26 08:54:38 DEBUG MutableMetricsFactory: field private
[jira] [Assigned] (SPARK-39399) proxy-user not working for Spark on k8s in cluster deploy mode
[ https://issues.apache.org/jira/browse/SPARK-39399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39399: Assignee: Apache Spark > proxy-user not working for Spark on k8s in cluster deploy mode > -- > > Key: SPARK-39399 > URL: https://issues.apache.org/jira/browse/SPARK-39399 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Spark Core >Affects Versions: 3.2.0 >Reporter: Shrikant Prasad >Assignee: Apache Spark >Priority: Major > > As part of https://issues.apache.org/jira/browse/SPARK-25355 Proxy user > support was added for Spark on K8s. But the PR only added proxy user argument > on the spark-submit command. The actual functionality of authentication using > the proxy user is not working in case of cluster deploy mode. > We get AccessControlException when trying to access the kerberized HDFS > through a proxy user. > Spark-Submit: > $SPARK_HOME/bin/spark-submit \ > --master \ > --deploy-mode cluster \ > --name with_proxy_user_di \ > --proxy-user \ > --class org.apache.spark.examples.SparkPi \ > --conf spark.kubernetes.container.image= \ > --conf spark.kubernetes.driver.limit.cores=1 \ > --conf spark.executor.instances=1 \ > --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \ > --conf spark.kubernetes.namespace= \ > --conf spark.kubernetes.kerberos.krb5.path=/etc/krb5.conf \ > --conf spark.eventLog.enabled=true \ > --conf spark.eventLog.dir=hdfs:///scaas/shs_logs \ > --conf spark.kubernetes.file.upload.path=hdfs:///tmp \ > --conf spark.kubernetes.container.image.pullPolicy=Always \ > $SPARK_HOME/examples/jars/spark-examples_2.12-3.2.0-1.jar > Driver Logs: > {code:java} > ++ id -u > + myuid=185 > ++ id -g > + mygid=0 > + set +e > ++ getent passwd 185 > + uidentry= > + set -e > + '[' -z '' ']' > + '[' -w /etc/passwd ']' > + echo '185:x:185:0:anonymous uid:/opt/spark:/bin/false' > + SPARK_CLASSPATH=':/opt/spark/jars/*' > + env > + grep SPARK_JAVA_OPT_ > + sort -t_ -k4 -n > + sed 
's/[^=]*=\(.*\)/\1/g' > + readarray -t SPARK_EXECUTOR_JAVA_OPTS > + '[' -n '' ']' > + '[' -z ']' > + '[' -z ']' > + '[' -n '' ']' > + '[' -z x ']' > + SPARK_CLASSPATH='/opt/hadoop/conf::/opt/spark/jars/*' > + '[' -z x ']' > + SPARK_CLASSPATH='/opt/spark/conf:/opt/hadoop/conf::/opt/spark/jars/*' > + case "$1" in > + shift 1 > + CMD=("$SPARK_HOME/bin/spark-submit" --conf > "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client > "$@") > + exec /usr/bin/tini -s -- /opt/spark/bin/spark-submit --conf > spark.driver.bindAddress= --deploy-mode client --proxy-user proxy_user > --properties-file /opt/spark/conf/spark.properties --class > org.apache.spark.examples.SparkPi spark-internal > WARNING: An illegal reflective access operation has occurred > WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform > (file:/opt/spark/jars/spark-unsafe_2.12-3.2.0-1.jar) to constructor > java.nio.DirectByteBuffer(long,int) > WARNING: Please consider reporting this to the maintainers of > org.apache.spark.unsafe.Platform > WARNING: Use --illegal-access=warn to enable warnings of further illegal > reflective access operations > WARNING: All illegal access operations will be denied in a future release > 22/04/26 08:54:38 DEBUG MutableMetricsFactory: field > org.apache.hadoop.metrics2.lib.MutableRate > org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginSuccess with > annotation @org.apache.hadoop.metrics2.annotation.Metric(about="", > sampleName="Ops", always=false, type=DEFAULT, value={"Rate of successful > kerberos logins and latency (milliseconds)"}, valueName="Time") > 22/04/26 08:54:38 DEBUG MutableMetricsFactory: field > org.apache.hadoop.metrics2.lib.MutableRate > org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginFailure with > annotation @org.apache.hadoop.metrics2.annotation.Metric(about="", > sampleName="Ops", always=false, type=DEFAULT, value={"Rate of failed kerberos > logins and latency (milliseconds)"}, 
valueName="Time") > 22/04/26 08:54:38 DEBUG MutableMetricsFactory: field > org.apache.hadoop.metrics2.lib.MutableRate > org.apache.hadoop.security.UserGroupInformation$UgiMetrics.getGroups with > annotation @org.apache.hadoop.metrics2.annotation.Metric(about="", > sampleName="Ops", always=false, type=DEFAULT, value={"GetGroups"}, > valueName="Time") > 22/04/26 08:54:38 DEBUG MutableMetricsFactory: field private > org.apache.hadoop.metrics2.lib.MutableGaugeLong > org.apache.hadoop.security.UserGroupInformation$UgiMetrics.renewalFailuresTotal > with annotation @org.apache.hadoop.metrics2.annotation.Metric(about="", > sampleName="Ops", always=false, type=DEFAULT, value={"Renewal failures since > startup"}, valueName="Time") > 22/04/26 08:54:38 DEBUG
[jira] [Commented] (SPARK-39399) proxy-user not working for Spark on k8s in cluster deploy mode
[ https://issues.apache.org/jira/browse/SPARK-39399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17604810#comment-17604810 ] Apache Spark commented on SPARK-39399: -- User 'shrprasa' has created a pull request for this issue: https://github.com/apache/spark/pull/37880 > proxy-user not working for Spark on k8s in cluster deploy mode > -- > > Key: SPARK-39399 > URL: https://issues.apache.org/jira/browse/SPARK-39399 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Spark Core >Affects Versions: 3.2.0 >Reporter: Shrikant Prasad >Priority: Major > > As part of https://issues.apache.org/jira/browse/SPARK-25355 Proxy user > support was added for Spark on K8s. But the PR only added proxy user argument > on the spark-submit command. The actual functionality of authentication using > the proxy user is not working in case of cluster deploy mode. > We get AccessControlException when trying to access the kerberized HDFS > through a proxy user. > Spark-Submit: > $SPARK_HOME/bin/spark-submit \ > --master \ > --deploy-mode cluster \ > --name with_proxy_user_di \ > --proxy-user \ > --class org.apache.spark.examples.SparkPi \ > --conf spark.kubernetes.container.image= \ > --conf spark.kubernetes.driver.limit.cores=1 \ > --conf spark.executor.instances=1 \ > --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \ > --conf spark.kubernetes.namespace= \ > --conf spark.kubernetes.kerberos.krb5.path=/etc/krb5.conf \ > --conf spark.eventLog.enabled=true \ > --conf spark.eventLog.dir=hdfs:///scaas/shs_logs \ > --conf spark.kubernetes.file.upload.path=hdfs:///tmp \ > --conf spark.kubernetes.container.image.pullPolicy=Always \ > $SPARK_HOME/examples/jars/spark-examples_2.12-3.2.0-1.jar > Driver Logs: > {code:java} > ++ id -u > + myuid=185 > ++ id -g > + mygid=0 > + set +e > ++ getent passwd 185 > + uidentry= > + set -e > + '[' -z '' ']' > + '[' -w /etc/passwd ']' > + echo '185:x:185:0:anonymous uid:/opt/spark:/bin/false' > + 
SPARK_CLASSPATH=':/opt/spark/jars/*' > + env > + grep SPARK_JAVA_OPT_ > + sort -t_ -k4 -n > + sed 's/[^=]*=\(.*\)/\1/g' > + readarray -t SPARK_EXECUTOR_JAVA_OPTS > + '[' -n '' ']' > + '[' -z ']' > + '[' -z ']' > + '[' -n '' ']' > + '[' -z x ']' > + SPARK_CLASSPATH='/opt/hadoop/conf::/opt/spark/jars/*' > + '[' -z x ']' > + SPARK_CLASSPATH='/opt/spark/conf:/opt/hadoop/conf::/opt/spark/jars/*' > + case "$1" in > + shift 1 > + CMD=("$SPARK_HOME/bin/spark-submit" --conf > "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client > "$@") > + exec /usr/bin/tini -s -- /opt/spark/bin/spark-submit --conf > spark.driver.bindAddress= --deploy-mode client --proxy-user proxy_user > --properties-file /opt/spark/conf/spark.properties --class > org.apache.spark.examples.SparkPi spark-internal > WARNING: An illegal reflective access operation has occurred > WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform > (file:/opt/spark/jars/spark-unsafe_2.12-3.2.0-1.jar) to constructor > java.nio.DirectByteBuffer(long,int) > WARNING: Please consider reporting this to the maintainers of > org.apache.spark.unsafe.Platform > WARNING: Use --illegal-access=warn to enable warnings of further illegal > reflective access operations > WARNING: All illegal access operations will be denied in a future release > 22/04/26 08:54:38 DEBUG MutableMetricsFactory: field > org.apache.hadoop.metrics2.lib.MutableRate > org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginSuccess with > annotation @org.apache.hadoop.metrics2.annotation.Metric(about="", > sampleName="Ops", always=false, type=DEFAULT, value={"Rate of successful > kerberos logins and latency (milliseconds)"}, valueName="Time") > 22/04/26 08:54:38 DEBUG MutableMetricsFactory: field > org.apache.hadoop.metrics2.lib.MutableRate > org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginFailure with > annotation @org.apache.hadoop.metrics2.annotation.Metric(about="", > sampleName="Ops", always=false, 
type=DEFAULT, value={"Rate of failed kerberos > logins and latency (milliseconds)"}, valueName="Time") > 22/04/26 08:54:38 DEBUG MutableMetricsFactory: field > org.apache.hadoop.metrics2.lib.MutableRate > org.apache.hadoop.security.UserGroupInformation$UgiMetrics.getGroups with > annotation @org.apache.hadoop.metrics2.annotation.Metric(about="", > sampleName="Ops", always=false, type=DEFAULT, value={"GetGroups"}, > valueName="Time") > 22/04/26 08:54:38 DEBUG MutableMetricsFactory: field private > org.apache.hadoop.metrics2.lib.MutableGaugeLong > org.apache.hadoop.security.UserGroupInformation$UgiMetrics.renewalFailuresTotal > with annotation @org.apache.hadoop.metrics2.annotation.Metric(about="", > sampleName="Ops", always=false, type=DEFAULT, value={"Renewal failures
[jira] [Commented] (SPARK-40425) DROP TABLE does not need to do table lookup
[ https://issues.apache.org/jira/browse/SPARK-40425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17604777#comment-17604777 ] Apache Spark commented on SPARK-40425: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/37879 > DROP TABLE does not need to do table lookup > --- > > Key: SPARK-40425 > URL: https://issues.apache.org/jira/browse/SPARK-40425 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org