[jira] [Updated] (SPARK-40439) DECIMAL value with more precision than what is defined in the schema raises exception in SparkSQL but evaluates to NULL for DataFrame

2022-09-14 Thread xsys (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xsys updated SPARK-40439:
-
Description: 
h3. Describe the bug

We are trying to store a DECIMAL value {{333.22}} with more 
precision than what is defined in the schema: {{DECIMAL(20,10)}}. This 
leads to a {{NULL}} value being stored if the table is created using DataFrames 
via {{spark-shell}}. However, it leads to the following exception if the 
table is created via {{spark-sql}}:
{code:java}
Failed in [insert into decimal_extra_precision select 333.22]
java.lang.ArithmeticException: Decimal(expanded,333.22,21,10) 
cannot be represented as Decimal(20, 10){code}
h3. Steps to reproduce

On Spark 3.2.1 (commit {{4f25b3f712}}), using {{spark-sql}}, execute the following:
{code:java}
create table decimal_extra_precision(c1 DECIMAL(20,10)) STORED AS ORC;
insert into decimal_extra_precision select 333.22;{code}
h3. Expected behavior

We expect the two Spark interfaces ({{spark-sql}} & {{spark-shell}}) to 
behave consistently for the same data type & input combination 
({{DECIMAL(20,10)}} and {{333.22}}).

Here is a simplified example in {{spark-shell}}, where insertion of the 
aforementioned decimal value evaluates to {{NULL}}:
{code:java}
scala> import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.{Row, SparkSession}
scala> import org.apache.spark.sql.types._
import org.apache.spark.sql.types._
scala> val rdd = sc.parallelize(Seq(Row(BigDecimal("333.22"))))
rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = 
ParallelCollectionRDD[0] at parallelize at <console>:27
scala> val schema = new StructType().add(StructField("c1", DecimalType(20, 10), 
true))
schema: org.apache.spark.sql.types.StructType = 
StructType(StructField(c1,DecimalType(20,10),true))
scala> val df = spark.createDataFrame(rdd, schema)
df: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
scala> df.show()
+----+
|  c1|
+----+
|null|
+----+
scala> df.write.mode("overwrite").format("orc").saveAsTable("decimal_extra_precision")
22/08/29 10:33:47 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, 
since hive.security.authorization.manager is set to instance of 
HiveAuthorizerFactory.
scala> spark.sql("select * from decimal_extra_precision;")
res2: org.apache.spark.sql.DataFrame = [c1: decimal(20,10)]
{code}
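Reading the table back with {{show()}} (a follow-up step of our own, not part 
of the original session; the output below is what we would expect, given the 
{{NULL}} already visible in the DataFrame) confirms that {{NULL}} is what got 
stored:
{code:java}
scala> spark.sql("select * from decimal_extra_precision").show()
+----+
|  c1|
+----+
|null|
+----+
{code}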
h3. Root Cause

The exception is raised from 
[Decimal.toPrecision|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala#L358-L373]
 ({{nullOnOverflow}} is controlled by {{spark.sql.ansi.enabled}} in 
[SQLConf|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L2542-L2551]):
{code:java}
  private[sql] def toPrecision(
      precision: Int,
      scale: Int,
      roundMode: BigDecimal.RoundingMode.Value = ROUND_HALF_UP,
      nullOnOverflow: Boolean = true,
      context: SQLQueryContext = null): Decimal = {
    val copy = clone()
    if (copy.changePrecision(precision, scale, roundMode)) {
      copy
    } else {
      if (nullOnOverflow) {
        null
      } else {
        throw QueryExecutionErrors.cannotChangeDecimalPrecisionError(
          this, precision, scale, context)
      }
    }
  }{code}
The above {{toPrecision}} function is invoked from 
[Cast|https://github.com/apache/spark/blob/v3.2.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala#L754-L756]
 (in Cast.scala). However, our attempt to insert {{333.22}} 
after setting {{spark.sql.ansi.enabled}} to {{false}} failed as well (which 
may be an independent issue).
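
For reference, here is a minimal spark-shell sketch of our own (not from the 
original report) of how {{spark.sql.ansi.enabled}} is expected to select 
between the two branches of {{toPrecision}}. The literal {{12345678901.23}} is 
our choice: it needs 11 integral digits, so it genuinely overflows 
{{DECIMAL(20,10)}}, which allows at most 10:
{code:java}
// Our own sketch, not from the original report.
// Legacy mode: nullOnOverflow = true, so the overflowing cast should yield NULL.
scala> spark.conf.set("spark.sql.ansi.enabled", "false")
scala> spark.sql("select cast(12345678901.23 as decimal(20,10)) as c1").show()

// ANSI mode: nullOnOverflow = false, so the same cast should throw
// java.lang.ArithmeticException: ... cannot be represented as Decimal(20, 10).
scala> spark.conf.set("spark.sql.ansi.enabled", "true")
scala> spark.sql("select cast(12345678901.23 as decimal(20,10)) as c1").show()
{code}
If the insert path honored this toggle, setting {{spark.sql.ansi.enabled}} to 
{{false}} should have made the failing {{spark-sql}} insert return {{NULL}} 
instead of throwing, which is why the failure noted above may be an 
independent issue.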


[jira] [Created] (SPARK-40439) DECIMAL value with more precision than what is defined in the schema raises exception in SparkSQL but evaluates to NULL for DataFrame

2022-09-14 Thread xsys (Jira)
xsys created SPARK-40439:


 Summary: DECIMAL value with more precision than what is defined in 
the schema raises exception in SparkSQL but evaluates to NULL for DataFrame
 Key: SPARK-40439
 URL: https://issues.apache.org/jira/browse/SPARK-40439
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.1
Reporter: xsys





--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org




[jira] [Commented] (SPARK-40435) Add test suites for applyInPandasWithState in PySpark

2022-09-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605101#comment-17605101
 ] 

Apache Spark commented on SPARK-40435:
--

User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/37894

> Add test suites for applyInPandasWithState in PySpark
> -
>
> Key: SPARK-40435
> URL: https://issues.apache.org/jira/browse/SPARK-40435
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> Basically port the test suite from Scala/Java version of API to Python API. 
> Have e2e test suite purely implemented with python.









[jira] [Assigned] (SPARK-40435) Add test suites for applyInPandasWithState in PySpark

2022-09-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40435:


Assignee: (was: Apache Spark)

> Add test suites for applyInPandasWithState in PySpark
> -
>
> Key: SPARK-40435
> URL: https://issues.apache.org/jira/browse/SPARK-40435
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> Basically port the test suite from Scala/Java version of API to Python API. 
> Have e2e test suite purely implemented with python.






[jira] [Assigned] (SPARK-40435) Add test suites for applyInPandasWithState in PySpark

2022-09-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40435:


Assignee: Apache Spark

> Add test suites for applyInPandasWithState in PySpark
> -
>
> Key: SPARK-40435
> URL: https://issues.apache.org/jira/browse/SPARK-40435
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Jungtaek Lim
>Assignee: Apache Spark
>Priority: Major
>
> Basically port the test suite from Scala/Java version of API to Python API. 
> Have e2e test suite purely implemented with python.






[jira] [Assigned] (SPARK-40434) Implement applyInPandasWithState in PySpark

2022-09-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40434:


Assignee: Apache Spark

> Implement applyInPandasWithState in PySpark
> ---
>
> Key: SPARK-40434
> URL: https://issues.apache.org/jira/browse/SPARK-40434
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Jungtaek Lim
>Assignee: Apache Spark
>Priority: Major
>
> Provide the full implementation of flatMapGroupsWithState equivalent API in 
> PySpark. We could optionally introduce test suites in following JIRA ticket 
> if the PR is too huge.






[jira] [Assigned] (SPARK-40434) Implement applyInPandasWithState in PySpark

2022-09-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40434:


Assignee: (was: Apache Spark)

> Implement applyInPandasWithState in PySpark
> ---
>
> Key: SPARK-40434
> URL: https://issues.apache.org/jira/browse/SPARK-40434
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> Provide the full implementation of flatMapGroupsWithState equivalent API in 
> PySpark. We could optionally introduce test suites in following JIRA ticket 
> if the PR is too huge.






[jira] [Commented] (SPARK-40434) Implement applyInPandasWithState in PySpark

2022-09-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605095#comment-17605095
 ] 

Apache Spark commented on SPARK-40434:
--

User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/37893

> Implement applyInPandasWithState in PySpark
> ---
>
> Key: SPARK-40434
> URL: https://issues.apache.org/jira/browse/SPARK-40434
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> Provide the full implementation of flatMapGroupsWithState equivalent API in 
> PySpark. We could optionally introduce test suites in following JIRA ticket 
> if the PR is too huge.






[jira] [Commented] (SPARK-40437) Support string representation of durationMs in GroupState.setTimeoutDuration

2022-09-14 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605089#comment-17605089
 ] 

Hyukjin Kwon commented on SPARK-40437:
--

[~kabhwan] I didn't add this to SPARK-40431 because I think this isn't our 
priority for the initial implementation but feel free to add it. I don't mind 
either way.

> Support string representation of durationMs in GroupState.setTimeoutDuration
> 
>
> Key: SPARK-40437
> URL: https://issues.apache.org/jira/browse/SPARK-40437
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> GroupStateImpl.setTimeoutDuration should support string representation to 
> match with Scala's side support.






[jira] [Commented] (SPARK-40438) Support additionalDuration parameter in GroupState.setTimeoutTimestamp

2022-09-14 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605088#comment-17605088
 ] 

Hyukjin Kwon commented on SPARK-40438:
--

[~kabhwan] I didn't add this to SPARK-40431 because I think this isn't our 
priority for the initial implementation but feel free to add it. I don't mind 
either way.

> Support additionalDuration parameter in GroupState.setTimeoutTimestamp
> --
>
> Key: SPARK-40438
> URL: https://issues.apache.org/jira/browse/SPARK-40438
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> GroupState.setTimeoutTimestamp should support additionalDuration parameter to 
> match with Scala's side support.






[jira] [Updated] (SPARK-40438) Support additionalDuration parameter in GroupState.setTimeoutTimestamp

2022-09-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-40438:
-
Summary: Support additionalDuration parameter in 
GroupState.setTimeoutTimestamp  (was: Support  in 
GroupState.setTimeoutTimestamp)

> Support additionalDuration parameter in GroupState.setTimeoutTimestamp
> --
>
> Key: SPARK-40438
> URL: https://issues.apache.org/jira/browse/SPARK-40438
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> GroupStateImpl.additionalDuration should support string representation to 
> match with Scala's side support.






[jira] [Updated] (SPARK-40438) Support additionalDuration parameter in GroupState.setTimeoutTimestamp

2022-09-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-40438:
-
Description: GroupState.setTimeoutTimestamp should support 
additionalDuration parameter to match with Scala's side support.  (was: 
GroupState.setTimeoutTimestamp should support string representation to match 
with Scala's side support.)

> Support additionalDuration parameter in GroupState.setTimeoutTimestamp
> --
>
> Key: SPARK-40438
> URL: https://issues.apache.org/jira/browse/SPARK-40438
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> GroupState.setTimeoutTimestamp should support additionalDuration parameter to 
> match with Scala's side support.






[jira] [Updated] (SPARK-40438) Support in GroupState.setTimeoutTimestamp

2022-09-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-40438:
-
Summary: Support  in GroupState.setTimeoutTimestamp  (was: Implement 
additionalDuration parameter in GroupState)

> Support  in GroupState.setTimeoutTimestamp
> --
>
> Key: SPARK-40438
> URL: https://issues.apache.org/jira/browse/SPARK-40438
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> GroupStateImpl.additionalDuration should support string representation to 
> match with Scala's side support.






[jira] [Updated] (SPARK-40437) Support string representation of durationMs in GroupState.setTimeoutDuration

2022-09-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-40437:
-
Summary: Support string representation of durationMs in 
GroupState.setTimeoutDuration  (was: Support string representation of 
durationMs in GroupState)

> Support string representation of durationMs in GroupState.setTimeoutDuration
> 
>
> Key: SPARK-40437
> URL: https://issues.apache.org/jira/browse/SPARK-40437
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> GroupStateImpl.setTimeoutDuration should support string representation to 
> match with Scala's side support.






[jira] [Updated] (SPARK-40438) Implement additionalDuration parameter in GroupState

2022-09-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-40438:
-
Description: GroupStateImpl.additionalDuration should support string 
representation to match with Scala's side support.  (was: 
GroupStateImpl.setTimeoutDuration should support string representation to match 
with Scala's side support.)

> Implement additionalDuration parameter in GroupState
> 
>
> Key: SPARK-40438
> URL: https://issues.apache.org/jira/browse/SPARK-40438
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> GroupStateImpl.additionalDuration should support string representation to 
> match with Scala's side support.






[jira] [Updated] (SPARK-40438) Support additionalDuration parameter in GroupState.setTimeoutTimestamp

2022-09-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-40438:
-
Description: GroupState.setTimeoutTimestamp should support string 
representation to match with Scala's side support.  (was: 
GroupStateImpl.additionalDuration should support string representation to match 
with Scala's side support.)

> Support additionalDuration parameter in GroupState.setTimeoutTimestamp
> --
>
> Key: SPARK-40438
> URL: https://issues.apache.org/jira/browse/SPARK-40438
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> GroupState.setTimeoutTimestamp should support string representation to match 
> with Scala's side support.






[jira] [Updated] (SPARK-40438) Implement additionalDuration parameter in GroupState

2022-09-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-40438:
-
Description: GroupStateImpl.setTimeoutDuration should support string 
representation to match with Scala's side support.

> Implement additionalDuration parameter in GroupState
> 
>
> Key: SPARK-40438
> URL: https://issues.apache.org/jira/browse/SPARK-40438
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> GroupStateImpl.setTimeoutDuration should support string representation to 
> match with Scala's side support.






[jira] [Created] (SPARK-40438) Implement additionalDuration parameter in GroupState

2022-09-14 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-40438:


 Summary: Implement additionalDuration parameter in GroupState
 Key: SPARK-40438
 URL: https://issues.apache.org/jira/browse/SPARK-40438
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.4.0
Reporter: Hyukjin Kwon









[jira] [Created] (SPARK-40437) Support string representation of durationMs in GroupState

2022-09-14 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-40437:


 Summary: Support string representation of durationMs in GroupState
 Key: SPARK-40437
 URL: https://issues.apache.org/jira/browse/SPARK-40437
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.4.0
Reporter: Hyukjin Kwon


GroupStateImpl.setTimeoutDuration should support string representation to match 
with Scala's side support.






[jira] [Updated] (SPARK-40437) Support string representation of durationMs in GroupState

2022-09-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-40437:
-
Priority: Minor  (was: Major)

> Support string representation of durationMs in GroupState
> -
>
> Key: SPARK-40437
> URL: https://issues.apache.org/jira/browse/SPARK-40437
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> GroupStateImpl.setTimeoutDuration should support string representation to 
> match with Scala's side support.






[jira] [Commented] (SPARK-40436) Upgrade Scala to 2.12.17

2022-09-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605082#comment-17605082
 ] 

Apache Spark commented on SPARK-40436:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/37892

> Upgrade Scala to 2.12.17
> 
>
> Key: SPARK-40436
> URL: https://issues.apache.org/jira/browse/SPARK-40436
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>
> https://github.com/scala/scala/releases/tag/v2.12.17






[jira] [Assigned] (SPARK-40436) Upgrade Scala to 2.12.17

2022-09-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40436:


Assignee: (was: Apache Spark)

> Upgrade Scala to 2.12.17
> 
>
> Key: SPARK-40436
> URL: https://issues.apache.org/jira/browse/SPARK-40436
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Priority: Minor
>
> https://github.com/scala/scala/releases/tag/v2.12.17






[jira] [Assigned] (SPARK-40436) Upgrade Scala to 2.12.17

2022-09-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40436:


Assignee: Apache Spark

> Upgrade Scala to 2.12.17
> 
>
> Key: SPARK-40436
> URL: https://issues.apache.org/jira/browse/SPARK-40436
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Minor
>
> https://github.com/scala/scala/releases/tag/v2.12.17









[jira] [Assigned] (SPARK-40433) Add toJVMRow in PythonSQLUtils to convert pickled PySpark Row to JVM Row

2022-09-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40433:


Assignee: (was: Apache Spark)

> Add toJVMRow in PythonSQLUtils to convert pickled PySpark Row to JVM Row
> 
>
> Key: SPARK-40433
> URL: https://issues.apache.org/jira/browse/SPARK-40433
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> Adds toJVMRow in PythonSQLUtils to convert pickled PySpark Row to JVM Row.






[jira] [Assigned] (SPARK-40433) Add toJVMRow in PythonSQLUtils to convert pickled PySpark Row to JVM Row

2022-09-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40433:


Assignee: Apache Spark

> Add toJVMRow in PythonSQLUtils to convert pickled PySpark Row to JVM Row
> 
>
> Key: SPARK-40433
> URL: https://issues.apache.org/jira/browse/SPARK-40433
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Jungtaek Lim
>Assignee: Apache Spark
>Priority: Major
>
> Adds toJVMRow in PythonSQLUtils to convert pickled PySpark Row to JVM Row.






[jira] [Created] (SPARK-40436) Upgrade Scala to 2.12.17

2022-09-14 Thread Yang Jie (Jira)
Yang Jie created SPARK-40436:


 Summary: Upgrade Scala to 2.12.17
 Key: SPARK-40436
 URL: https://issues.apache.org/jira/browse/SPARK-40436
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.4.0
Reporter: Yang Jie


https://github.com/scala/scala/releases/tag/v2.12.17






[jira] [Commented] (SPARK-40433) Add toJVMRow in PythonSQLUtils to convert pickled PySpark Row to JVM Row

2022-09-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605073#comment-17605073
 ] 

Apache Spark commented on SPARK-40433:
--

User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/37891

> Add toJVMRow in PythonSQLUtils to convert pickled PySpark Row to JVM Row
> 
>
> Key: SPARK-40433
> URL: https://issues.apache.org/jira/browse/SPARK-40433
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> Adds toJVMRow in PythonSQLUtils to convert pickled PySpark Row to JVM Row.






[jira] [Commented] (SPARK-40339) Implement `Expanding.quantile`.

2022-09-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605072#comment-17605072
 ] 

Apache Spark commented on SPARK-40339:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/37890

> Implement `Expanding.quantile`.
> ---
>
> Key: SPARK-40339
> URL: https://issues.apache.org/jira/browse/SPARK-40339
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Yikun Jiang
>Priority: Major
> Fix For: 3.4.0
>
>
> We should implement `Expanding.quantile` for increasing pandas API coverage.
> pandas docs: 
> https://pandas.pydata.org/docs/reference/api/pandas.core.window.expanding.Expanding.quantile.html






[jira] [Commented] (SPARK-40342) Implement `Rolling.quantile`.

2022-09-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605071#comment-17605071
 ] 

Apache Spark commented on SPARK-40342:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/37890

> Implement `Rolling.quantile`.
> -
>
> Key: SPARK-40342
> URL: https://issues.apache.org/jira/browse/SPARK-40342
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Yikun Jiang
>Priority: Major
> Fix For: 3.4.0
>
>
> We should implement `Rolling.quantile` for increasing pandas API coverage.
> pandas docs: 
> https://pandas.pydata.org/docs/reference/api/pandas.core.window.rolling.Rolling.quantile.html






[jira] [Commented] (SPARK-40339) Implement `Expanding.quantile`.

2022-09-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605070#comment-17605070
 ] 

Apache Spark commented on SPARK-40339:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/37890

> Implement `Expanding.quantile`.
> ---
>
> Key: SPARK-40339
> URL: https://issues.apache.org/jira/browse/SPARK-40339
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Yikun Jiang
>Priority: Major
> Fix For: 3.4.0
>
>
> We should implement `Expanding.quantile` to increase pandas API coverage.
> pandas docs: 
> https://pandas.pydata.org/docs/reference/api/pandas.core.window.expanding.Expanding.quantile.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40432) Introduce GroupStateImpl and GroupStateTimeout in PySpark

2022-09-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605068#comment-17605068
 ] 

Apache Spark commented on SPARK-40432:
--

User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/37889

> Introduce GroupStateImpl and GroupStateTimeout in PySpark
> -
>
> Key: SPARK-40432
> URL: https://issues.apache.org/jira/browse/SPARK-40432
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> Introduces GroupStateImpl and GroupStateTimeout in PySpark, and updates the 
> Scala codebase to support convenient conversion between the PySpark and 
> Scala implementations.
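A rough sketch of what the new classes could expose, assuming they mirror the
Scala GroupState/GroupStateTimeout API; the import path and attribute names
are illustrative, not final:

{code:python}
# Hypothetical module path for the classes this ticket introduces.
from pyspark.sql.streaming.state import GroupState, GroupStateTimeout

# Timeout policies mirroring org.apache.spark.sql.streaming.GroupStateTimeout:
GroupStateTimeout.NoTimeout
GroupStateTimeout.ProcessingTimeTimeout
GroupStateTimeout.EventTimeTimeout
{code}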



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40432) Introduce GroupStateImpl and GroupStateTimeout in PySpark

2022-09-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605067#comment-17605067
 ] 

Apache Spark commented on SPARK-40432:
--

User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/37889

> Introduce GroupStateImpl and GroupStateTimeout in PySpark
> -
>
> Key: SPARK-40432
> URL: https://issues.apache.org/jira/browse/SPARK-40432
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> Introduces GroupStateImpl and GroupStateTimeout in PySpark, and updates the 
> Scala codebase to support convenient conversion between the PySpark and 
> Scala implementations.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40432) Introduce GroupStateImpl and GroupStateTimeout in PySpark

2022-09-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40432:


Assignee: (was: Apache Spark)

> Introduce GroupStateImpl and GroupStateTimeout in PySpark
> -
>
> Key: SPARK-40432
> URL: https://issues.apache.org/jira/browse/SPARK-40432
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> Introduces GroupStateImpl and GroupStateTimeout in PySpark, and updates the 
> Scala codebase to support convenient conversion between the PySpark and 
> Scala implementations.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40432) Introduce GroupStateImpl and GroupStateTimeout in PySpark

2022-09-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40432:


Assignee: Apache Spark

> Introduce GroupStateImpl and GroupStateTimeout in PySpark
> -
>
> Key: SPARK-40432
> URL: https://issues.apache.org/jira/browse/SPARK-40432
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Jungtaek Lim
>Assignee: Apache Spark
>Priority: Major
>
> Introduces GroupStateImpl and GroupStateTimeout in PySpark, and updates the 
> Scala codebase to support convenient conversion between the PySpark and 
> Scala implementations.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40435) Add test suites for applyInPandasWithState in PySpark

2022-09-14 Thread Jungtaek Lim (Jira)
Jungtaek Lim created SPARK-40435:


 Summary: Add test suites for applyInPandasWithState in PySpark
 Key: SPARK-40435
 URL: https://issues.apache.org/jira/browse/SPARK-40435
 Project: Spark
  Issue Type: Sub-task
  Components: Structured Streaming
Affects Versions: 3.4.0
Reporter: Jungtaek Lim


Basically, this ports the test suite from the Scala/Java version of the API to 
the Python API, and adds an e2e test suite implemented purely in Python.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40434) Implement applyInPandasWithState in PySpark

2022-09-14 Thread Jungtaek Lim (Jira)
Jungtaek Lim created SPARK-40434:


 Summary: Implement applyInPandasWithState in PySpark
 Key: SPARK-40434
 URL: https://issues.apache.org/jira/browse/SPARK-40434
 Project: Spark
  Issue Type: Sub-task
  Components: Structured Streaming
Affects Versions: 3.4.0
Reporter: Jungtaek Lim


Provide the full implementation of a flatMapGroupsWithState-equivalent API in 
PySpark. We could optionally introduce the test suites in a follow-up JIRA 
ticket if the PR is too large.
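To make the goal concrete, here is a rough sketch of what a
flatMapGroupsWithState-style API could look like from PySpark; the method
name, schema strings, and state accessors are illustrative assumptions, not
the final API:

{code:python}
import pandas as pd
from pyspark.sql.streaming.state import GroupStateTimeout  # assumed module path

def count_events(key, pdf_iter, state):
    # Keep a running count per key in a single-column state tuple.
    running = state.get[0] if state.exists else 0
    for pdf in pdf_iter:
        running += len(pdf)
    state.update((running,))
    yield pd.DataFrame({"user": [key[0]], "count": [running]})

# 'events' is assumed to be a streaming DataFrame with a 'user' column.
counts = (events.groupBy("user")
    .applyInPandasWithState(count_events,
                            outputStructType="user string, count long",
                            stateStructType="count long",
                            outputMode="update",
                            timeoutConf=GroupStateTimeout.NoTimeout))
{code}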



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40433) Add toJVMRow in PythonSQLUtils to convert pickled PySpark Row to JVM Row

2022-09-14 Thread Jungtaek Lim (Jira)
Jungtaek Lim created SPARK-40433:


 Summary: Add toJVMRow in PythonSQLUtils to convert pickled PySpark 
Row to JVM Row
 Key: SPARK-40433
 URL: https://issues.apache.org/jira/browse/SPARK-40433
 Project: Spark
  Issue Type: Sub-task
  Components: Structured Streaming
Affects Versions: 3.4.0
Reporter: Jungtaek Lim


Adds toJVMRow in PythonSQLUtils to convert a pickled PySpark Row to a JVM Row.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40432) Introduce GroupStateImpl and GroupStateTimeout in PySpark

2022-09-14 Thread Jungtaek Lim (Jira)
Jungtaek Lim created SPARK-40432:


 Summary: Introduce GroupStateImpl and GroupStateTimeout in PySpark
 Key: SPARK-40432
 URL: https://issues.apache.org/jira/browse/SPARK-40432
 Project: Spark
  Issue Type: Sub-task
  Components: Structured Streaming
Affects Versions: 3.4.0
Reporter: Jungtaek Lim


Introduces GroupStateImpl and GroupStateTimeout in PySpark, and updates the 
Scala codebase to support convenient conversion between the PySpark and Scala 
implementations.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40431) Introduce "Arbitrary Stateful Processing" in Structured Streaming with Python

2022-09-14 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605063#comment-17605063
 ] 

Jungtaek Lim commented on SPARK-40431:
--

This is a joint effort between [~hyukjin.kwon] and me. I'll split the PR into 
multiple pieces and match each PR to the corresponding subtask.

> Introduce "Arbitrary Stateful Processing" in Structured Streaming with Python
> -
>
> Key: SPARK-40431
> URL: https://issues.apache.org/jira/browse/SPARK-40431
> Project: Spark
>  Issue Type: Umbrella
>  Components: Structured Streaming
>Affects Versions: 3.4.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> This is part of the effort for SPARK-39590, Python API parity in Structured 
> Streaming.
> Most public APIs are available in both Scala/Java Spark and PySpark, but we 
> have a large gap on streaming workloads in PySpark, as we don't have a 
> matching API for flatMapGroupsWithState.
> This ticket tracks that effort.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40431) Introduce "Arbitrary Stateful Processing" in Structured Streaming with Python

2022-09-14 Thread Jungtaek Lim (Jira)
Jungtaek Lim created SPARK-40431:


 Summary: Introduce "Arbitrary Stateful Processing" in Structured 
Streaming with Python
 Key: SPARK-40431
 URL: https://issues.apache.org/jira/browse/SPARK-40431
 Project: Spark
  Issue Type: Umbrella
  Components: Structured Streaming
Affects Versions: 3.4.0
Reporter: Jungtaek Lim


This is part of the effort for SPARK-39590, Python API parity in Structured 
Streaming.

Most public APIs are available in both Scala/Java Spark and PySpark, but we 
have a large gap on streaming workloads in PySpark, as we don't have a 
matching API for flatMapGroupsWithState.

This ticket tracks that effort.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40421) Make `spearman` correlation in `DataFrame.corr` support missing values and `min_periods`

2022-09-14 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-40421.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37874
[https://github.com/apache/spark/pull/37874]

> Make `spearman` correlation in `DataFrame.corr` support missing values and 
> `min_periods`
> 
>
> Key: SPARK-40421
> URL: https://issues.apache.org/jira/browse/SPARK-40421
> Project: Spark
>  Issue Type: Sub-task
>  Components: ps
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>
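A sketch of the intended usage once this lands, mirroring the pandas
arguments (the data here is illustrative):

{code:python}
import pyspark.pandas as ps

psdf = ps.DataFrame({"a": [1.0, 2.0, None, 4.0], "b": [4.0, 3.0, 2.0, 1.0]})
# Spearman correlation that tolerates the missing value and requires at
# least 3 overlapping observations per column pair:
print(psdf.corr(method="spearman", min_periods=3))
{code}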




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40421) Make `spearman` correlation in `DataFrame.corr` support missing values and `min_periods`

2022-09-14 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-40421:
-

Assignee: Ruifeng Zheng

> Make `spearman` correlation in `DataFrame.corr` support missing values and 
> `min_periods`
> 
>
> Key: SPARK-40421
> URL: https://issues.apache.org/jira/browse/SPARK-40421
> Project: Spark
>  Issue Type: Sub-task
>  Components: ps
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40430) Spark session does not update number of files for partition

2022-09-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-40430:
-
Component/s: SQL
 (was: Spark Core)

> Spark session does not update number of files for partition
> ---
>
> Key: SPARK-40430
> URL: https://issues.apache.org/jira/browse/SPARK-40430
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
> Environment: I'm using spark 3.1.2 on AWS EMR and AWS Glue as catalog.
>Reporter: Filipe Souza
>Priority: Minor
> Attachments: session 1.png, session 2.png
>
>
> When a Spark session has already queried data from a table partition and new 
> files are then inserted into that partition externally, the session keeps 
> the outdated number of files and does not return the new records.
> If the data is inserted into a new partition, the problem does not occur.
> Steps to reproduce the behavior:
> 1. Open a Spark session.
> 2. Query a count on a table.
> 3. Open another Spark session.
> 4. Insert data into an existing partition.
> 5. Check the count again in the first session.
> I expect to see the inserted records.
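A workaround consistent with the behavior described above is to invalidate the
cached metadata before re-running the query; table and partition names below
are hypothetical:

{code:python}
# Session 1, after Session 2 has appended files to an existing partition:
spark.sql("SELECT COUNT(*) FROM mydb.events WHERE dt = '2022-09-13'").show()  # stale

# REFRESH TABLE drops the cached metadata (including the file listing),
# so the next scan picks up the externally added files:
spark.sql("REFRESH TABLE mydb.events")
spark.sql("SELECT COUNT(*) FROM mydb.events WHERE dt = '2022-09-13'").show()
{code}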



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40426) Return a map from SparkThrowable.getMessageParameters

2022-09-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-40426.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37871
[https://github.com/apache/spark/pull/37871]

> Return a map from SparkThrowable.getMessageParameters
> -
>
> Key: SPARK-40426
> URL: https://issues.apache.org/jira/browse/SPARK-40426
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.4.0
>
>
> Change the interface SparkThrowable to return a map from 
> getMessageParameters().



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40342) Implement `Rolling.quantile`.

2022-09-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-40342.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37836
[https://github.com/apache/spark/pull/37836]

> Implement `Rolling.quantile`.
> -
>
> Key: SPARK-40342
> URL: https://issues.apache.org/jira/browse/SPARK-40342
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Yikun Jiang
>Priority: Major
> Fix For: 3.4.0
>
>
> We should implement `Rolling.quantile` to increase pandas API coverage.
> pandas docs: 
> https://pandas.pydata.org/docs/reference/api/pandas.core.window.rolling.Rolling.quantile.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40339) Implement `Expanding.quantile`.

2022-09-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-40339.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37836
[https://github.com/apache/spark/pull/37836]

> Implement `Expanding.quantile`.
> ---
>
> Key: SPARK-40339
> URL: https://issues.apache.org/jira/browse/SPARK-40339
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Yikun Jiang
>Priority: Major
> Fix For: 3.4.0
>
>
> We should implement `Expanding.quantile` to increase pandas API coverage.
> pandas docs: 
> https://pandas.pydata.org/docs/reference/api/pandas.core.window.expanding.Expanding.quantile.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40345) Implement `ExpandingGroupby.quantile`.

2022-09-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-40345.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37836
[https://github.com/apache/spark/pull/37836]

> Implement `ExpandingGroupby.quantile`.
> --
>
> Key: SPARK-40345
> URL: https://issues.apache.org/jira/browse/SPARK-40345
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Yikun Jiang
>Priority: Major
> Fix For: 3.4.0
>
>
> We should implement `ExpandingGroupby.quantile` to increase pandas API 
> coverage.
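Again the pandas behavior is the target; a minimal sketch in plain pandas
showing the per-group expanding quantile:

{code:python}
import pandas as pd

df = pd.DataFrame({"k": ["x", "x", "x", "y", "y"], "v": [1, 2, 3, 10, 20]})
# Cumulative median computed independently within each group:
print(df.groupby("k")["v"].expanding(min_periods=2).quantile(0.5))
{code}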



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40345) Implement `ExpandingGroupby.quantile`.

2022-09-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-40345:


Assignee: Yikun Jiang

> Implement `ExpandingGroupby.quantile`.
> --
>
> Key: SPARK-40345
> URL: https://issues.apache.org/jira/browse/SPARK-40345
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Yikun Jiang
>Priority: Major
>
> We should implement `ExpandingGroupby.quantile` to increase pandas API 
> coverage.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40342) Implement `Rolling.quantile`.

2022-09-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-40342:


Assignee: Yikun Jiang

> Implement `Rolling.quantile`.
> -
>
> Key: SPARK-40342
> URL: https://issues.apache.org/jira/browse/SPARK-40342
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Yikun Jiang
>Priority: Major
>
> We should implement `Rolling.quantile` to increase pandas API coverage.
> pandas docs: 
> https://pandas.pydata.org/docs/reference/api/pandas.core.window.rolling.Rolling.quantile.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40348) Implement `RollingGroupby.quantile`.

2022-09-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-40348:


Assignee: Yikun Jiang

> Implement `RollingGroupby.quantile`.
> 
>
> Key: SPARK-40348
> URL: https://issues.apache.org/jira/browse/SPARK-40348
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Yikun Jiang
>Priority: Major
>
> We should implement `RollingGroupby.quantile` to increase pandas API 
> coverage.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40348) Implement `RollingGroupby.quantile`.

2022-09-14 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-40348.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 37836
[https://github.com/apache/spark/pull/37836]

> Implement `RollingGroupby.quantile`.
> 
>
> Key: SPARK-40348
> URL: https://issues.apache.org/jira/browse/SPARK-40348
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Yikun Jiang
>Priority: Major
> Fix For: 3.4.0
>
>
> We should implement `RollingGroupby.quantile` to increase pandas API 
> coverage.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40397) Migrate selenium-java from 3.1 to 4.2 and upgrade org.scalatestplus:selenium to 3.2.13.0

2022-09-14 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta resolved SPARK-40397.

Fix Version/s: 3.4.0
 Assignee: Yang Jie
   Resolution: Fixed

Issue resolved in https://github.com/apache/spark/pull/37868

> Migrate selenium-java from 3.1 to 4.2 and upgrade org.scalatestplus:selenium 
> to 3.2.13.0
> 
>
> Key: SPARK-40397
> URL: https://issues.apache.org/jira/browse/SPARK-40397
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Tests
>Affects Versions: 3.4.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40334) Implement `GroupBy.prod`.

2022-09-14 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-40334:
-

Assignee: Artsiom Yudovin  (was: Haejoon Lee)

> Implement `GroupBy.prod`.
> -
>
> Key: SPARK-40334
> URL: https://issues.apache.org/jira/browse/SPARK-40334
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Artsiom Yudovin
>Priority: Major
>
> We should implement `GroupBy.prod` to increase pandas API coverage.
> pandas docs: 
> https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.prod.html
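The pandas behavior being matched, as a minimal sketch:

{code:python}
import pandas as pd

df = pd.DataFrame({"k": ["x", "x", "y"], "v": [2, 3, 4]})
# Product of each group's values: x -> 6, y -> 4
print(df.groupby("k").prod())
{code}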



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40196) Consolidate `lit` function with NumPy scalar in sql and pandas module

2022-09-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605013#comment-17605013
 ] 

Apache Spark commented on SPARK-40196:
--

User 'xinrong-meng' has created a pull request for this issue:
https://github.com/apache/spark/pull/37888

> Consolidate `lit` function with NumPy scalar in sql and pandas module
> -
>
> Key: SPARK-40196
> URL: https://issues.apache.org/jira/browse/SPARK-40196
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Per [https://github.com/apache/spark/pull/37560#discussion_r952882996], the 
> `lit` function with a NumPy scalar has different implementations in the sql 
> and pandas modules; thus, sql has a less precise result than pandas.
> We should make their results consistent; the more precise, the better.
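A minimal illustration of the two entry points being consolidated; 'spark' is
an active SparkSession, NumPy-scalar support in `lit` is assumed from the
parent task, and the exact printed precision depends on the Spark version:

{code:python}
import numpy as np
import pyspark.pandas as ps
from pyspark.sql.functions import lit

value = np.float64(1e-18)

# sql-module path:
spark.range(1).select(lit(value).alias("v")).show(truncate=False)

# pandas-on-Spark path, which converts the same scalar through pandas:
print(ps.Series([value]))
{code}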



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40196) Consolidate `lit` function with NumPy scalar in sql and pandas module

2022-09-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605012#comment-17605012
 ] 

Apache Spark commented on SPARK-40196:
--

User 'xinrong-meng' has created a pull request for this issue:
https://github.com/apache/spark/pull/37888

> Consolidate `lit` function with NumPy scalar in sql and pandas module
> -
>
> Key: SPARK-40196
> URL: https://issues.apache.org/jira/browse/SPARK-40196
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Per [https://github.com/apache/spark/pull/37560#discussion_r952882996], the 
> `lit` function with a NumPy scalar has different implementations in the sql 
> and pandas modules; thus, sql has a less precise result than pandas.
> We should make their results consistent; the more precise, the better.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40196) Consolidate `lit` function with NumPy scalar in sql and pandas module

2022-09-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40196:


Assignee: (was: Apache Spark)

> Consolidate `lit` function with NumPy scalar in sql and pandas module
> -
>
> Key: SPARK-40196
> URL: https://issues.apache.org/jira/browse/SPARK-40196
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Per [https://github.com/apache/spark/pull/37560#discussion_r952882996], the 
> `lit` function with a NumPy scalar has different implementations in the sql 
> and pandas modules; thus, sql has a less precise result than pandas.
> We should make their results consistent; the more precise, the better.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40196) Consolidate `lit` function with NumPy scalar in sql and pandas module

2022-09-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40196:


Assignee: Apache Spark

> Consolidate `lit` function with NumPy scalar in sql and pandas module
> -
>
> Key: SPARK-40196
> URL: https://issues.apache.org/jira/browse/SPARK-40196
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Assignee: Apache Spark
>Priority: Major
>
> Per [https://github.com/apache/spark/pull/37560#discussion_r952882996], the 
> `lit` function with a NumPy scalar has different implementations in the sql 
> and pandas modules; thus, sql has a less precise result than pandas.
> We should make their results consistent; the more precise, the better.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40196) Consolidate `lit` function with NumPy scalar in sql and pandas module

2022-09-14 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-40196:
-
Description: 
Per [https://github.com/apache/spark/pull/37560#discussion_r952882996], the 
`lit` function with a NumPy scalar has different implementations in the sql 
and pandas modules; thus, sql has a less precise result than pandas.

We should make their results consistent; the more precise, the better.

  was:
Per [https://github.com/apache/spark/pull/37560#discussion_r952882996], the 
`lit` function with NumPy input has different implementations in the sql and 
pandas modules; thus, sql has a less precise result than pandas.

We should make their results consistent; the more precise, the better.


> Consolidate `lit` function with NumPy scalar in sql and pandas module
> -
>
> Key: SPARK-40196
> URL: https://issues.apache.org/jira/browse/SPARK-40196
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Per [https://github.com/apache/spark/pull/37560#discussion_r952882996], the 
> `lit` function with a NumPy scalar has different implementations in the sql 
> and pandas modules; thus, sql has a less precise result than pandas.
> We should make their results consistent; the more precise, the better.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40196) Consolidate `lit` function with NumPy scalar in sql and pandas module

2022-09-14 Thread Xinrong Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xinrong Meng updated SPARK-40196:
-
Summary: Consolidate `lit` function with NumPy scalar in sql and pandas 
module  (was: Consolidate `lit` function with NumPy input in sql and pandas 
module)

> Consolidate `lit` function with NumPy scalar in sql and pandas module
> -
>
> Key: SPARK-40196
> URL: https://issues.apache.org/jira/browse/SPARK-40196
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.4.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Per [https://github.com/apache/spark/pull/37560#discussion_r952882996], the 
> `lit` function with NumPy input has different implementations in the sql and 
> pandas modules; thus, sql has a less precise result than pandas.
> We should make their results consistent; the more precise, the better.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40360) Convert some DDL exception to new error framework

2022-09-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605008#comment-17605008
 ] 

Apache Spark commented on SPARK-40360:
--

User 'srielau' has created a pull request for this issue:
https://github.com/apache/spark/pull/37887

> Convert some DDL exception to new error framework
> -
>
> Key: SPARK-40360
> URL: https://issues.apache.org/jira/browse/SPARK-40360
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Serge Rielau
>Assignee: Apache Spark
>Priority: Major
>
> Tackling the following files:
> sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/AlreadyExistException.scala
> sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/NoSuchItemException.scala
> sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CannotReplaceMissingTableException.scala
> sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/NonEmptyException.scala
> sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/unresolved.scala
> Here is the doc with proposed text:
> https://docs.google.com/document/d/1TpFx3AwcJZd3l7zB1ZDchvZ8j2dY6_uf5LHfW2gjE4A/edit?usp=sharing



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40429) Only set KeyGroupedPartitioning when the referenced column is in the output

2022-09-14 Thread Huaxin Gao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxin Gao updated SPARK-40429:
---
Description: 

{code:java}
  sql(s"CREATE TABLE $tbl (id bigint, data string) PARTITIONED BY (id)")
  sql(s"INSERT INTO $tbl VALUES (1, 'a'), (2, 'b'), (3, 'c')")
  checkAnswer(
spark.table(tbl).select("index", "_partition"),
Seq(Row(0, "3"), Row(0, "2"), Row(0, "1"))
  )
{code}

failed with 
ScalaTestFailureLocation: org.apache.spark.sql.QueryTest at 
(QueryTest.scala:226)
org.scalatest.exceptions.TestFailedException: AttributeSet(id#994L) was not 
empty The optimized logical plan has missing inputs:
RelationV2[index#998, _partition#999] testcat.t


> Only set KeyGroupedPartitioning when the referenced column is in the output
> ---
>
> Key: SPARK-40429
> URL: https://issues.apache.org/jira/browse/SPARK-40429
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Huaxin Gao
>Priority: Minor
>
> {code:java}
>   sql(s"CREATE TABLE $tbl (id bigint, data string) PARTITIONED BY (id)")
>   sql(s"INSERT INTO $tbl VALUES (1, 'a'), (2, 'b'), (3, 'c')")
>   checkAnswer(
> spark.table(tbl).select("index", "_partition"),
> Seq(Row(0, "3"), Row(0, "2"), Row(0, "1"))
>   )
> {code}
> failed with 
> ScalaTestFailureLocation: org.apache.spark.sql.QueryTest at 
> (QueryTest.scala:226)
> org.scalatest.exceptions.TestFailedException: AttributeSet(id#994L) was not 
> empty The optimized logical plan has missing inputs:
> RelationV2[index#998, _partition#999] testcat.t



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40430) Spark session does not update number of files for partition

2022-09-14 Thread Filipe Souza (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Filipe Souza updated SPARK-40430:
-
Attachment: session 2.png
session 1.png

> Spark session does not update number of files for partition
> ---
>
> Key: SPARK-40430
> URL: https://issues.apache.org/jira/browse/SPARK-40430
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.2
> Environment: I'm using spark 3.1.2 on AWS EMR and AWS Glue as catalog.
>Reporter: Filipe Souza
>Priority: Minor
> Attachments: session 1.png, session 2.png
>
>
> When a Spark session has already queried data from a table partition and new 
> files are then inserted into that partition externally, the session keeps 
> the outdated number of files and does not return the new records.
> If the data is inserted into a new partition, the problem does not occur.
> Steps to reproduce the behavior:
> 1. Open a Spark session.
> 2. Query a count on a table.
> 3. Open another Spark session.
> 4. Insert data into an existing partition.
> 5. Check the count again in the first session.
> I expect to see the inserted records.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40430) Spark session does not update number of files for partition

2022-09-14 Thread Filipe Souza (Jira)
Filipe Souza created SPARK-40430:


 Summary: Spark session does not update number of files for 
partition
 Key: SPARK-40430
 URL: https://issues.apache.org/jira/browse/SPARK-40430
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.1.2
 Environment: I'm using spark 3.1.2 on AWS EMR and AWS Glue as catalog.
Reporter: Filipe Souza


When a Spark session has already queried data from a table partition and new 
files are then inserted into that partition externally, the session keeps the 
outdated number of files and does not return the new records.
If the data is inserted into a new partition, the problem does not occur.

Steps to reproduce the behavior:

1. Open a Spark session.
2. Query a count on a table.
3. Open another Spark session.
4. Insert data into an existing partition.
5. Check the count again in the first session.

I expect to see the inserted records.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40429) Only set KeyGroupedPartitioning when the referenced column is in the output

2022-09-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40429:


Assignee: (was: Apache Spark)

> Only set KeyGroupedPartitioning when the referenced column is in the output
> ---
>
> Key: SPARK-40429
> URL: https://issues.apache.org/jira/browse/SPARK-40429
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Huaxin Gao
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40429) Only set KeyGroupedPartitioning when the referenced column is in the output

2022-09-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17604961#comment-17604961
 ] 

Apache Spark commented on SPARK-40429:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/37886

> Only set KeyGroupedPartitioning when the referenced column is in the output
> ---
>
> Key: SPARK-40429
> URL: https://issues.apache.org/jira/browse/SPARK-40429
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Huaxin Gao
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40429) Only set KeyGroupedPartitioning when the referenced column is in the output

2022-09-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40429:


Assignee: Apache Spark

> Only set KeyGroupedPartitioning when the referenced column is in the output
> ---
>
> Key: SPARK-40429
> URL: https://issues.apache.org/jira/browse/SPARK-40429
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.4.0
>Reporter: Huaxin Gao
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40429) Only set KeyGroupedPartitioning when the referenced column is in the output

2022-09-14 Thread Huaxin Gao (Jira)
Huaxin Gao created SPARK-40429:
--

 Summary: Only set KeyGroupedPartitioning when the referenced 
column is in the output
 Key: SPARK-40429
 URL: https://issues.apache.org/jira/browse/SPARK-40429
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.0, 3.4.0
Reporter: Huaxin Gao






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40428) Add a shutdownhook to CoarseGrained scheduler to avoid dangling resources during abnormal shutdown

2022-09-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17604917#comment-17604917
 ] 

Apache Spark commented on SPARK-40428:
--

User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/37885

> Add a shutdownhook to CoarseGrained scheduler to avoid dangling resources 
> during abnormal shutdown
> --
>
> Key: SPARK-40428
> URL: https://issues.apache.org/jira/browse/SPARK-40428
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.4.0
>Reporter: Holden Karau
>Priority: Minor
>
> Add a shutdown hook in the CoarseGrainedSchedulerBackend to call stop, since 
> we've got zombie pods hanging around because the resource tie isn't perfect.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40428) Add a shutdownhook to CoarseGrained scheduler to avoid dangling resources during abnormal shutdown

2022-09-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17604916#comment-17604916
 ] 

Apache Spark commented on SPARK-40428:
--

User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/37885

> Add a shutdownhook to CoarseGrained scheduler to avoid dangling resources 
> during abnormal shutdown
> --
>
> Key: SPARK-40428
> URL: https://issues.apache.org/jira/browse/SPARK-40428
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.4.0
>Reporter: Holden Karau
>Priority: Minor
>
> Add a shutdown hook in the CoarseGrainedSchedulerBackend to call stop, since 
> we've got zombie pods hanging around because the resource tie isn't perfect.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40428) Add a shutdownhook to CoarseGrained scheduler to avoid dangling resources during abnormal shutdown

2022-09-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40428:


Assignee: Apache Spark

> Add a shutdownhook to CoarseGrained scheduler to avoid dangling resources 
> during abnormal shutdown
> --
>
> Key: SPARK-40428
> URL: https://issues.apache.org/jira/browse/SPARK-40428
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.4.0
>Reporter: Holden Karau
>Assignee: Apache Spark
>Priority: Minor
>
> Add a shutdown hook in the CoarseGrainedSchedulerBackend to call stop, since 
> we've got zombie pods hanging around because the resource tie isn't perfect.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40428) Add a shutdownhook to CoarseGrained scheduler to avoid dangling resources during abnormal shutdown

2022-09-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40428:


Assignee: (was: Apache Spark)

> Add a shutdownhook to CoarseGrained scheduler to avoid dangling resources 
> during abnormal shutdown
> --
>
> Key: SPARK-40428
> URL: https://issues.apache.org/jira/browse/SPARK-40428
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.4.0
>Reporter: Holden Karau
>Priority: Minor
>
> Add a shutdown hook in the CoarseGrainedSchedulerBackend to call stop, since 
> we've got zombie pods hanging around because the resource tie isn't perfect.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40428) Add a shutdownhook to CoarseGrained scheduler to avoid dangling resources during abnormal shutdown

2022-09-14 Thread Holden Karau (Jira)
Holden Karau created SPARK-40428:


 Summary: Add a shutdownhook to CoarseGrained scheduler to avoid 
dangling resources during abnormal shutdown
 Key: SPARK-40428
 URL: https://issues.apache.org/jira/browse/SPARK-40428
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes, Spark Core
Affects Versions: 3.4.0
Reporter: Holden Karau


Add a shutdown hook in the CoarseGrainedSchedulerBackend to call stop, since 
we've got zombie pods hanging around because the resource tie isn't perfect.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40427) Add error classes for LIMIT/OFFSET CheckAnalysis failures

2022-09-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40427:


Assignee: Apache Spark

> Add error classes for LIMIT/OFFSET CheckAnalysis failures
> -
>
> Key: SPARK-40427
> URL: https://issues.apache.org/jira/browse/SPARK-40427
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Daniel
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40427) Add error classes for LIMIT/OFFSET CheckAnalysis failures

2022-09-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40427:


Assignee: (was: Apache Spark)

> Add error classes for LIMIT/OFFSET CheckAnalysis failures
> -
>
> Key: SPARK-40427
> URL: https://issues.apache.org/jira/browse/SPARK-40427
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Daniel
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40427) Add error classes for LIMIT/OFFSET CheckAnalysis failures

2022-09-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17604883#comment-17604883
 ] 

Apache Spark commented on SPARK-40427:
--

User 'dtenedor' has created a pull request for this issue:
https://github.com/apache/spark/pull/37884

> Add error classes for LIMIT/OFFSET CheckAnalysis failures
> -
>
> Key: SPARK-40427
> URL: https://issues.apache.org/jira/browse/SPARK-40427
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Daniel
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40427) Add error classes for LIMIT/OFFSET CheckAnalysis failures

2022-09-14 Thread Daniel (Jira)
Daniel created SPARK-40427:
--

 Summary: Add error classes for LIMIT/OFFSET CheckAnalysis failures
 Key: SPARK-40427
 URL: https://issues.apache.org/jira/browse/SPARK-40427
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Daniel






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38017) Fix the API doc for window to say it supports TimestampNTZType too as timeColumn

2022-09-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17604859#comment-17604859
 ] 

Apache Spark commented on SPARK-38017:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/37883

> Fix the API doc for window to say it supports TimestampNTZType too as 
> timeColumn
> 
>
> Key: SPARK-38017
> URL: https://issues.apache.org/jira/browse/SPARK-38017
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SQL
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 3.3.0, 3.2.2
>
>
> The window function supports not only TimestampType but also 
> TimestampNTZType, but the API docs don't mention TimestampNTZType.
> This issue is similar to SPARK-38016, but this one affects 3.2.0 too, so I 
> separated the tickets.
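A sketch of the undocumented case, assuming a Spark build where
TimestampNTZType is enabled ('spark' is an active SparkSession; the data is
illustrative):

{code:python}
import datetime
from pyspark.sql.functions import window
from pyspark.sql.types import IntegerType, StructField, StructType, TimestampNTZType

schema = StructType([StructField("ts", TimestampNTZType()),
                     StructField("v", IntegerType())])
df = spark.createDataFrame(
    [(datetime.datetime(2022, 9, 14, 10, 1), 1),
     (datetime.datetime(2022, 9, 14, 10, 7), 2)], schema)

# window() also accepts the TimestampNTZType column as timeColumn:
df.groupBy(window("ts", "10 minutes")).count().show(truncate=False)
{code}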



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38017) Fix the API doc for window to say it supports TimestampNTZType too as timeColumn

2022-09-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17604858#comment-17604858
 ] 

Apache Spark commented on SPARK-38017:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/37882

> Fix the API doc for window to say it supports TimestampNTZType too as 
> timeColumn
> 
>
> Key: SPARK-38017
> URL: https://issues.apache.org/jira/browse/SPARK-38017
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SQL
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 3.3.0, 3.2.2
>
>
> The window function supports not only TimestampType but also 
> TimestampNTZType, but the API docs don't mention TimestampNTZType.
> This issue is similar to SPARK-38016, but this one affects 3.2.0 too, so I 
> separated the tickets.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38017) Fix the API doc for window to say it supports TimestampNTZType too as timeColumn

2022-09-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17604857#comment-17604857
 ] 

Apache Spark commented on SPARK-38017:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/37882

> Fix the API doc for window to say it supports TimestampNTZType too as 
> timeColumn
> 
>
> Key: SPARK-38017
> URL: https://issues.apache.org/jira/browse/SPARK-38017
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SQL
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 3.3.0, 3.2.2
>
>
> The window function supports not only TimestampType but also 
> TimestampNTZType, but the API docs don't mention TimestampNTZType.
> This issue is similar to SPARK-38016, but this one affects 3.2.0 too, so I 
> separated the tickets.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40169) Fix the issue with Parquet column index and predicate pushdown in Data source V1

2022-09-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40169:


Assignee: Apache Spark

> Fix the issue with Parquet column index and predicate pushdown in Data source 
> V1
> 
>
> Key: SPARK-40169
> URL: https://issues.apache.org/jira/browse/SPARK-40169
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0, 3.3.1, 3.2.3
>Reporter: Ivan Sadikov
>Assignee: Apache Spark
>Priority: Major
>
> This is a follow-up to SPARK-39833. In 
> [https://github.com/apache/spark/pull/37419], we disabled the column index 
> for Parquet due to correctness issues that we found when filtering data on a 
> partition column overlapping with the data schema.
>  
> This ticket is for a permanent and thorough fix of the issue and 
> re-enablement of the column index. See more details in the PR linked above.
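A simplified sketch of the overlapping-schema scenario described above; the
path is hypothetical, and the point is only that the filter column exists both
in the file data and in the directory partitioning:

{code:python}
# Write files that themselves contain 'day' into day=... directories, so the
# partition column overlaps with the data schema:
for d in ("2022-09-13", "2022-09-14"):
    spark.createDataFrame([(d, i) for i in range(10)], "day string, id int") \
        .write.mode("overwrite").parquet(f"/tmp/t/day={d}")

# Filtering on the overlapping column is where the column-index correctness
# issue was observed:
spark.read.parquet("/tmp/t").filter("day = '2022-09-14'").count()
{code}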



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40169) Fix the issue with Parquet column index and predicate pushdown in Data source V1

2022-09-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17604855#comment-17604855
 ] 

Apache Spark commented on SPARK-40169:
--

User 'sunchao' has created a pull request for this issue:
https://github.com/apache/spark/pull/37881

> Fix the issue with Parquet column index and predicate pushdown in Data source 
> V1
> 
>
> Key: SPARK-40169
> URL: https://issues.apache.org/jira/browse/SPARK-40169
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0, 3.3.1, 3.2.3
>Reporter: Ivan Sadikov
>Priority: Major
>
> This is a follow-up to SPARK-39833. In 
> [https://github.com/apache/spark/pull/37419], we disabled the column index 
> for Parquet due to correctness issues that we found when filtering data on a 
> partition column overlapping with the data schema.
>  
> This ticket is for a permanent and thorough fix of the issue and 
> re-enablement of the column index. See more details in the PR linked above.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40169) Fix the issue with Parquet column index and predicate pushdown in Data source V1

2022-09-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40169:


Assignee: (was: Apache Spark)

> Fix the issue with Parquet column index and predicate pushdown in Data source 
> V1
> 
>
> Key: SPARK-40169
> URL: https://issues.apache.org/jira/browse/SPARK-40169
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0, 3.3.1, 3.2.3
>Reporter: Ivan Sadikov
>Priority: Major
>
> This is a follow-up to SPARK-39833. In 
> [https://github.com/apache/spark/pull/37419], we disabled the column index 
> for Parquet due to correctness issues that we found when filtering data on a 
> partition column overlapping with the data schema.
>  
> This ticket is for a permanent and thorough fix of the issue and 
> re-enablement of the column index. See more details in the PR linked above.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40334) Implement `GroupBy.prod`.

2022-09-14 Thread Artsiom Yudovin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17604841#comment-17604841
 ] 

Artsiom Yudovin commented on SPARK-40334:
-

Got you, thank you so much!

> Implement `GroupBy.prod`.
> -
>
> Key: SPARK-40334
> URL: https://issues.apache.org/jira/browse/SPARK-40334
> Project: Spark
>  Issue Type: Sub-task
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>
> We should implement `GroupBy.prod` to increase pandas API coverage.
> pandas docs: 
> https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.prod.html
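
As background, a hedged sketch of the kind of primitive a grouped product needs on the Spark side (the actual pandas-on-Spark implementation may differ, and all names below are illustrative): for strictly positive values, prod(x) = exp(sum(log(x))); zero and sign handling are omitted here.
{code:java}
// Illustrative sketch only: expressing a grouped product with exp/sum/log,
// valid for strictly positive values.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, exp, log, sum}

val spark = SparkSession.builder().appName("groupby-prod-sketch").getOrCreate()
import spark.implicits._

val df = Seq(("a", 2.0), ("a", 3.0), ("b", 4.0)).toDF("key", "value")

df.groupBy(col("key"))
  .agg(exp(sum(log(col("value")))).alias("prod"))
  .show()
// key=a -> 6.0, key=b -> 4.0
{code}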



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-40423) Add explicit YuniKorn queue submission test coverage

2022-09-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-40423:
--
Fix Version/s: 3.3.2
   (was: 3.3.1)

> Add explicit YuniKorn queue submission test coverage
> 
>
> Key: SPARK-40423
> URL: https://issues.apache.org/jira/browse/SPARK-40423
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes, Tests
>Affects Versions: 3.3.1
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 3.4.0, 3.3.2
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-40423) Add explicit YuniKorn queue submission test coverage

2022-09-14 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-40423.
---
Fix Version/s: 3.3.1
   3.4.0
   Resolution: Fixed

Issue resolved by pull request 37877
[https://github.com/apache/spark/pull/37877]

> Add explicit YuniKorn queue submission test coverage
> 
>
> Key: SPARK-40423
> URL: https://issues.apache.org/jira/browse/SPARK-40423
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes, Tests
>Affects Versions: 3.3.1
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Minor
> Fix For: 3.3.1, 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40426) Return a map from SparkThrowable.getMessageParameters

2022-09-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40426:


Assignee: Apache Spark  (was: Max Gekk)

> Return a map from SparkThrowable.getMessageParameters
> -
>
> Key: SPARK-40426
> URL: https://issues.apache.org/jira/browse/SPARK-40426
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Change the interface SparkThrowable to return a map from 
> getMessageParameters().



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-40426) Return a map from SparkThrowable.getMessageParameters

2022-09-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-40426:


Assignee: Max Gekk  (was: Apache Spark)

> Return a map from SparkThrowable.getMessageParameters
> -
>
> Key: SPARK-40426
> URL: https://issues.apache.org/jira/browse/SPARK-40426
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Change the interface SparkThrowable to return a map from 
> getMessageParameters().



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40426) Return a map from SparkThrowable.getMessageParameters

2022-09-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17604818#comment-17604818
 ] 

Apache Spark commented on SPARK-40426:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/37871

> Return a map from SparkThrowable.getMessageParameters
> -
>
> Key: SPARK-40426
> URL: https://issues.apache.org/jira/browse/SPARK-40426
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Change the interface SparkThrowable to return a map from 
> getMessageParameters().



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-40426) Return a map from SparkThrowable.getMessageParameters

2022-09-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17604817#comment-17604817
 ] 

Apache Spark commented on SPARK-40426:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/37871

> Return a map from SparkThrowable.getMessageParameters
> -
>
> Key: SPARK-40426
> URL: https://issues.apache.org/jira/browse/SPARK-40426
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Change the interface SparkThrowable to return a map from 
> getMessageParameters().



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-40426) Return a map from SparkThrowable.getMessageParameters

2022-09-14 Thread Max Gekk (Jira)
Max Gekk created SPARK-40426:


 Summary: Return a map from SparkThrowable.getMessageParameters
 Key: SPARK-40426
 URL: https://issues.apache.org/jira/browse/SPARK-40426
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.4.0
Reporter: Max Gekk
Assignee: Max Gekk


Change the interface SparkThrowable to return a map from getMessageParameters().
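
A hypothetical before/after sketch of the proposed signature change (the real interface is {{org.apache.spark.SparkThrowable}}; the trait names and exact signatures here are illustrative, not taken from the ticket):
{code:java}
// Illustrative only: how the accessor's return type would move from
// positional to named parameters.
trait SparkThrowableBefore {
  def getErrorClass: String
  // positional message parameters, matched to placeholders by order
  def getMessageParameters: Array[String]
}

trait SparkThrowableAfter {
  def getErrorClass: String
  // named message parameters, keyed by placeholder name
  def getMessageParameters: java.util.Map[String, String]
}
{code}
A map makes the error-message templates self-describing: a renderer can substitute placeholders by name instead of relying on argument order.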



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-39399) proxy-user not working for Spark on k8s in cluster deploy mode

2022-09-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39399:


Assignee: (was: Apache Spark)

> proxy-user not working for Spark on k8s in cluster deploy mode
> --
>
> Key: SPARK-39399
> URL: https://issues.apache.org/jira/browse/SPARK-39399
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.2.0
>Reporter: Shrikant Prasad
>Priority: Major
>
> As part of https://issues.apache.org/jira/browse/SPARK-25355, proxy-user 
> support was added for Spark on K8s. However, that PR only added the 
> proxy-user argument to the spark-submit command; the actual authentication 
> through the proxy user does not work in cluster deploy mode.
> We get an AccessControlException when trying to access kerberized HDFS 
> through a proxy user. 
> Spark-Submit:
> $SPARK_HOME/bin/spark-submit \
> --master  \
> --deploy-mode cluster \
> --name with_proxy_user_di \
> --proxy-user  \
> --class org.apache.spark.examples.SparkPi \
> --conf spark.kubernetes.container.image= \
> --conf spark.kubernetes.driver.limit.cores=1 \
> --conf spark.executor.instances=1 \
> --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
> --conf spark.kubernetes.namespace= \
> --conf spark.kubernetes.kerberos.krb5.path=/etc/krb5.conf \
> --conf spark.eventLog.enabled=true \
> --conf spark.eventLog.dir=hdfs:///scaas/shs_logs \
> --conf spark.kubernetes.file.upload.path=hdfs:///tmp \
> --conf spark.kubernetes.container.image.pullPolicy=Always \
> $SPARK_HOME/examples/jars/spark-examples_2.12-3.2.0-1.jar 
> Driver Logs:
> {code:java}
> ++ id -u
> + myuid=185
> ++ id -g
> + mygid=0
> + set +e
> ++ getent passwd 185
> + uidentry=
> + set -e
> + '[' -z '' ']'
> + '[' -w /etc/passwd ']'
> + echo '185:x:185:0:anonymous uid:/opt/spark:/bin/false'
> + SPARK_CLASSPATH=':/opt/spark/jars/*'
> + env
> + grep SPARK_JAVA_OPT_
> + sort -t_ -k4 -n
> + sed 's/[^=]*=\(.*\)/\1/g'
> + readarray -t SPARK_EXECUTOR_JAVA_OPTS
> + '[' -n '' ']'
> + '[' -z ']'
> + '[' -z ']'
> + '[' -n '' ']'
> + '[' -z x ']'
> + SPARK_CLASSPATH='/opt/hadoop/conf::/opt/spark/jars/*'
> + '[' -z x ']'
> + SPARK_CLASSPATH='/opt/spark/conf:/opt/hadoop/conf::/opt/spark/jars/*'
> + case "$1" in
> + shift 1
> + CMD=("$SPARK_HOME/bin/spark-submit" --conf 
> "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client 
> "$@")
> + exec /usr/bin/tini -s -- /opt/spark/bin/spark-submit --conf 
> spark.driver.bindAddress= --deploy-mode client --proxy-user proxy_user 
> --properties-file /opt/spark/conf/spark.properties --class 
> org.apache.spark.examples.SparkPi spark-internal
> WARNING: An illegal reflective access operation has occurred
> WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform 
> (file:/opt/spark/jars/spark-unsafe_2.12-3.2.0-1.jar) to constructor 
> java.nio.DirectByteBuffer(long,int)
> WARNING: Please consider reporting this to the maintainers of 
> org.apache.spark.unsafe.Platform
> WARNING: Use --illegal-access=warn to enable warnings of further illegal 
> reflective access operations
> WARNING: All illegal access operations will be denied in a future release
> 22/04/26 08:54:38 DEBUG MutableMetricsFactory: field 
> org.apache.hadoop.metrics2.lib.MutableRate 
> org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginSuccess with 
> annotation @org.apache.hadoop.metrics2.annotation.Metric(about="", 
> sampleName="Ops", always=false, type=DEFAULT, value={"Rate of successful 
> kerberos logins and latency (milliseconds)"}, valueName="Time")
> 22/04/26 08:54:38 DEBUG MutableMetricsFactory: field 
> org.apache.hadoop.metrics2.lib.MutableRate 
> org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginFailure with 
> annotation @org.apache.hadoop.metrics2.annotation.Metric(about="", 
> sampleName="Ops", always=false, type=DEFAULT, value={"Rate of failed kerberos 
> logins and latency (milliseconds)"}, valueName="Time")
> 22/04/26 08:54:38 DEBUG MutableMetricsFactory: field 
> org.apache.hadoop.metrics2.lib.MutableRate 
> org.apache.hadoop.security.UserGroupInformation$UgiMetrics.getGroups with 
> annotation @org.apache.hadoop.metrics2.annotation.Metric(about="", 
> sampleName="Ops", always=false, type=DEFAULT, value={"GetGroups"}, 
> valueName="Time")
> 22/04/26 08:54:38 DEBUG MutableMetricsFactory: field private 
> org.apache.hadoop.metrics2.lib.MutableGaugeLong 
> org.apache.hadoop.security.UserGroupInformation$UgiMetrics.renewalFailuresTotal
>  with annotation @org.apache.hadoop.metrics2.annotation.Metric(about="", 
> sampleName="Ops", always=false, type=DEFAULT, value={"Renewal failures since 
> startup"}, valueName="Time")
> 22/04/26 08:54:38 DEBUG MutableMetricsFactory: field private 
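
For reference, a minimal sketch, not the actual Spark fix, of how Hadoop proxy-user impersonation is normally applied: the kerberos-authenticated real user creates a proxy UGI, and the privileged HDFS call runs inside doAs. The report suggests this wrapping is effectively not applied on the driver in k8s cluster mode. The user name and path below are illustrative:
{code:java}
import java.security.PrivilegedExceptionAction
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}
import org.apache.hadoop.security.UserGroupInformation

// The real (kerberos-authenticated) login identity.
val realUser  = UserGroupInformation.getLoginUser
// Impersonated identity; HDFS must be configured to allow the real user
// to proxy for it.
val proxyUser = UserGroupInformation.createProxyUser("proxy_user", realUser)

// Any filesystem access on behalf of the proxy user must run inside doAs.
val listing: Array[FileStatus] =
  proxyUser.doAs(new PrivilegedExceptionAction[Array[FileStatus]] {
    override def run(): Array[FileStatus] =
      FileSystem.get(new Configuration()).listStatus(new Path("/tmp"))
  })
{code}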

[jira] [Assigned] (SPARK-39399) proxy-user not working for Spark on k8s in cluster deploy mode

2022-09-14 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-39399:


Assignee: Apache Spark

> proxy-user not working for Spark on k8s in cluster deploy mode
> --
>
> Key: SPARK-39399
> URL: https://issues.apache.org/jira/browse/SPARK-39399
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.2.0
>Reporter: Shrikant Prasad
>Assignee: Apache Spark
>Priority: Major
>
> As part of https://issues.apache.org/jira/browse/SPARK-25355, proxy-user 
> support was added for Spark on K8s. However, that PR only added the 
> proxy-user argument to the spark-submit command; the actual authentication 
> through the proxy user does not work in cluster deploy mode.
> We get an AccessControlException when trying to access kerberized HDFS 
> through a proxy user. 
> Spark-Submit:
> $SPARK_HOME/bin/spark-submit \
> --master  \
> --deploy-mode cluster \
> --name with_proxy_user_di \
> --proxy-user  \
> --class org.apache.spark.examples.SparkPi \
> --conf spark.kubernetes.container.image= \
> --conf spark.kubernetes.driver.limit.cores=1 \
> --conf spark.executor.instances=1 \
> --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
> --conf spark.kubernetes.namespace= \
> --conf spark.kubernetes.kerberos.krb5.path=/etc/krb5.conf \
> --conf spark.eventLog.enabled=true \
> --conf spark.eventLog.dir=hdfs:///scaas/shs_logs \
> --conf spark.kubernetes.file.upload.path=hdfs:///tmp \
> --conf spark.kubernetes.container.image.pullPolicy=Always \
> $SPARK_HOME/examples/jars/spark-examples_2.12-3.2.0-1.jar 
> Driver Logs:
> {code:java}
> ++ id -u
> + myuid=185
> ++ id -g
> + mygid=0
> + set +e
> ++ getent passwd 185
> + uidentry=
> + set -e
> + '[' -z '' ']'
> + '[' -w /etc/passwd ']'
> + echo '185:x:185:0:anonymous uid:/opt/spark:/bin/false'
> + SPARK_CLASSPATH=':/opt/spark/jars/*'
> + env
> + grep SPARK_JAVA_OPT_
> + sort -t_ -k4 -n
> + sed 's/[^=]*=\(.*\)/\1/g'
> + readarray -t SPARK_EXECUTOR_JAVA_OPTS
> + '[' -n '' ']'
> + '[' -z ']'
> + '[' -z ']'
> + '[' -n '' ']'
> + '[' -z x ']'
> + SPARK_CLASSPATH='/opt/hadoop/conf::/opt/spark/jars/*'
> + '[' -z x ']'
> + SPARK_CLASSPATH='/opt/spark/conf:/opt/hadoop/conf::/opt/spark/jars/*'
> + case "$1" in
> + shift 1
> + CMD=("$SPARK_HOME/bin/spark-submit" --conf 
> "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client 
> "$@")
> + exec /usr/bin/tini -s -- /opt/spark/bin/spark-submit --conf 
> spark.driver.bindAddress= --deploy-mode client --proxy-user proxy_user 
> --properties-file /opt/spark/conf/spark.properties --class 
> org.apache.spark.examples.SparkPi spark-internal
> WARNING: An illegal reflective access operation has occurred
> WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform 
> (file:/opt/spark/jars/spark-unsafe_2.12-3.2.0-1.jar) to constructor 
> java.nio.DirectByteBuffer(long,int)
> WARNING: Please consider reporting this to the maintainers of 
> org.apache.spark.unsafe.Platform
> WARNING: Use --illegal-access=warn to enable warnings of further illegal 
> reflective access operations
> WARNING: All illegal access operations will be denied in a future release
> 22/04/26 08:54:38 DEBUG MutableMetricsFactory: field 
> org.apache.hadoop.metrics2.lib.MutableRate 
> org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginSuccess with 
> annotation @org.apache.hadoop.metrics2.annotation.Metric(about="", 
> sampleName="Ops", always=false, type=DEFAULT, value={"Rate of successful 
> kerberos logins and latency (milliseconds)"}, valueName="Time")
> 22/04/26 08:54:38 DEBUG MutableMetricsFactory: field 
> org.apache.hadoop.metrics2.lib.MutableRate 
> org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginFailure with 
> annotation @org.apache.hadoop.metrics2.annotation.Metric(about="", 
> sampleName="Ops", always=false, type=DEFAULT, value={"Rate of failed kerberos 
> logins and latency (milliseconds)"}, valueName="Time")
> 22/04/26 08:54:38 DEBUG MutableMetricsFactory: field 
> org.apache.hadoop.metrics2.lib.MutableRate 
> org.apache.hadoop.security.UserGroupInformation$UgiMetrics.getGroups with 
> annotation @org.apache.hadoop.metrics2.annotation.Metric(about="", 
> sampleName="Ops", always=false, type=DEFAULT, value={"GetGroups"}, 
> valueName="Time")
> 22/04/26 08:54:38 DEBUG MutableMetricsFactory: field private 
> org.apache.hadoop.metrics2.lib.MutableGaugeLong 
> org.apache.hadoop.security.UserGroupInformation$UgiMetrics.renewalFailuresTotal
>  with annotation @org.apache.hadoop.metrics2.annotation.Metric(about="", 
> sampleName="Ops", always=false, type=DEFAULT, value={"Renewal failures since 
> startup"}, valueName="Time")
> 22/04/26 08:54:38 DEBUG 

[jira] [Commented] (SPARK-39399) proxy-user not working for Spark on k8s in cluster deploy mode

2022-09-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-39399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17604810#comment-17604810
 ] 

Apache Spark commented on SPARK-39399:
--

User 'shrprasa' has created a pull request for this issue:
https://github.com/apache/spark/pull/37880

> proxy-user not working for Spark on k8s in cluster deploy mode
> --
>
> Key: SPARK-39399
> URL: https://issues.apache.org/jira/browse/SPARK-39399
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.2.0
>Reporter: Shrikant Prasad
>Priority: Major
>
> As part of https://issues.apache.org/jira/browse/SPARK-25355, proxy-user 
> support was added for Spark on K8s. However, that PR only added the 
> proxy-user argument to the spark-submit command; the actual authentication 
> through the proxy user does not work in cluster deploy mode.
> We get an AccessControlException when trying to access kerberized HDFS 
> through a proxy user. 
> Spark-Submit:
> $SPARK_HOME/bin/spark-submit \
> --master  \
> --deploy-mode cluster \
> --name with_proxy_user_di \
> --proxy-user  \
> --class org.apache.spark.examples.SparkPi \
> --conf spark.kubernetes.container.image= \
> --conf spark.kubernetes.driver.limit.cores=1 \
> --conf spark.executor.instances=1 \
> --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
> --conf spark.kubernetes.namespace= \
> --conf spark.kubernetes.kerberos.krb5.path=/etc/krb5.conf \
> --conf spark.eventLog.enabled=true \
> --conf spark.eventLog.dir=hdfs:///scaas/shs_logs \
> --conf spark.kubernetes.file.upload.path=hdfs:///tmp \
> --conf spark.kubernetes.container.image.pullPolicy=Always \
> $SPARK_HOME/examples/jars/spark-examples_2.12-3.2.0-1.jar 
> Driver Logs:
> {code:java}
> ++ id -u
> + myuid=185
> ++ id -g
> + mygid=0
> + set +e
> ++ getent passwd 185
> + uidentry=
> + set -e
> + '[' -z '' ']'
> + '[' -w /etc/passwd ']'
> + echo '185:x:185:0:anonymous uid:/opt/spark:/bin/false'
> + SPARK_CLASSPATH=':/opt/spark/jars/*'
> + env
> + grep SPARK_JAVA_OPT_
> + sort -t_ -k4 -n
> + sed 's/[^=]*=\(.*\)/\1/g'
> + readarray -t SPARK_EXECUTOR_JAVA_OPTS
> + '[' -n '' ']'
> + '[' -z ']'
> + '[' -z ']'
> + '[' -n '' ']'
> + '[' -z x ']'
> + SPARK_CLASSPATH='/opt/hadoop/conf::/opt/spark/jars/*'
> + '[' -z x ']'
> + SPARK_CLASSPATH='/opt/spark/conf:/opt/hadoop/conf::/opt/spark/jars/*'
> + case "$1" in
> + shift 1
> + CMD=("$SPARK_HOME/bin/spark-submit" --conf 
> "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client 
> "$@")
> + exec /usr/bin/tini -s -- /opt/spark/bin/spark-submit --conf 
> spark.driver.bindAddress= --deploy-mode client --proxy-user proxy_user 
> --properties-file /opt/spark/conf/spark.properties --class 
> org.apache.spark.examples.SparkPi spark-internal
> WARNING: An illegal reflective access operation has occurred
> WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform 
> (file:/opt/spark/jars/spark-unsafe_2.12-3.2.0-1.jar) to constructor 
> java.nio.DirectByteBuffer(long,int)
> WARNING: Please consider reporting this to the maintainers of 
> org.apache.spark.unsafe.Platform
> WARNING: Use --illegal-access=warn to enable warnings of further illegal 
> reflective access operations
> WARNING: All illegal access operations will be denied in a future release
> 22/04/26 08:54:38 DEBUG MutableMetricsFactory: field 
> org.apache.hadoop.metrics2.lib.MutableRate 
> org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginSuccess with 
> annotation @org.apache.hadoop.metrics2.annotation.Metric(about="", 
> sampleName="Ops", always=false, type=DEFAULT, value={"Rate of successful 
> kerberos logins and latency (milliseconds)"}, valueName="Time")
> 22/04/26 08:54:38 DEBUG MutableMetricsFactory: field 
> org.apache.hadoop.metrics2.lib.MutableRate 
> org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginFailure with 
> annotation @org.apache.hadoop.metrics2.annotation.Metric(about="", 
> sampleName="Ops", always=false, type=DEFAULT, value={"Rate of failed kerberos 
> logins and latency (milliseconds)"}, valueName="Time")
> 22/04/26 08:54:38 DEBUG MutableMetricsFactory: field 
> org.apache.hadoop.metrics2.lib.MutableRate 
> org.apache.hadoop.security.UserGroupInformation$UgiMetrics.getGroups with 
> annotation @org.apache.hadoop.metrics2.annotation.Metric(about="", 
> sampleName="Ops", always=false, type=DEFAULT, value={"GetGroups"}, 
> valueName="Time")
> 22/04/26 08:54:38 DEBUG MutableMetricsFactory: field private 
> org.apache.hadoop.metrics2.lib.MutableGaugeLong 
> org.apache.hadoop.security.UserGroupInformation$UgiMetrics.renewalFailuresTotal
>  with annotation @org.apache.hadoop.metrics2.annotation.Metric(about="", 
> sampleName="Ops", always=false, type=DEFAULT, value={"Renewal failures 

[jira] [Commented] (SPARK-40425) DROP TABLE does not need to do table lookup

2022-09-14 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17604777#comment-17604777
 ] 

Apache Spark commented on SPARK-40425:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/37879

> DROP TABLE does not need to do table lookup
> ---
>
> Key: SPARK-40425
> URL: https://issues.apache.org/jira/browse/SPARK-40425
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


