[jira] [Assigned] (SPARK-37938) Use error classes in the parsing errors of partitions
[ https://issues.apache.org/jira/browse/SPARK-37938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-37938:
------------------------------------

    Assignee: (was: Apache Spark)

> Use error classes in the parsing errors of partitions
> -----------------------------------------------------
>
>                 Key: SPARK-37938
>                 URL: https://issues.apache.org/jira/browse/SPARK-37938
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Max Gekk
>            Priority: Major
>
> Migrate the following errors in QueryParsingErrors:
> * emptyPartitionKeyError
> * partitionTransformNotExpectedError
> * descColumnForPartitionUnsupportedError
> * incompletePartitionSpecificationError
> to use error classes. Throw an implementation of SparkThrowable, and write
> a test for every error in QueryParsingErrorsSuite.

--
This message was sent by Atlassian Jira
(v8.20.7#820007)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
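SPARK-37938 asks throw sites to carry a stable error class instead of an ad-hoc message string. A rough, plain-Python illustration of that pattern follows; the registry contents, class names, and the `SparkThrowableLike` type are invented for illustration — Spark's real registry, SparkThrowable interface, and QueryParsingErrors live in the Spark codebase.

```python
# Illustrative sketch of the "error class" pattern: the exception carries a
# stable error-class name plus message parameters, and the human-readable
# text is looked up in one central registry (class names here are invented).

ERROR_CLASSES = {
    "EMPTY_PARTITION_KEY": "Partition key {key} must set value",
    "INCOMPLETE_PARTITION_SPEC": "Partition spec is incomplete: {spec}",
}

class SparkThrowableLike(Exception):
    """Mimics a throwable that exposes its error class and parameters."""
    def __init__(self, error_class, **params):
        self.error_class = error_class
        self.params = params
        super().__init__(ERROR_CLASSES[error_class].format(**params))

def empty_partition_key_error(key):
    """Factory in the style of a QueryParsingErrors helper (hypothetical)."""
    return SparkThrowableLike("EMPTY_PARTITION_KEY", key=key)

err = empty_partition_key_error("ds")
print(err.error_class)  # EMPTY_PARTITION_KEY
print(err)              # Partition key ds must set value
```

A per-error test can then assert on the stable error class rather than on brittle message text, which is the kind of check a suite like QueryParsingErrorsSuite can make.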
[jira] [Assigned] (SPARK-37938) Use error classes in the parsing errors of partitions
[ https://issues.apache.org/jira/browse/SPARK-37938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-37938:
------------------------------------

    Assignee: Apache Spark
[jira] [Commented] (SPARK-37938) Use error classes in the parsing errors of partitions
[ https://issues.apache.org/jira/browse/SPARK-37938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530322#comment-17530322 ]

Apache Spark commented on SPARK-37938:
--------------------------------------

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/36416
[jira] [Resolved] (SPARK-39048) Refactor `GroupBy._reduce_for_stat_function` on accepted data types
[ https://issues.apache.org/jira/browse/SPARK-39048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takuya Ueshin resolved SPARK-39048.
-----------------------------------
    Assignee: Xinrong Meng
  Resolution: Fixed

Issue resolved by pull request 36382
https://github.com/apache/spark/pull/36382

> Refactor `GroupBy._reduce_for_stat_function` on accepted data types
> -------------------------------------------------------------------
>
>                 Key: SPARK-39048
>                 URL: https://issues.apache.org/jira/browse/SPARK-39048
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 3.4.0
>            Reporter: Xinrong Meng
>            Assignee: Xinrong Meng
>            Priority: Major
>
> `GroupBy._reduce_for_stat_function` is a common helper function leveraged by
> multiple statistical functions of GroupBy objects.
> It defines the parameters `only_numeric` and `bool_as_numeric` to control the
> accepted Spark types.
> To stay consistent with the pandas API, we may also have to introduce
> `str_as_numeric` for `sum`, for example.
> Instead of introducing a parameter designated for each Spark type, the PR
> proposes a parameter `accepted_spark_types` that specifies the accepted types
> of the Spark columns to be aggregated.
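The `accepted_spark_types` idea from the ticket can be sketched in plain Python. This is a stand-in, not the real helper: the function name is borrowed from the ticket, but the actual implementation operates on Spark column types inside pandas-on-Spark's GroupBy internals.

```python
# Sketch of the refactor: instead of one boolean flag per type
# (only_numeric, bool_as_numeric, ...), a single `accepted_spark_types`
# parameter lists which column types an aggregation accepts. Plain Python
# types stand in for Spark data types here.

def reduce_for_stat_function(columns, sfun, accepted_spark_types=None):
    """Apply `sfun` to each column whose element type is accepted.

    columns: dict mapping column name -> list of values (uniform type)
    sfun: the per-column reduction, e.g. sum or min
    accepted_spark_types: tuple of accepted element types, or None for "any"
    """
    result = {}
    for name, values in columns.items():
        if (accepted_spark_types is not None and values
                and not isinstance(values[0], accepted_spark_types)):
            continue  # skip columns whose type the aggregation rejects
        result[name] = sfun(values)
    return result

data = {"age": [30, 40, 50], "name": ["a", "b", "c"]}
# `sum` accepts only numeric columns, so the string column is skipped:
print(reduce_for_stat_function(data, sum, accepted_spark_types=(int, float)))
```

The design point the ticket makes is that new accepted types (e.g. strings for `sum`) then need only a different tuple, not a new keyword parameter.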
[jira] [Assigned] (SPARK-39078) Support UPDATE commands with DEFAULT values
[ https://issues.apache.org/jira/browse/SPARK-39078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-39078:
------------------------------------

    Assignee: Apache Spark

> Support UPDATE commands with DEFAULT values
> -------------------------------------------
>
>                 Key: SPARK-39078
>                 URL: https://issues.apache.org/jira/browse/SPARK-39078
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: Daniel
>            Assignee: Apache Spark
>            Priority: Major
[jira] [Commented] (SPARK-39078) Support UPDATE commands with DEFAULT values
[ https://issues.apache.org/jira/browse/SPARK-39078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530276#comment-17530276 ]

Apache Spark commented on SPARK-39078:
--------------------------------------

User 'dtenedor' has created a pull request for this issue:
https://github.com/apache/spark/pull/36415
[jira] [Assigned] (SPARK-39078) Support UPDATE commands with DEFAULT values
[ https://issues.apache.org/jira/browse/SPARK-39078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-39078:
------------------------------------

    Assignee: (was: Apache Spark)
[jira] [Created] (SPARK-39078) Support UPDATE commands with DEFAULT values
Daniel created SPARK-39078:
---------------------------

             Summary: Support UPDATE commands with DEFAULT values
                 Key: SPARK-39078
                 URL: https://issues.apache.org/jira/browse/SPARK-39078
             Project: Spark
          Issue Type: Sub-task
          Components: SQL
    Affects Versions: 3.4.0
            Reporter: Daniel
[jira] [Assigned] (SPARK-39077) Implement `skipna` of basic statistical functions of DataFrame and Series
[ https://issues.apache.org/jira/browse/SPARK-39077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-39077:
------------------------------------

    Assignee: Apache Spark

> Implement `skipna` of basic statistical functions of DataFrame and Series
> --------------------------------------------------------------------------
>
>                 Key: SPARK-39077
>                 URL: https://issues.apache.org/jira/browse/SPARK-39077
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark
>    Affects Versions: 3.4.0
>            Reporter: Xinrong Meng
>            Assignee: Apache Spark
>            Priority: Major
>
> Implement `skipna` of basic statistical functions of DataFrame and Series
[jira] [Assigned] (SPARK-39077) Implement `skipna` of basic statistical functions of DataFrame and Series
[ https://issues.apache.org/jira/browse/SPARK-39077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-39077:
------------------------------------

    Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-39077) Implement `skipna` of basic statistical functions of DataFrame and Series
[ https://issues.apache.org/jira/browse/SPARK-39077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530275#comment-17530275 ]

Apache Spark commented on SPARK-39077:
--------------------------------------

User 'xinrong-databricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/36414
[jira] [Created] (SPARK-39077) Implement `skipna` of basic statistical functions of DataFrame and Series
Xinrong Meng created SPARK-39077:
--------------------------------

             Summary: Implement `skipna` of basic statistical functions of DataFrame and Series
                 Key: SPARK-39077
                 URL: https://issues.apache.org/jira/browse/SPARK-39077
             Project: Spark
          Issue Type: Sub-task
          Components: PySpark
    Affects Versions: 3.4.0
            Reporter: Xinrong Meng

Implement `skipna` of basic statistical functions of DataFrame and Series
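For context, the pandas-style `skipna` semantics requested by SPARK-39077 can be illustrated with a small self-contained sketch. This is plain Python over a list, not the Spark-backed implementation; `series_sum` is a hypothetical name used only for the example.

```python
# pandas-style `skipna` semantics: with skipna=True, missing values are
# excluded from the reduction; with skipna=False, a single missing value
# makes the whole result missing. None marks a missing value here.

def series_sum(values, skipna=True):
    """Sum a list of numbers where None marks a missing value."""
    if skipna:
        return sum(v for v in values if v is not None)
    if any(v is None for v in values):
        return None  # one missing value poisons the result
    return sum(values)

print(series_sum([1, 2, None, 4]))                # 7
print(series_sum([1, 2, None, 4], skipna=False))  # None
```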
[jira] [Created] (SPARK-39076) Standardize Statistical Functions of pandas API on Spark
Xinrong Meng created SPARK-39076:
--------------------------------

             Summary: Standardize Statistical Functions of pandas API on Spark
                 Key: SPARK-39076
                 URL: https://issues.apache.org/jira/browse/SPARK-39076
             Project: Spark
          Issue Type: Umbrella
          Components: PySpark
    Affects Versions: 3.4.0
            Reporter: Xinrong Meng

Statistical functions are among the most commonly used functions in data
engineering and data analysis. Spark and pandas provide statistical functions
in the contexts of SQL and data science, respectively.

pandas API on Spark implements the pandas API on top of Apache Spark. Although
certain functions (for example, median) may differ semantically because of the
high cost of big data computation, we should still try to reach parity at the
API level.

However, critical parameters of statistical functions, such as `skipna`, are
missing from the basic objects: DataFrame, Series, and Index. There is an even
larger gap between the statistical functions of pandas-on-Spark GroupBy
objects and those of pandas GroupBy objects. In addition, test coverage is far
from complete.

With statistical functions standardized, pandas API coverage will increase as
the missing parameters are implemented, which should further improve user
adoption.
[jira] [Updated] (SPARK-39012) SparkSQL parse partition value does not support all data types
[ https://issues.apache.org/jira/browse/SPARK-39012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rui Wang updated SPARK-39012:
-----------------------------
    Summary: SparkSQL parse partition value does not support all data types
             (was: SparkSQL infer schema does not support all data types)

> SparkSQL parse partition value does not support all data types
> --------------------------------------------------------------
>
>                 Key: SPARK-39012
>                 URL: https://issues.apache.org/jira/browse/SPARK-39012
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Rui Wang
>            Priority: Major
>
> When Spark needs to infer a schema, it must parse strings into typed values.
> Not all data types are supported on this path so far; for example, binary is
> known not to be supported. If a user has a binary column and does not use a
> metastore, SparkSQL can fall back to schema inference and then fail during
> the table scan. This is a bug, since schema inference is supported but some
> types are missing.
> A string might be converted to any type except ARRAY, MAP, STRUCT, etc. Also,
> when converting from a string, a smaller-scale type won't be identified if a
> larger-scale type matches first -- for example, short versus long.
> Based on the Spark SQL data types
> (https://spark.apache.org/docs/latest/sql-ref-datatypes.html), we can support
> the following types:
> BINARY
> BOOLEAN
> And there are two types that I am not sure SparkSQL supports on this path:
> YearMonthIntervalType
> DayTimeIntervalType
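The "parse a partition value string into a typed value" step the ticket discusses can be sketched as a chain of trial casts. This is illustrative only: the function name and the candidate order are assumptions for the example, not Spark's actual code path.

```python
# Candidate parsers are tried from narrowest to widest so that, e.g., "42"
# stays an integer rather than being promoted to a float; values no parser
# accepts fall through to plain strings. This mirrors the ordering concern
# in the ticket (a larger-scale type must not shadow a smaller-scale one).

def parse_partition_value(raw):
    """Try candidate types in order; return the first successful parse."""
    if raw.lower() in ("true", "false"):  # BOOLEAN support
        return raw.lower() == "true"
    for cast in (int, float):             # narrow before wide
        try:
            return cast(raw)
        except ValueError:
            pass
    return raw                            # fall back to string

print(parse_partition_value("true"))  # True
print(parse_partition_value("42"))    # 42
print(parse_partition_value("4.2"))   # 4.2
print(parse_partition_value("blob"))  # blob
```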
[jira] [Commented] (SPARK-39075) IncompatibleSchemaException when selecting data from table stored from a DataFrame in Avro format with BYTE/SHORT
[ https://issues.apache.org/jira/browse/SPARK-39075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530185#comment-17530185 ]

Erik Krogen commented on SPARK-39075:
-------------------------------------

I'm not sure of the history here, but this might be intentional. In AvroSerializer, going from a Catalyst {{ShortType}} to an Avro INT is an upcast, and therefore safe (a short will always fit inside of an int). But in AvroDeserializer, going from an Avro INT to a Catalyst {{ShortType}} is a downcast, so it's not a safe conversion. We can see that, similarly, we don't support double-to-float or long-to-int (though we also don't support float-to-double or int-to-long, for that matter).

I personally wouldn't have a concern with adding the int-to-short/byte conversion as long as overflow detection is handled gracefully.

cc [~cloud_fan] [~Gengliang.Wang]

> IncompatibleSchemaException when selecting data from table stored from a
> DataFrame in Avro format with BYTE/SHORT
> -------------------------------------------------------------------------
>
>                 Key: SPARK-39075
>                 URL: https://issues.apache.org/jira/browse/SPARK-39075
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.2.1
>            Reporter: xsys
>            Priority: Major
>
> h3. Describe the bug
> We are trying to save a table constructed through a DataFrame with the {{Avro}} data format. The table contains {{ByteType}} or {{ShortType}} as part of the schema.
> When we {{INSERT}} some valid values (e.g. {{-128}}) and {{SELECT}} from the table, we expect it to give back the inserted value. However, we instead get an {{IncompatibleSchemaException}} from the {{AvroDeserializer}}. This appears to be caused by a missing case statement handling the {{(INT, ShortType)}} and {{(INT, ByteType)}} cases in [{{AvroDeserializer newWriter}}|https://github.com/apache/spark/blob/4f25b3f71238a00508a356591553f2dfa89f8290/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala#L321].
> h3. To Reproduce
> On Spark 3.2.1 (commit {{4f25b3f712}}), using {{spark-shell}} with the Avro package:
> {code:java}
> ./bin/spark-shell --packages org.apache.spark:spark-avro_2.12:3.2.1{code}
> Execute the following:
> {code:java}
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types._
> val schema = new StructType().add(StructField("c1", ShortType, true))
> val rdd = sc.parallelize(Seq(Row("-128".toShort)))
> val df = spark.createDataFrame(rdd, schema)
> df.write.mode("overwrite").format("avro").saveAsTable("t0")
> spark.sql("select * from t0;").show(false){code}
> Resulting error:
> {code:java}
> 22/04/27 18:04:14 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 32)
> org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot convert Avro type {"type":"record","name":"topLevelRecord","fields":[{"name":"c1","type":["int","null"]}]} to SQL type STRUCT<`c1`: SMALLINT>.
>   at org.apache.spark.sql.avro.AvroDeserializer.liftedTree1$1(AvroDeserializer.scala:102)
>   at org.apache.spark.sql.avro.AvroDeserializer.<init>(AvroDeserializer.scala:74)
>   at org.apache.spark.sql.avro.AvroFileFormat$$anon$1.<init>(AvroFileFormat.scala:143)
>   at org.apache.spark.sql.avro.AvroFileFormat.$anonfun$buildReader$1(AvroFileFormat.scala:136)
>   at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:148)
>   at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:133)
>   at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
>   at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:187)
>   at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:104)
>   at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
>   at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
>   at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:349)
>   at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
>   at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
>   at org.apache.spark.sch
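The "overflow detection handled gracefully" requirement from the comment above can be sketched as a bounds check before narrowing. This is a plain-Python stand-in with an invented function name; the real change would be a Scala case added in AvroDeserializer's newWriter.

```python
# An Avro INT can only be narrowed to a Catalyst-style SHORT (or BYTE) if
# the value actually fits; instead of silently wrapping, the conversion
# raises on overflow.

SHORT_MIN, SHORT_MAX = -(2**15), 2**15 - 1

def int_to_short(value):
    """Narrow a 32-bit int to a 16-bit short, raising instead of wrapping."""
    if not (SHORT_MIN <= value <= SHORT_MAX):
        raise OverflowError(f"{value} does not fit in a 16-bit short")
    return value

print(int_to_short(-128))  # -128: fits, passes through unchanged
try:
    int_to_short(70000)    # would overflow a short
except OverflowError as e:
    print(e)
```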
[jira] [Updated] (SPARK-39075) IncompatibleSchemaException when selecting data from table stored from a DataFrame in Avro format with BYTE/SHORT
[ https://issues.apache.org/jira/browse/SPARK-39075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Erik Krogen updated SPARK-39075:
--------------------------------
[jira] [Updated] (SPARK-39075) IncompatibleSchemaException when selecting data from table stored from a DataFrame in Avro format with BYTE/SHORT
[ https://issues.apache.org/jira/browse/SPARK-39075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

xsys updated SPARK-39075:
-------------------------
[jira] [Updated] (SPARK-39075) IncompatibleSchemaException when selecting data from table stored from a DataFrame in Avro format with BYTE/SHORT
[ https://issues.apache.org/jira/browse/SPARK-39075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xsys updated SPARK-39075: - Description: h3. Describe the bug We are trying to save a table constructed through a DataFrame with the {{Avro}} data format. The table contains {{ByteType}} or {{ShortType}} as part of the schema. When we {{INSERT}} some valid values (e.g. {{{}-128{}}}) and {{SELECT}} from the table, we expect it to give back the inserted value. However, we instead get an {{IncompatibleSchemaException}} from the {{{}AvroDeserializer{}}}. This appears to be caused by a missing case statement handling the {{(INT, ShortType)}} and {{(INT, ByteType)}} cases in [{{AvroDeserializer newWriter}}|https://github.com/apache/spark/blob/4f25b3f71238a00508a356591553f2dfa89f8290/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala#L321]. h3. To Reproduce On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{spark-shell}} with the Avro package: {code:java} ./bin/spark-shell --packages org.apache.spark:spark-avro_2.12:3.2.1{code} Execute the following: {code:java} import org.apache.spark.sql.Row import org.apache.spark.sql.types._ val schema = new StructType().add(StructField("c1", ShortType, true)) val rdd = sc.parallelize(Seq(Row("-128".toShort))) val df = spark.createDataFrame(rdd, schema) df.write.mode("overwrite").format("avro").saveAsTable("t0") spark.sql("select * from t0;").show(false){code} Resulting error: {code:java} 22/04/27 18:04:14 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 32) org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot convert Avro type {"type":"record","name":"topLevelRecord","fields":[ {"name":"c1","type":["int","null"]} ]} to SQL type STRUCT<`c1`: SMALLINT>. 
at org.apache.spark.sql.avro.AvroDeserializer.liftedTree1$1(AvroDeserializer.scala:102) at org.apache.spark.sql.avro.AvroDeserializer.(AvroDeserializer.scala:74) at org.apache.spark.sql.avro.AvroFileFormat$$anon$1.(AvroFileFormat.scala:143) at org.apache.spark.sql.avro.AvroFileFormat.$anonfun$buildReader$1(AvroFileFormat.scala:136) at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:148) at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:133) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:187) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:104) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759) at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:349) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:131) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462) at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot convert Avro field 'c1' to SQL field 'c1' because schema is incompatible (avroType = "int", sqlType = SMALLINT) at org.apache.spark.sql.avro.AvroDeserializer.newWriter(AvroDeserializer.scala:321) at org.apache.spark.sql.avro.AvroDeserializer.getRecordWriter(AvroDeserializer.scala:356) at org.apache.spark.sql.avro.AvroDeserializer.liftedTree1$1(AvroDeserializer.scala:84) ... 26 more {code} h3. Expected behavior & Possible Solution We expect the {{SELECT}} to return {{-128}} successfully. Other formats, such as Parquet, behave consistently with this expectation. In the [{{AvroSerializer newConverter}}|https://github.com/apache/spark/blob/4f25b3f71238a00508a356591553f2dfa89f8290/external/avro/src/main/scala/org/apache/spark/sql/avro/A
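A sketch of a possible fix follows, mirroring the existing integer case in {{AvroDeserializer.newWriter}}. This is illustrative only: the exact shape of the surrounding match and the {{setByte}}/{{setShort}} updater method names are assumptions, not taken verbatim from the Spark source.

{code:java}
// Hypothetical additional cases for the (avroType.getType, catalystType)
// match in AvroDeserializer.newWriter. AvroSerializer writes ByteType and
// ShortType values as Avro INT, so the deserializer would narrow the Int
// back down on read instead of throwing IncompatibleSchemaException.
case (INT, ByteType) => (updater, ordinal, value) =>
  updater.setByte(ordinal, value.asInstanceOf[Int].toByte)

case (INT, ShortType) => (updater, ordinal, value) =>
  updater.setShort(ordinal, value.asInstanceOf[Int].toShort)
{code}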
[jira] [Created] (SPARK-39075) IncompatibleSchemaException when selecting data from table stored from a DataFrame in Avro format with BYTE/SHORT
xsys created SPARK-39075: Summary: IncompatibleSchemaException when selecting data from table stored from a DataFrame in Avro format with BYTE/SHORT Key: SPARK-39075 URL: https://issues.apache.org/jira/browse/SPARK-39075 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.1 Reporter: xsys h3. Describe the bug We are trying to save a table constructed through a DataFrame with the {{Avro}} data format. The table contains {{ByteType}} or {{ShortType}} as part of the schema. When we {{INSERT}} a valid value (e.g. {{-128}}) and {{SELECT}} from the table, we expect to get back the inserted value. However, we instead get an {{IncompatibleSchemaException}} from the {{AvroDeserializer}}. This appears to be caused by a missing case statement handling the {{(INT, ShortType)}} and {{(INT, ByteType)}} cases in [{{AvroDeserializer newWriter}}|https://github.com/apache/spark/blob/4f25b3f71238a00508a356591553f2dfa89f8290/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala#L321]. h3. To Reproduce The reproduction steps and full stack trace are identical to those in the updated description above.
[jira] [Commented] (SPARK-25355) Support --proxy-user for Spark on K8s
[ https://issues.apache.org/jira/browse/SPARK-25355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530083#comment-17530083 ] pralabhkumar commented on SPARK-25355: -- [~hyukjin.kwon] Could you please help us, or redirect us to someone who can, with the above two comments? > Support --proxy-user for Spark on K8s > - > > Key: SPARK-25355 > URL: https://issues.apache.org/jira/browse/SPARK-25355 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes, Spark Core >Affects Versions: 3.1.0 >Reporter: Stavros Kontopoulos >Assignee: Pedro Rossi >Priority: Major > Fix For: 3.1.0 > > > SPARK-23257 adds kerberized hdfs support for Spark on K8s. A major addition > needed is the support for proxy user. A proxy user is impersonated by a > superuser who executes operations on behalf of the proxy user. More on this: > [https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/Superusers.html] > [https://github.com/spark-notebook/spark-notebook/blob/master/docs/proxyuser_impersonation.md] > This has been implemented for Yarn upstream and Spark on Mesos here: > [https://github.com/mesosphere/spark/pull/26] > [~ifilonenko] creating this issue according to our discussion. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39074) Fail on uploading test files, not when downloading them
[ https://issues.apache.org/jira/browse/SPARK-39074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530066#comment-17530066 ] Apache Spark commented on SPARK-39074: -- User 'EnricoMi' has created a pull request for this issue: https://github.com/apache/spark/pull/36413 > Fail on uploading test files, not when downloading them > --- > > Key: SPARK-39074 > URL: https://issues.apache.org/jira/browse/SPARK-39074 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.4.0 >Reporter: Enrico Minack >Priority: Major > > The CI workflow "Report test results" fails when there are no artifacts to be > downloaded from the triggering workflow. In some situations, the triggering > workflow is not skipped, but all test jobs are skipped because no code > changes are detected. In that situation, no test files are uploaded, which makes the triggered > workflow fail. > There are two possible reasons for no test files being downloaded: > 1. No tests have been executed, or no test files have been generated. > 2. No code has been built and tested, deliberately. > You want to be notified in the first situation so that you can fix the CI. Therefore, CI > should fail when code is built and tests are run but no test result files have been > found. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39074) Fail on uploading test files, not when downloading them
[ https://issues.apache.org/jira/browse/SPARK-39074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39074: Assignee: Apache Spark > Fail on uploading test files, not when downloading them > --- > > Key: SPARK-39074 > URL: https://issues.apache.org/jira/browse/SPARK-39074 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.4.0 >Reporter: Enrico Minack >Assignee: Apache Spark >Priority: Major > > The CI workflow "Report test results" fails when there are no artifacts to be > downloaded from the triggering workflow. In some situations, the triggering > workflow is not skipped, but all test jobs are skipped because no code > changes are detected. In that situation, no test files are uploaded, which makes the triggered > workflow fail. > There are two possible reasons for no test files being downloaded: > 1. No tests have been executed, or no test files have been generated. > 2. No code has been built and tested, deliberately. > You want to be notified in the first situation so that you can fix the CI. Therefore, CI > should fail when code is built and tests are run but no test result files have been > found. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39074) Fail on uploading test files, not when downloading them
[ https://issues.apache.org/jira/browse/SPARK-39074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39074: Assignee: (was: Apache Spark) > Fail on uploading test files, not when downloading them > --- > > Key: SPARK-39074 > URL: https://issues.apache.org/jira/browse/SPARK-39074 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.4.0 >Reporter: Enrico Minack >Priority: Major > > The CI workflow "Report test results" fails when there are no artifacts to be > downloaded from the triggering workflow. In some situations, the triggering > workflow is not skipped, but all test jobs are skipped because no code > changes are detected. In that situation, no test files are uploaded, which makes the triggered > workflow fail. > There are two possible reasons for no test files being downloaded: > 1. No tests have been executed, or no test files have been generated. > 2. No code has been built and tested, deliberately. > You want to be notified in the first situation so that you can fix the CI. Therefore, CI > should fail when code is built and tests are run but no test result files have been > found. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39074) Fail on uploading test files, not when downloading them
Enrico Minack created SPARK-39074: - Summary: Fail on uploading test files, not when downloading them Key: SPARK-39074 URL: https://issues.apache.org/jira/browse/SPARK-39074 Project: Spark Issue Type: Improvement Components: Project Infra Affects Versions: 3.4.0 Reporter: Enrico Minack The CI workflow "Report test results" fails when there are no artifacts to be downloaded from the triggering workflow. In some situations, the triggering workflow is not skipped, but all test jobs are skipped because no code changes are detected. In that situation, no test files are uploaded, which makes the triggered workflow fail. There are two possible reasons for no test files being downloaded: 1. No tests have been executed, or no test files have been generated. 2. No code has been built and tested, deliberately. You want to be notified in the first situation so that you can fix the CI. Therefore, CI should fail when code is built and tests are run but no test result files have been found. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
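The proposed behavior can be summarized as a simple decision rule (an illustrative sketch only; the names are assumptions, not taken from the actual workflow code):

{code:java}
// Fail CI only in reason 1: tests actually ran but produced no result files.
// Reason 2 (a deliberate skip where nothing was built or tested) should pass.
def shouldFailCi(testsWereRun: Boolean, resultFilesFound: Boolean): Boolean =
  testsWereRun && !resultFilesFound
{code}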
[jira] [Commented] (SPARK-39073) Keep rowCount after hive table partition pruning if table only have hive statistics
[ https://issues.apache.org/jira/browse/SPARK-39073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530055#comment-17530055 ] Apache Spark commented on SPARK-39073: -- User 'qiuliang988' has created a pull request for this issue: https://github.com/apache/spark/pull/36412 > Keep rowCount after hive table partition pruning if table only have hive > statistics > --- > > Key: SPARK-39073 > URL: https://issues.apache.org/jira/browse/SPARK-39073 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.1 >Reporter: qiuliang >Priority: Major > > If a partitioned table only has Hive-generated statistics, stored in its > partition properties, HiveTableRelation cannot obtain a rowCount because the > table-level statistics do not exist. We can generate a rowCount from the > statistics of the partitions that remain after pruning. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39073) Keep rowCount after hive table partition pruning if table only have hive statistics
[ https://issues.apache.org/jira/browse/SPARK-39073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39073: Assignee: (was: Apache Spark) > Keep rowCount after hive table partition pruning if table only have hive > statistics > --- > > Key: SPARK-39073 > URL: https://issues.apache.org/jira/browse/SPARK-39073 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.1 >Reporter: qiuliang >Priority: Major > > If a partitioned table only has Hive-generated statistics, stored in its > partition properties, HiveTableRelation cannot obtain a rowCount because the > table-level statistics do not exist. We can generate a rowCount from the > statistics of the partitions that remain after pruning. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39073) Keep rowCount after hive table partition pruning if table only have hive statistics
[ https://issues.apache.org/jira/browse/SPARK-39073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39073: Assignee: Apache Spark > Keep rowCount after hive table partition pruning if table only have hive > statistics > --- > > Key: SPARK-39073 > URL: https://issues.apache.org/jira/browse/SPARK-39073 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.1 >Reporter: qiuliang >Assignee: Apache Spark >Priority: Major > > If a partitioned table only has Hive-generated statistics, stored in its > partition properties, HiveTableRelation cannot obtain a rowCount because the > table-level statistics do not exist. We can generate a rowCount from the > statistics of the partitions that remain after pruning. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39073) Keep rowCount after hive table partition pruning if table only have hive statistics
[ https://issues.apache.org/jira/browse/SPARK-39073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530054#comment-17530054 ] Apache Spark commented on SPARK-39073: -- User 'qiuliang988' has created a pull request for this issue: https://github.com/apache/spark/pull/36412 > Keep rowCount after hive table partition pruning if table only have hive statistics > --- > > Key: SPARK-39073 > URL: https://issues.apache.org/jira/browse/SPARK-39073 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.1 >Reporter: qiuliang >Priority: Major > > If a partitioned table only has Hive-generated statistics, those statistics are stored in the partition properties, so HiveTableRelation cannot obtain a rowCount because table-level statistics do not exist. We can generate the rowCount from the statistics of the pruned partitions. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-25355) Support --proxy-user for Spark on K8s
[ https://issues.apache.org/jira/browse/SPARK-25355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530046#comment-17530046 ] Shrikant edited comment on SPARK-25355 at 4/29/22 3:13 PM: --- [~pedro.rossi] [~dongjoon] When we use the proxy-user parameter, access to Kerberized HDFS is not working for Spark on Kubernetes. It's because the proxy user doesn't have access to any delegation tokens. Was this functionality tested for Spark on Kubernetes when proxy-user support was added as part of this task? was (Author: JIRAUSER280449): [~pedro.rossi] [~dongjoon] When we use the proxy-user parameter, access to Kerberized HDFS is not working for Spark on Kubernetes. It's because the proxy user doesn't have access to any delegation tokens. Was this functionality tested for Spark on Kubernetes when this bug was fixed? > Support --proxy-user for Spark on K8s > - > > Key: SPARK-25355 > URL: https://issues.apache.org/jira/browse/SPARK-25355 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes, Spark Core >Affects Versions: 3.1.0 >Reporter: Stavros Kontopoulos >Assignee: Pedro Rossi >Priority: Major > Fix For: 3.1.0 > > > SPARK-23257 adds kerberized hdfs support for Spark on K8s. A major addition > needed is the support for proxy user. A proxy user is impersonated by a > superuser who executes operations on behalf of the proxy user. More on this: > [https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/Superusers.html] > [https://github.com/spark-notebook/spark-notebook/blob/master/docs/proxyuser_impersonation.md] > This has been implemented for Yarn upstream and Spark on Mesos here: > [https://github.com/mesosphere/spark/pull/26] > [~ifilonenko] creating this issue according to our discussion. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25355) Support --proxy-user for Spark on K8s
[ https://issues.apache.org/jira/browse/SPARK-25355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530046#comment-17530046 ] Shrikant commented on SPARK-25355: -- [~pedro.rossi] [~dongjoon] When we use the proxy-user parameter, access to Kerberized HDFS is not working for Spark on Kubernetes. It's because the proxy user doesn't have access to any delegation tokens. Was this functionality tested for Spark on Kubernetes when this bug was fixed? > Support --proxy-user for Spark on K8s > - > > Key: SPARK-25355 > URL: https://issues.apache.org/jira/browse/SPARK-25355 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes, Spark Core >Affects Versions: 3.1.0 >Reporter: Stavros Kontopoulos >Assignee: Pedro Rossi >Priority: Major > Fix For: 3.1.0 > > > SPARK-23257 adds kerberized hdfs support for Spark on K8s. A major addition > needed is the support for proxy user. A proxy user is impersonated by a > superuser who executes operations on behalf of the proxy user. More on this: > [https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/Superusers.html] > [https://github.com/spark-notebook/spark-notebook/blob/master/docs/proxyuser_impersonation.md] > This has been implemented for Yarn upstream and Spark on Mesos here: > [https://github.com/mesosphere/spark/pull/26] > [~ifilonenko] creating this issue according to our discussion. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39073) Keep rowCount after hive table partition pruning if table only have hive statistics
[ https://issues.apache.org/jira/browse/SPARK-39073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] qiuliang updated SPARK-39073: - Summary: Keep rowCount after hive table partition pruning if table only have hive statistics (was: keep rowCount after hive table partition pruning if table only have hive statistics) > Keep rowCount after hive table partition pruning if table only have hive statistics > --- > > Key: SPARK-39073 > URL: https://issues.apache.org/jira/browse/SPARK-39073 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.1 >Reporter: qiuliang >Priority: Major > > If a partitioned table only has Hive-generated statistics, those statistics are stored in the partition properties, so HiveTableRelation cannot obtain a rowCount because table-level statistics do not exist. We can generate the rowCount from the statistics of the pruned partitions. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39073) keep rowCount after hive table partition pruning if table only have hive statistics
[ https://issues.apache.org/jira/browse/SPARK-39073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] qiuliang updated SPARK-39073: - Summary: keep rowCount after hive table partition pruning if table only have hive statistics (was: keep rowCount after hive table partition pruning if only have hive statistics) > keep rowCount after hive table partition pruning if table only have hive statistics > --- > > Key: SPARK-39073 > URL: https://issues.apache.org/jira/browse/SPARK-39073 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.1 >Reporter: qiuliang >Priority: Major > > If a partitioned table only has Hive-generated statistics, those statistics are stored in the partition properties, so HiveTableRelation cannot obtain a rowCount because table-level statistics do not exist. We can generate the rowCount from the statistics of the pruned partitions. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39073) keep rowCount after hive table partition pruning if only have hive statistics
[ https://issues.apache.org/jira/browse/SPARK-39073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] qiuliang updated SPARK-39073: - Issue Type: Improvement (was: Bug) > keep rowCount after hive table partition pruning if only have hive statistics > - > > Key: SPARK-39073 > URL: https://issues.apache.org/jira/browse/SPARK-39073 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.1 >Reporter: qiuliang >Priority: Major > > If a partitioned table only has Hive-generated statistics, those statistics are stored in the partition properties, so HiveTableRelation cannot obtain a rowCount because table-level statistics do not exist. We can generate the rowCount from the statistics of the pruned partitions. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39073) keep rowCount after hive table partition pruning if only have hive statistics
qiuliang created SPARK-39073: Summary: keep rowCount after hive table partition pruning if only have hive statistics Key: SPARK-39073 URL: https://issues.apache.org/jira/browse/SPARK-39073 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.1 Reporter: qiuliang If a partitioned table only has Hive-generated statistics, those statistics are stored in the partition properties, so HiveTableRelation cannot obtain a rowCount because table-level statistics do not exist. We can generate the rowCount from the statistics of the pruned partitions. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
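The fix the ticket proposes, deriving a table-level rowCount from the per-partition statistics that survive pruning, can be sketched in plain Python. The dict-based partition shape and the "numRows" key below are illustrative stand-ins for Hive's partition-property statistics, not Spark's actual CatalogTablePartition API:

```python
def row_count_after_pruning(pruned_partitions):
    """Sum per-partition Hive row counts into a table-level rowCount.

    Each element models one partition's properties map, where Hive stores
    statistics such as 'numRows'. If any surviving partition is missing
    the statistic, no reliable table-level count can be derived.
    """
    counts = [p.get("numRows") for p in pruned_partitions]
    if any(c is None for c in counts):
        return None  # statistic unknown; caller falls back to size-based estimates
    return sum(counts)
```

With this, a query pruned down to two partitions reporting numRows of 1000 and 500 keeps rowCount = 1500 instead of losing the statistic entirely.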
[jira] [Commented] (SPARK-39072) Fast Fail the remaining push blocks if shuffle stage finalized
[ https://issues.apache.org/jira/browse/SPARK-39072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529992#comment-17529992 ] Apache Spark commented on SPARK-39072: -- User 'wankunde' has created a pull request for this issue: https://github.com/apache/spark/pull/36411 > Fast Fail the remaining push blocks if shuffle stage finalized > -- > > Key: SPARK-39072 > URL: https://issues.apache.org/jira/browse/SPARK-39072 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 3.3.0 >Reporter: Wan Kun >Priority: Minor > > Map task will try to push all map outputs to external shuffle service now. > After the shuffle stage is finalized, the reduce fetch blocks RPC will be > blocked if there are still many map output blocks in flight. > We could stop pushing the remaining blocks if the shuffle stage is finalized. > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39072) Fast Fail the remaining push blocks if shuffle stage finalized
[ https://issues.apache.org/jira/browse/SPARK-39072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39072: Assignee: (was: Apache Spark) > Fast Fail the remaining push blocks if shuffle stage finalized > -- > > Key: SPARK-39072 > URL: https://issues.apache.org/jira/browse/SPARK-39072 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 3.3.0 >Reporter: Wan Kun >Priority: Minor > > Map task will try to push all map outputs to external shuffle service now. > After the shuffle stage is finalized, the reduce fetch blocks RPC will be > blocked if there are still many map output blocks in flight. > We could stop pushing the remaining blocks if the shuffle stage is finalized. > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39072) Fast Fail the remaining push blocks if shuffle stage finalized
[ https://issues.apache.org/jira/browse/SPARK-39072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529990#comment-17529990 ] Apache Spark commented on SPARK-39072: -- User 'wankunde' has created a pull request for this issue: https://github.com/apache/spark/pull/36411 > Fast Fail the remaining push blocks if shuffle stage finalized > -- > > Key: SPARK-39072 > URL: https://issues.apache.org/jira/browse/SPARK-39072 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 3.3.0 >Reporter: Wan Kun >Priority: Minor > > Map task will try to push all map outputs to external shuffle service now. > After the shuffle stage is finalized, the reduce fetch blocks RPC will be > blocked if there are still many map output blocks in flight. > We could stop pushing the remaining blocks if the shuffle stage is finalized. > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39072) Fast Fail the remaining push blocks if shuffle stage finalized
[ https://issues.apache.org/jira/browse/SPARK-39072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39072: Assignee: Apache Spark > Fast Fail the remaining push blocks if shuffle stage finalized > -- > > Key: SPARK-39072 > URL: https://issues.apache.org/jira/browse/SPARK-39072 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 3.3.0 >Reporter: Wan Kun >Assignee: Apache Spark >Priority: Minor > > Map task will try to push all map outputs to external shuffle service now. > After the shuffle stage is finalized, the reduce fetch blocks RPC will be > blocked if there are still many map output blocks in flight. > We could stop pushing the remaining blocks if the shuffle stage is finalized. > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39069) Simplify another conditionals case in predicate
[ https://issues.apache.org/jira/browse/SPARK-39069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39069: Assignee: (was: Apache Spark) > Simplify another conditionals case in predicate > --- > > Key: SPARK-39069 > URL: https://issues.apache.org/jira/browse/SPARK-39069 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > > {code:scala} > sql( > """ > |CREATE TABLE t1 ( > | id DECIMAL(18,0), > | event_dt DATE, > | cmpgn_run_dt DATE) > |USING parquet > |PARTITIONED BY (cmpgn_run_dt) > """.stripMargin) > sql( > """ > |select count(*) > |from t1 > |where CMPGN_RUN_DT >= date_sub(EVENT_DT,2) and CMPGN_RUN_DT <= EVENT_DT > |and EVENT_DT ='2022-04-05' > |; > """.stripMargin).explain(true) > {code} > Expected: > {noformat} > == Optimized Logical Plan == > Aggregate [count(1) AS count(1)#4L] > +- Project >+- Filter (((isnotnull(CMPGN_RUN_DT#3) AND (CMPGN_RUN_DT#3 >= 2022-04-03)) > AND (CMPGN_RUN_DT#3 <= 2022-04-05)) AND (EVENT_DT#2 = 2022-04-05)) > +- Relation default.t1[id#1,event_dt#2,cmpgn_run_dt#3] parquet > == Physical Plan == > *(2) HashAggregate(keys=[], functions=[count(1)], output=[count(1)#4L]) > +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [id=#31] >+- *(1) HashAggregate(keys=[], functions=[partial_count(1)], > output=[count#7L]) > +- *(1) Project > +- *(1) Filter (EVENT_DT#2 = 2022-04-05) > +- *(1) ColumnarToRow >+- FileScan parquet default.t1[event_dt#2,cmpgn_run_dt#3] > Batched: true, DataFilters: [(event_dt#2 = 2022-04-05)], Format: Parquet, > Location: InMemoryFileIndex[], PartitionFilters: [isnotnull(cmpgn_run_dt#3), > (cmpgn_run_dt#3 >= 2022-04-03), (cmpgn_run_dt#3 <= 2022-04-05)], > PushedFilters: [EqualTo(event_dt,2022-04-05)], ReadSchema: > struct, UsedIndexes: [] > {noformat} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, 
e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39069) Simplify another conditionals case in predicate
[ https://issues.apache.org/jira/browse/SPARK-39069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39069: Assignee: Apache Spark > Simplify another conditionals case in predicate > --- > > Key: SPARK-39069 > URL: https://issues.apache.org/jira/browse/SPARK-39069 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > > {code:scala} > sql( > """ > |CREATE TABLE t1 ( > | id DECIMAL(18,0), > | event_dt DATE, > | cmpgn_run_dt DATE) > |USING parquet > |PARTITIONED BY (cmpgn_run_dt) > """.stripMargin) > sql( > """ > |select count(*) > |from t1 > |where CMPGN_RUN_DT >= date_sub(EVENT_DT,2) and CMPGN_RUN_DT <= EVENT_DT > |and EVENT_DT ='2022-04-05' > |; > """.stripMargin).explain(true) > {code} > Expected: > {noformat} > == Optimized Logical Plan == > Aggregate [count(1) AS count(1)#4L] > +- Project >+- Filter (((isnotnull(CMPGN_RUN_DT#3) AND (CMPGN_RUN_DT#3 >= 2022-04-03)) > AND (CMPGN_RUN_DT#3 <= 2022-04-05)) AND (EVENT_DT#2 = 2022-04-05)) > +- Relation default.t1[id#1,event_dt#2,cmpgn_run_dt#3] parquet > == Physical Plan == > *(2) HashAggregate(keys=[], functions=[count(1)], output=[count(1)#4L]) > +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [id=#31] >+- *(1) HashAggregate(keys=[], functions=[partial_count(1)], > output=[count#7L]) > +- *(1) Project > +- *(1) Filter (EVENT_DT#2 = 2022-04-05) > +- *(1) ColumnarToRow >+- FileScan parquet default.t1[event_dt#2,cmpgn_run_dt#3] > Batched: true, DataFilters: [(event_dt#2 = 2022-04-05)], Format: Parquet, > Location: InMemoryFileIndex[], PartitionFilters: [isnotnull(cmpgn_run_dt#3), > (cmpgn_run_dt#3 >= 2022-04-03), (cmpgn_run_dt#3 <= 2022-04-05)], > PushedFilters: [EqualTo(event_dt,2022-04-05)], ReadSchema: > struct, UsedIndexes: [] > {noformat} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For 
additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39069) Simplify another conditionals case in predicate
[ https://issues.apache.org/jira/browse/SPARK-39069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529984#comment-17529984 ] Apache Spark commented on SPARK-39069: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/36410 > Simplify another conditionals case in predicate > --- > > Key: SPARK-39069 > URL: https://issues.apache.org/jira/browse/SPARK-39069 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > > {code:scala} > sql( > """ > |CREATE TABLE t1 ( > | id DECIMAL(18,0), > | event_dt DATE, > | cmpgn_run_dt DATE) > |USING parquet > |PARTITIONED BY (cmpgn_run_dt) > """.stripMargin) > sql( > """ > |select count(*) > |from t1 > |where CMPGN_RUN_DT >= date_sub(EVENT_DT,2) and CMPGN_RUN_DT <= EVENT_DT > |and EVENT_DT ='2022-04-05' > |; > """.stripMargin).explain(true) > {code} > Expected: > {noformat} > == Optimized Logical Plan == > Aggregate [count(1) AS count(1)#4L] > +- Project >+- Filter (((isnotnull(CMPGN_RUN_DT#3) AND (CMPGN_RUN_DT#3 >= 2022-04-03)) > AND (CMPGN_RUN_DT#3 <= 2022-04-05)) AND (EVENT_DT#2 = 2022-04-05)) > +- Relation default.t1[id#1,event_dt#2,cmpgn_run_dt#3] parquet > == Physical Plan == > *(2) HashAggregate(keys=[], functions=[count(1)], output=[count(1)#4L]) > +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [id=#31] >+- *(1) HashAggregate(keys=[], functions=[partial_count(1)], > output=[count#7L]) > +- *(1) Project > +- *(1) Filter (EVENT_DT#2 = 2022-04-05) > +- *(1) ColumnarToRow >+- FileScan parquet default.t1[event_dt#2,cmpgn_run_dt#3] > Batched: true, DataFilters: [(event_dt#2 = 2022-04-05)], Format: Parquet, > Location: InMemoryFileIndex[], PartitionFilters: [isnotnull(cmpgn_run_dt#3), > (cmpgn_run_dt#3 >= 2022-04-03), (cmpgn_run_dt#3 <= 2022-04-05)], > PushedFilters: [EqualTo(event_dt,2022-04-05)], ReadSchema: > struct, UsedIndexes: [] > {noformat} -- This message was sent by 
Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
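The simplification SPARK-39069 asks for can be checked by hand: once EVENT_DT is pinned by the equality predicate to the literal '2022-04-05', the two EVENT_DT-dependent bounds on CMPGN_RUN_DT collapse to constants, which is what lets them show up as PartitionFilters in the expected plan. A small Python sketch of that constant folding (illustrative only, not Catalyst code):

```python
from datetime import date, timedelta

def fold_partition_bounds(event_dt):
    # CMPGN_RUN_DT >= date_sub(EVENT_DT, 2) AND CMPGN_RUN_DT <= EVENT_DT,
    # with EVENT_DT fixed by an equality predicate, becomes a constant range.
    lower = event_dt - timedelta(days=2)  # date_sub(EVENT_DT, 2)
    upper = event_dt
    return lower, upper
```

For EVENT_DT = 2022-04-05 this yields the bounds 2022-04-03 and 2022-04-05 that appear in the optimized plan's PartitionFilters.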
[jira] [Created] (SPARK-39072) Fast Fail the remaining push blocks if shuffle stage finalized
Wan Kun created SPARK-39072: --- Summary: Fast Fail the remaining push blocks if shuffle stage finalized Key: SPARK-39072 URL: https://issues.apache.org/jira/browse/SPARK-39072 Project: Spark Issue Type: Improvement Components: Shuffle Affects Versions: 3.3.0 Reporter: Wan Kun Map tasks currently try to push all map outputs to the external shuffle service. After the shuffle stage is finalized, the reduce-side fetch-blocks RPC can be blocked if there are still many map output blocks in flight. We can stop pushing the remaining blocks once the shuffle stage is finalized. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
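The behavior SPARK-39072 proposes, stop pushing once the stage is finalized rather than letting doomed pushes queue up, can be sketched as follows. The class and method names are hypothetical, not Spark's actual ShuffleBlockPusher API:

```python
import threading

class BlockPusher:
    """Pushes map output blocks until the shuffle stage is finalized."""

    def __init__(self):
        self._finalized = threading.Event()

    def finalize(self):
        # Called when the shuffle stage is finalized; later pushes can no
        # longer be merged by the shuffle service, so they are wasted work.
        self._finalized.set()

    def push_all(self, blocks, push_fn):
        pushed = 0
        for block in blocks:
            if self._finalized.is_set():
                break  # fast-fail: skip the remaining blocks
            push_fn(block)
            pushed += 1
        return pushed
```

If finalization happens mid-push, the loop drops the remaining blocks instead of issuing RPCs that would only delay reduce-side fetches.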
[jira] [Commented] (SPARK-38729) Test the error class: FAILED_SET_ORIGINAL_PERMISSION_BACK
[ https://issues.apache.org/jira/browse/SPARK-38729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529961#comment-17529961 ] Apache Spark commented on SPARK-38729: -- User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/36409 > Test the error class: FAILED_SET_ORIGINAL_PERMISSION_BACK > - > > Key: SPARK-38729 > URL: https://issues.apache.org/jira/browse/SPARK-38729 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Minor > Labels: starter > > Add at least one test for the error class > *FAILED_SET_ORIGINAL_PERMISSION_BACK* to QueryExecutionErrorsSuite. The test > should cover the exception throw in QueryExecutionErrors: > {code:scala} > def failToSetOriginalPermissionBackError( > permission: FsPermission, > path: Path, > e: Throwable): Throwable = { > new SparkSecurityException(errorClass = > "FAILED_SET_ORIGINAL_PERMISSION_BACK", > Array(permission.toString, path.toString, e.getMessage)) > } > {code} > For example, here is a test for the error class *UNSUPPORTED_FEATURE*: > https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170 > +The test must have a check of:+ > # the entire error message > # sqlState if it is defined in the error-classes.json file > # the error class -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38729) Test the error class: FAILED_SET_ORIGINAL_PERMISSION_BACK
[ https://issues.apache.org/jira/browse/SPARK-38729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529960#comment-17529960 ] Apache Spark commented on SPARK-38729: -- User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/36409 > Test the error class: FAILED_SET_ORIGINAL_PERMISSION_BACK > - > > Key: SPARK-38729 > URL: https://issues.apache.org/jira/browse/SPARK-38729 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Minor > Labels: starter > > Add at least one test for the error class > *FAILED_SET_ORIGINAL_PERMISSION_BACK* to QueryExecutionErrorsSuite. The test > should cover the exception throw in QueryExecutionErrors: > {code:scala} > def failToSetOriginalPermissionBackError( > permission: FsPermission, > path: Path, > e: Throwable): Throwable = { > new SparkSecurityException(errorClass = > "FAILED_SET_ORIGINAL_PERMISSION_BACK", > Array(permission.toString, path.toString, e.getMessage)) > } > {code} > For example, here is a test for the error class *UNSUPPORTED_FEATURE*: > https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170 > +The test must have a check of:+ > # the entire error message > # sqlState if it is defined in the error-classes.json file > # the error class -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38729) Test the error class: FAILED_SET_ORIGINAL_PERMISSION_BACK
[ https://issues.apache.org/jira/browse/SPARK-38729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38729: Assignee: (was: Apache Spark) > Test the error class: FAILED_SET_ORIGINAL_PERMISSION_BACK > - > > Key: SPARK-38729 > URL: https://issues.apache.org/jira/browse/SPARK-38729 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Minor > Labels: starter > > Add at least one test for the error class > *FAILED_SET_ORIGINAL_PERMISSION_BACK* to QueryExecutionErrorsSuite. The test > should cover the exception throw in QueryExecutionErrors: > {code:scala} > def failToSetOriginalPermissionBackError( > permission: FsPermission, > path: Path, > e: Throwable): Throwable = { > new SparkSecurityException(errorClass = > "FAILED_SET_ORIGINAL_PERMISSION_BACK", > Array(permission.toString, path.toString, e.getMessage)) > } > {code} > For example, here is a test for the error class *UNSUPPORTED_FEATURE*: > https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170 > +The test must have a check of:+ > # the entire error message > # sqlState if it is defined in the error-classes.json file > # the error class -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38729) Test the error class: FAILED_SET_ORIGINAL_PERMISSION_BACK
[ https://issues.apache.org/jira/browse/SPARK-38729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38729: Assignee: Apache Spark > Test the error class: FAILED_SET_ORIGINAL_PERMISSION_BACK > - > > Key: SPARK-38729 > URL: https://issues.apache.org/jira/browse/SPARK-38729 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Minor > Labels: starter > > Add at least one test for the error class > *FAILED_SET_ORIGINAL_PERMISSION_BACK* to QueryExecutionErrorsSuite. The test > should cover the exception throw in QueryExecutionErrors: > {code:scala} > def failToSetOriginalPermissionBackError( > permission: FsPermission, > path: Path, > e: Throwable): Throwable = { > new SparkSecurityException(errorClass = > "FAILED_SET_ORIGINAL_PERMISSION_BACK", > Array(permission.toString, path.toString, e.getMessage)) > } > {code} > For example, here is a test for the error class *UNSUPPORTED_FEATURE*: > https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170 > +The test must have a check of:+ > # the entire error message > # sqlState if it is defined in the error-classes.json file > # the error class -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
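The ticket spells out three checks every such test must make: the error class, the sqlState (when defined in error-classes.json), and the entire message. A generic plain-Python helper mirroring that shape (the dict layout and the sample message are illustrative, not Spark's SparkThrowable API):

```python
def check_error(error, error_class, sql_state, message):
    """Assert the three properties SPARK-38729 requires a test to cover."""
    assert error["errorClass"] == error_class
    assert error["sqlState"] == sql_state  # only meaningful when defined in error-classes.json
    assert error["message"] == message     # the entire message, not a substring
    return True
```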
[jira] [Updated] (SPARK-39070) Pivoting a dataframe after two explodes raises an org.codehaus.commons.compiler.CompileException
[ https://issues.apache.org/jira/browse/SPARK-39070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Altenberg updated SPARK-39070: Component/s: PySpark > Pivoting a dataframe after two explodes raises an > org.codehaus.commons.compiler.CompileException > > > Key: SPARK-39070 > URL: https://issues.apache.org/jira/browse/SPARK-39070 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.2.0, 3.2.1 > Environment: Tested on: > MacOS 12.3.1, python versions 3.8.10, 3.9.12 > Ubuntu 20.04, python version 3.8.10 > Tested spark versions 3.1.2, 3.1.3, 3.2.1 and 3.2.0 installed through conda > and pip >Reporter: Felix Altenberg >Priority: Major > Attachments: spark_pivot_error.txt > > > The following Code raises an exception starting in spark version 3.2.0 > {code:java} > import pyspark.sql.functions as sf > from pyspark.sql import Row > data = [ > Row( > other_value=10, > ships=[ > Row( > fields=[ > Row(name="field1", value=1), > Row(name="field2", value=2), > ] > ), > Row( > fields=[ > Row(name="field1", value=3), > Row(name="field3", value=4), > ] > ), > ], > ) > ] > df = spark.createDataFrame(data) > df = df.withColumn("ships", sf.explode("ships")).select( > sf.col("other_value"), sf.col("ships.*") > ) > df = df.withColumn("fields", sf.explode("fields")).select("other_value", > "fields.*") > df = df.groupBy("other_value").pivot("name") #ERROR THROWN HERE {code} > The "pivot("name")" is what causes the error to be thrown. > Having even one less explode (by manually transforming the input dataframe to > what it would look like after the first explode) causes this error not to > appear. > Adding a .cache() before the groupBy and pivot also causes this error to > dissapear. > Strangely, this error does not appear when running the above code in > Databricks using Databricks Runtime 10.4 which is also using spark version > 3.2.1 > I have attached the full stacktrace of the error below. 
> I will try to investigate more after work today and will add anything else I > find. > The exception raised is > {noformat} > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 61, Column 78: A method named "numElements" is not declared in any enclosing > class nor any supertype, nor through a static import{noformat} > but seems to be caused by this exception > {noformat} > Caused by: java.lang.ClassCastException: > org.apache.spark.sql.catalyst.expressions.UnsafeArrayData cannot be cast to > org.apache.spark.sql.catalyst.InternalRow{noformat} > The following stackoverflow question seems to pertain to a very similar error > [https://stackoverflow.com/questions/70356375/pyspark-pivot-unsafearraydata-cannot-be-cast-to-internalrow] -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39070) Pivoting a dataframe after two explodes raises an org.codehaus.commons.compiler.CompileException
[ https://issues.apache.org/jira/browse/SPARK-39070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Altenberg updated SPARK-39070: Description: The following Code raises an exception starting in spark version 3.2.0 {code:java} import pyspark.sql.functions as sf from pyspark.sql import Row data = [ Row( other_value=10, ships=[ Row( fields=[ Row(name="field1", value=1), Row(name="field2", value=2), ] ), Row( fields=[ Row(name="field1", value=3), Row(name="field3", value=4), ] ), ], ) ] df = spark.createDataFrame(data) df = df.withColumn("ships", sf.explode("ships")).select( sf.col("other_value"), sf.col("ships.*") ) df = df.withColumn("fields", sf.explode("fields")).select("other_value", "fields.*") df = df.groupBy("other_value").pivot("name") #ERROR THROWN HERE {code} The "pivot("name")" is what causes the error to be thrown. Having even one less explode (by manually transforming the input dataframe to what it would look like after the first explode) causes this error not to appear. Adding a .cache() before the groupBy and pivot also causes this error to dissapear. Strangely, this error does not appear when running the above code in Databricks using Databricks Runtime 10.4 which is also using spark version 3.2.1 I have attached the full stacktrace of the error below. I will try to investigate more after work today and will add anything else I find. 
The exception raised is {noformat} org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 61, Column 78: A method named "numElements" is not declared in any enclosing class nor any supertype, nor through a static import{noformat} but seems to be caused by this exception {noformat} Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.UnsafeArrayData cannot be cast to org.apache.spark.sql.catalyst.InternalRow{noformat} The following Stack Overflow question seems to pertain to a very similar error [https://stackoverflow.com/questions/70356375/pyspark-pivot-unsafearraydata-cannot-be-cast-to-internalrow] was: The following code raises an exception starting in Spark version 3.2.0 {code:java} import pyspark.sql.functions as sf from pyspark.sql import Row data = [ Row( other_value=10, ships=[ Row( fields=[ Row(name="field1", value=1), Row(name="field2", value=2), ] ), Row( fields=[ Row(name="field1", value=3), Row(name="field3", value=4), ] ), ], ) ] df = spark.createDataFrame(data) df = df.withColumn("ships", sf.explode("ships")).select( sf.col("other_value"), sf.col("ships.*") ) df = df.withColumn("fields", sf.explode("fields")).select("other_value", "fields.*") df = df.groupBy("other_value").pivot("name") #ERROR THROWN HERE {code} The "pivot("name")" call is what causes the error to be thrown. Having even one fewer explode (by manually transforming the input dataframe to what it would look like after the first explode) causes this error not to appear. Adding a .cache() before the groupBy and pivot also causes this error to disappear. Strangely, this error does not appear when running the above code in Databricks using Databricks Runtime 10.4, which also uses Spark version 3.2.1. I have attached the full stack trace of the error below. I will try to investigate more after work today and will add anything else I find. 
The following stackoverflow question seems to pertain to a very similar error [https://stackoverflow.com/questions/70356375/pyspark-pivot-unsafearraydata-cannot-be-cast-to-internalrow] > Pivoting a dataframe after two explodes raises an > org.codehaus.commons.compiler.CompileException > > > Key: SPARK-39070 > URL: https://issues.apache.org/jira/browse/SPARK-39070 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.2.1 > Environment: Tested on: > MacOS 12.3.1, python versions 3.8.10, 3.9.12 > Ubuntu 20.04, python version 3.8.10 > Tested spark versions 3.1.2, 3.1.3, 3.2.1 and 3.2.0 installed through conda > and pip >Reporter: Felix Altenberg >Priority: Major > Attachments: spark_pivot_error.txt > > > The following Code raises an exception starting in spark version 3.2.0 > {code:java} > import pyspark.sql.functions as sf > from pyspark.sql import Row > data = [ > Row( > other_value=10, > ships=[ > Ro
[jira] [Updated] (SPARK-39070) Pivoting a dataframe after two explodes raises an org.codehaus.commons.compiler.CompileException
[ https://issues.apache.org/jira/browse/SPARK-39070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Altenberg updated SPARK-39070: Description: The following code raises an exception starting in Spark version 3.2.0 {code:java} import pyspark.sql.functions as sf from pyspark.sql import Row data = [ Row( other_value=10, ships=[ Row( fields=[ Row(name="field1", value=1), Row(name="field2", value=2), ] ), Row( fields=[ Row(name="field1", value=3), Row(name="field3", value=4), ] ), ], ) ] df = spark.createDataFrame(data) df = df.withColumn("ships", sf.explode("ships")).select( sf.col("other_value"), sf.col("ships.*") ) df = df.withColumn("fields", sf.explode("fields")).select("other_value", "fields.*") df = df.groupBy("other_value").pivot("name") #ERROR THROWN HERE {code} The "pivot("name")" call is what causes the error to be thrown. Having even one fewer explode (by manually transforming the input dataframe to what it would look like after the first explode) causes this error not to appear. Adding a .cache() before the groupBy and pivot also causes this error to disappear. Strangely, this error does not appear when running the above code in Databricks using Databricks Runtime 10.4, which also uses Spark version 3.2.1. I have attached the full stack trace of the error below. I will try to investigate more after work today and will add anything else I find. 
The following Stack Overflow question seems to pertain to a very similar error [https://stackoverflow.com/questions/70356375/pyspark-pivot-unsafearraydata-cannot-be-cast-to-internalrow] was: The following code raises an exception starting in Spark version 3.2.0 {code:java} import pyspark.sql.functions as sf from pyspark.sql import Row data = [ Row( other_value=10, ships=[ Row( fields=[ Row(name="field1", value=1), Row(name="field2", value=2), ] ), Row( fields=[ Row(name="field1", value=3), Row(name="field3", value=4), ] ), ], ) ] df = spark.createDataFrame(data) df = df.withColumn("ships", sf.explode("ships")).select( sf.col("other_value"), sf.col("ships.*") ) df = df.withColumn("fields", sf.explode("fields")).select("other_value", "fields.*") df = df.groupBy("other_value").pivot("name") #ERROR THROWN HERE {code} The "pivot("name")" call is what causes the error to be thrown. Having even one fewer explode (by manually transforming the input dataframe to what it would look like after the first explode) causes this error not to appear. Adding a .cache() before the groupBy and pivot also causes this error to disappear. Strangely, this error does not appear when running the above code in Databricks using Databricks Runtime 10.4, which also uses Spark version 3.2.1. I have attached the full stack trace of the error below. I will try to investigate more after work today and will add anything else I find. 
> Pivoting a dataframe after two explodes raises an > org.codehaus.commons.compiler.CompileException > > > Key: SPARK-39070 > URL: https://issues.apache.org/jira/browse/SPARK-39070 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.2.1 > Environment: Tested on: > MacOS 12.3.1, python versions 3.8.10, 3.9.12 > Ubuntu 20.04, python version 3.8.10 > Tested spark versions 3.1.2, 3.1.3, 3.2.1 and 3.2.0 installed through conda > and pip >Reporter: Felix Altenberg >Priority: Major > Attachments: spark_pivot_error.txt > > > The following Code raises an exception starting in spark version 3.2.0 > {code:java} > import pyspark.sql.functions as sf > from pyspark.sql import Row > data = [ > Row( > other_value=10, > ships=[ > Row( > fields=[ > Row(name="field1", value=1), > Row(name="field2", value=2), > ] > ), > Row( > fields=[ > Row(name="field1", value=3), > Row(name="field3", value=4), > ] > ), > ], > ) > ] > df = spark.createDataFrame(data) > df = df.withColumn("ships", sf.explode("ships")).select( > sf.col("other_value"), sf.col("ships.*") > ) > df = df.withColumn("fields", sf.explode("fields")).select("other_value", > "fields.*") > df = df.groupBy("other_value").pivot("na
[jira] [Updated] (SPARK-39070) Pivoting a dataframe after two explodes raises an org.codehaus.commons.compiler.CompileException
[ https://issues.apache.org/jira/browse/SPARK-39070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Altenberg updated SPARK-39070: Description: The following code raises an exception starting in Spark version 3.2.0 {code:java} import pyspark.sql.functions as sf from pyspark.sql import Row data = [ Row( other_value=10, ships=[ Row( fields=[ Row(name="field1", value=1), Row(name="field2", value=2), ] ), Row( fields=[ Row(name="field1", value=3), Row(name="field3", value=4), ] ), ], ) ] df = spark.createDataFrame(data) df = df.withColumn("ships", sf.explode("ships")).select( sf.col("other_value"), sf.col("ships.*") ) df = df.withColumn("fields", sf.explode("fields")).select("other_value", "fields.*") df = df.groupBy("other_value").pivot("name") #ERROR THROWN HERE {code} The "pivot("name")" call is what causes the error to be thrown. Having even one fewer explode (by manually transforming the input dataframe to what it would look like after the first explode) causes this error not to appear. Adding a .cache() before the groupBy and pivot also causes this error to disappear. Strangely, this error does not appear when running the above code in Databricks using Databricks Runtime 10.4, which also uses Spark version 3.2.1. I have attached the full stack trace of the error below. I will try to investigate more after work today and will add anything else I find. 
was: The following code raises an exception starting in Spark version 3.2.0 {code:java} import pyspark.sql.functions as sf from pyspark.sql import Row data = [ Row( other_value=10, ships=[ Row( fields=[ Row(name="field1", value=1), Row(name="field2", value=2), ] ), Row( fields=[ Row(name="field1", value=3), Row(name="field3", value=4), ] ), ], ) ] df = spark.createDataFrame(data) df = df.withColumn("ships", sf.explode("ships")).select( sf.col("other_value"), sf.col("ships.*") ) df = df.withColumn("fields", sf.explode("fields")).select("other_value", "fields.*") df = df.groupBy("other_value").pivot("name") #ERROR THROWN HERE {code} The "pivot("name")" call is what causes the error to be thrown. Having even one fewer explode (by manually transforming the input dataframe to what it would look like after the first explode) causes this error not to appear. Adding a .cache() before the groupBy and pivot also causes this error to disappear. I have attached the full stack trace of the error below. I will try to investigate more after work today and will add anything else I find. 
> Pivoting a dataframe after two explodes raises an > org.codehaus.commons.compiler.CompileException > > > Key: SPARK-39070 > URL: https://issues.apache.org/jira/browse/SPARK-39070 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.2.1 > Environment: Tested on: > MacOS 12.3.1, python versions 3.8.10, 3.9.12 > Ubuntu 20.04, python version 3.8.10 > Tested spark versions 3.1.2, 3.1.3, 3.2.1 and 3.2.0 installed through conda > and pip >Reporter: Felix Altenberg >Priority: Major > Attachments: spark_pivot_error.txt > > > The following Code raises an exception starting in spark version 3.2.0 > {code:java} > import pyspark.sql.functions as sf > from pyspark.sql import Row > data = [ > Row( > other_value=10, > ships=[ > Row( > fields=[ > Row(name="field1", value=1), > Row(name="field2", value=2), > ] > ), > Row( > fields=[ > Row(name="field1", value=3), > Row(name="field3", value=4), > ] > ), > ], > ) > ] > df = spark.createDataFrame(data) > df = df.withColumn("ships", sf.explode("ships")).select( > sf.col("other_value"), sf.col("ships.*") > ) > df = df.withColumn("fields", sf.explode("fields")).select("other_value", > "fields.*") > df = df.groupBy("other_value").pivot("name") #ERROR THROWN HERE {code} > The "pivot("name")" is what causes the error to be thrown. > Having even one less explode (by manually transforming the input dataframe to > what it would look like after the first explode) causes this error not to > appear. > Adding a .cache() before the groupBy and pivot also causes this error
[jira] [Commented] (SPARK-39071) Add unwrap_udt function for unwrapping UserDefinedType columns
[ https://issues.apache.org/jira/browse/SPARK-39071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529933#comment-17529933 ] Apache Spark commented on SPARK-39071: -- User 'WeichenXu123' has created a pull request for this issue: https://github.com/apache/spark/pull/36408 > Add unwrap_udt function for unwrapping UserDefinedType columns > -- > > Key: SPARK-39071 > URL: https://issues.apache.org/jira/browse/SPARK-39071 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.4.0 >Reporter: Weichen Xu >Priority: Major > > Add unwrap_udt function for unwrapping UserDefinedType columns. > This is useful in the open-source project https://github.com/mengxr/pyspark-xgboost -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39071) Add unwrap_udt function for unwrapping UserDefinedType columns
[ https://issues.apache.org/jira/browse/SPARK-39071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39071: Assignee: (was: Apache Spark) > Add unwrap_udt function for unwrapping UserDefinedType columns > -- > > Key: SPARK-39071 > URL: https://issues.apache.org/jira/browse/SPARK-39071 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.4.0 >Reporter: Weichen Xu >Priority: Major > > Add unwrap_udt function for unwrapping UserDefinedType columns. > This is useful in the open-source project https://github.com/mengxr/pyspark-xgboost -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39071) Add unwrap_udt function for unwrapping UserDefinedType columns
[ https://issues.apache.org/jira/browse/SPARK-39071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39071: Assignee: Apache Spark > Add unwrap_udt function for unwrapping UserDefinedType columns > -- > > Key: SPARK-39071 > URL: https://issues.apache.org/jira/browse/SPARK-39071 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.4.0 >Reporter: Weichen Xu >Assignee: Apache Spark >Priority: Major > > Add unwrap_udt function for unwrapping UserDefinedType columns. > This is useful in the open-source project https://github.com/mengxr/pyspark-xgboost -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39071) Add unwrap_udt function for unwrapping UserDefinedType columns
[ https://issues.apache.org/jira/browse/SPARK-39071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529932#comment-17529932 ] Apache Spark commented on SPARK-39071: -- User 'WeichenXu123' has created a pull request for this issue: https://github.com/apache/spark/pull/36408 > Add unwrap_udt function for unwrapping UserDefinedType columns > -- > > Key: SPARK-39071 > URL: https://issues.apache.org/jira/browse/SPARK-39071 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.4.0 >Reporter: Weichen Xu >Priority: Major > > Add unwrap_udt function for unwrapping UserDefinedType columns. > This is useful in the open-source project https://github.com/mengxr/pyspark-xgboost -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39071) Add unwrap_udt function for unwrapping UserDefinedType columns
Weichen Xu created SPARK-39071: -- Summary: Add unwrap_udt function for unwrapping UserDefinedType columns Key: SPARK-39071 URL: https://issues.apache.org/jira/browse/SPARK-39071 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.4.0 Reporter: Weichen Xu Add unwrap_udt function for unwrapping UserDefinedType columns. This is useful in the open-source project https://github.com/mengxr/pyspark-xgboost -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
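As a conceptual illustration only: a UserDefinedType persists objects in an internal struct form, and "unwrapping" such a column means exposing that internal representation instead of the deserialized object, which is what an external ML integration typically wants to consume. The sketch below uses hypothetical toy names, not Spark's actual UDT machinery or the unwrap_udt API proposed in this ticket:

```python
# Stdlib-only toy model of what "unwrapping" a UserDefinedType (UDT) means.
# A UDT stores objects in an internal struct form; "unwrap" returns that
# internal struct directly instead of the deserialized Python object.
# All names here are hypothetical, NOT Spark's actual classes or functions.

class ToyVectorUDT:
    """Toy UDT: converts between a Python tuple and an internal struct."""
    def serialize(self, obj):
        return {"type": "dense", "values": list(obj)}

    def deserialize(self, struct):
        return tuple(struct["values"])

def unwrap_udt(udt, obj):
    # Expose the raw internal representation rather than the wrapped object,
    # so downstream code can consume the struct fields directly.
    return udt.serialize(obj)

print(unwrap_udt(ToyVectorUDT(), (1.0, 2.0)))  # {'type': 'dense', 'values': [1.0, 2.0]}
```

The round trip (serialize then deserialize) is what Spark normally does transparently; unwrapping skips the second half.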
[jira] [Updated] (SPARK-39064) FileNotFoundException while reading from Kafka
[ https://issues.apache.org/jira/browse/SPARK-39064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sushmita chauhan updated SPARK-39064: - Description: We are running a stateful structured streaming job which reads from Kafka and writes to HDFS. And we are hitting this exception: 17/12/08 05:20:12 ERROR FileFormatWriter: Aborting job null. org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 4, hcube1-1n03.eng.hortonworks.com, executor 1): java.lang.IllegalStateException: Error reading delta file /checkpointDir/state/0/0/1.delta of HDFSStateStoreProvider[id = (op=0, part=0), dir = /checkpointDir/state/0/0]: /checkpointDir/state/0/0/1.delta does not exist at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$updateFromDeltaFile(HDFSBackedStateStoreProvider.scala:410) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:362) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:359) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:359) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:358) at scala.Option.getOrElse(Option.scala:121) at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap(HDFSBackedStateStoreProvider.scala:358) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:360) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:359) at scala.Option.getOrElse(Option.scala:121 Of course, the file doesn't exist in HDFS, and in the {{state/0/0}} directory there is no file at all, while we do have some files in the commits and offsets folders. I am not sure about the reason for this behavior. It seems to happen the second time the job is started. If any more information is needed, I would be happy to provide it. I would also appreciate some input on how best to resolve this issue. was: We are running a stateful structured streaming job which reads from Kafka and writes to HDFS, and we are hitting this exception: 17/12/08 05:20:12 ERROR FileFormatWriter: Aborting job null. 
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 4, hcube1-1n03.eng.hortonworks.com, executor 1): java.lang.IllegalStateException: Error reading delta file /checkpointDir/state/0/0/1.delta of HDFSStateStoreProvider[id = (op=0, part=0), dir = /checkpointDir/state/0/0]: /checkpointDir/state/0/0/1.delta does not exist at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$updateFromDeltaFile(HDFSBackedStateStoreProvider.scala:410) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:362) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:359) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:359) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBacked
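The failure mode in the report above (a committed batch whose state delta files are missing) can be modeled with a toy sketch. This is an assumption-level illustration of the recovery contract suggested by the stack trace — restoring state version N requires delta files 1.delta through N.delta under state/&lt;operator&gt;/&lt;partition&gt; — and `load_state` is a hypothetical stand-in, not Spark's implementation:

```python
# Toy model (NOT Spark's code) of the state-store recovery contract implied
# by the stack trace: replaying state version N requires delta files
# 1.delta .. N.delta under state/<operator>/<partition>. If the commit log
# says a batch finished but the state directory is empty, recovery fails
# with exactly this kind of "delta file ... does not exist" error.
import os
import tempfile

def load_state(version, state_dir):
    """Hypothetical sketch: verify every delta file up to `version` exists."""
    for v in range(1, version + 1):
        delta = os.path.join(state_dir, f"{v}.delta")
        if not os.path.exists(delta):
            raise FileNotFoundError(
                f"Error reading delta file {delta}: {delta} does not exist")

empty_checkpoint = tempfile.mkdtemp()  # stands in for the empty state/0/0
try:
    load_state(1, empty_checkpoint)
except FileNotFoundError as e:
    print(e)  # mirrors the IllegalStateException message in the report
```

This is why the problem tends to surface on the second start: the first run leaves commits/offsets entries behind, and the restarted query then tries to reload state versions whose delta files were never (or no longer) present.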
[jira] [Updated] (SPARK-39070) Pivoting a dataframe after two explodes raises an org.codehaus.commons.compiler.CompileException
[ https://issues.apache.org/jira/browse/SPARK-39070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Altenberg updated SPARK-39070: Environment: Tested on: MacOS 12.3.1, python versions 3.8.10, 3.9.12 Ubuntu 20.04, python version 3.8.10 Tested spark versions 3.1.2, 3.1.3, 3.2.1 and 3.2.0 installed through conda and pip was: Tested on: MacOS 12.3.1, python versions 3.8.10, 3.9.12 Ubuntu 20.04, python version 3.8.10 Tested spark versions 3.2.1 and 3.2.0 installed through conda and pip > Pivoting a dataframe after two explodes raises an > org.codehaus.commons.compiler.CompileException > > > Key: SPARK-39070 > URL: https://issues.apache.org/jira/browse/SPARK-39070 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.2.1 > Environment: Tested on: > MacOS 12.3.1, python versions 3.8.10, 3.9.12 > Ubuntu 20.04, python version 3.8.10 > Tested spark versions 3.1.2, 3.1.3, 3.2.1 and 3.2.0 installed through conda > and pip >Reporter: Felix Altenberg >Priority: Major > Attachments: spark_pivot_error.txt > > > The following Code raises an exception starting in spark version 3.2.0 > {code:java} > import pyspark.sql.functions as sf > from pyspark.sql import Row > data = [ > Row( > other_value=10, > ships=[ > Row( > fields=[ > Row(name="field1", value=1), > Row(name="field2", value=2), > ] > ), > Row( > fields=[ > Row(name="field1", value=3), > Row(name="field3", value=4), > ] > ), > ], > ) > ] > df = spark.createDataFrame(data) > df = df.withColumn("ships", sf.explode("ships")).select( > sf.col("other_value"), sf.col("ships.*") > ) > df = df.withColumn("fields", sf.explode("fields")).select("other_value", > "fields.*") > df = df.groupBy("other_value").pivot("name") #ERROR THROWN HERE {code} > The "pivot("name")" is what causes the error to be thrown. > Having even one less explode (by manually transforming the input dataframe to > what it would look like after the first explode) causes this error not to > appear. 
> Adding a .cache() before the groupBy and pivot also causes this error to > disappear. > I have attached the full stack trace of the error below. > I will try to investigate more after work today and will add anything else I > find. > > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
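For readers trying to follow the shape of the repro, the double-explode-then-pivot transformation can be sketched without Spark. The following is a stdlib-only Python illustration of what the PySpark snippet computes on its sample data; the aggregation choice (first value per field name) is an assumption, since the repro fails before an aggregation is ever specified:

```python
# Stdlib-only sketch (no PySpark) of the SPARK-39070 repro: explode ships,
# explode fields, then pivot the field names into columns per other_value.
from collections import defaultdict

data = [
    {"other_value": 10,
     "ships": [
         {"fields": [{"name": "field1", "value": 1},
                     {"name": "field2", "value": 2}]},
         {"fields": [{"name": "field1", "value": 3},
                     {"name": "field3", "value": 4}]},
     ]}
]

# The two explodes flatten the nested arrays into one row per field entry.
rows = [
    {"other_value": rec["other_value"], "name": f["name"], "value": f["value"]}
    for rec in data
    for ship in rec["ships"]
    for f in ship["fields"]
]

# Pivot: one output row per other_value, one column per distinct name
# (keeping the first value seen for each name -- an assumed aggregation).
pivoted = defaultdict(dict)
for r in rows:
    pivoted[r["other_value"]].setdefault(r["name"], r["value"])

print(dict(pivoted))  # {10: {'field1': 1, 'field2': 2, 'field3': 4}}
```

Nothing here is exotic, which supports the reporter's reading that the failure is in Spark's generated code for the exploded arrays rather than in the logical transformation itself.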
[jira] [Updated] (SPARK-39070) Pivoting a dataframe after two explodes raises an org.codehaus.commons.compiler.CompileException
[ https://issues.apache.org/jira/browse/SPARK-39070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Altenberg updated SPARK-39070: Description: The following code raises an exception starting in Spark version 3.2.0 {code:java} import pyspark.sql.functions as sf from pyspark.sql import Row data = [ Row( other_value=10, ships=[ Row( fields=[ Row(name="field1", value=1), Row(name="field2", value=2), ] ), Row( fields=[ Row(name="field1", value=3), Row(name="field3", value=4), ] ), ], ) ] df = spark.createDataFrame(data) df = df.withColumn("ships", sf.explode("ships")).select( sf.col("other_value"), sf.col("ships.*") ) df = df.withColumn("fields", sf.explode("fields")).select("other_value", "fields.*") df = df.groupBy("other_value").pivot("name") #ERROR THROWN HERE {code} The "pivot("name")" call is what causes the error to be thrown. Having even one fewer explode (by manually transforming the input dataframe to what it would look like after the first explode) causes this error not to appear. Adding a .cache() before the groupBy and pivot also causes this error to disappear. I have attached the full stack trace of the error below. I will try to investigate more after work today and will add anything else I find. was: The following code raises an exception starting in Spark version 3.2.0 {code:java} import pyspark.sql.functions as sf from pyspark.sql import Row data = [ Row( other_value=10, ships=[ Row( fields=[ Row(name="field1", value=1), Row(name="field2", value=2), ] ), Row( fields=[ Row(name="field1", value=3), Row(name="field3", value=4), ] ), ], ) ] df = spark.createDataFrame(data) df = df.withColumn("ships", sf.explode("ships")).select( sf.col("other_value"), sf.col("ships.*") ) df = df.withColumn("fields", sf.explode("fields")).select("other_value", "fields.*") df = df.groupBy("other_value").pivot("name") #ERROR THROWN HERE {code} The "pivot("name")" call is what causes the error to be thrown. I have attached the full stack trace of the error below. 
I will try to investigate more after work today and will add anything else I find. > Pivoting a dataframe after two explodes raises an > org.codehaus.commons.compiler.CompileException > > > Key: SPARK-39070 > URL: https://issues.apache.org/jira/browse/SPARK-39070 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.2.1 > Environment: Tested on: > MacOS 12.3.1, python versions 3.8.10, 3.9.12 > Ubuntu 20.04, python version 3.8.10 > Tested spark versions 3.2.1 and 3.2.0 installed through conda and pip >Reporter: Felix Altenberg >Priority: Major > Attachments: spark_pivot_error.txt > > > The following code raises an exception starting in Spark version 3.2.0 > {code:java} > import pyspark.sql.functions as sf > from pyspark.sql import Row > data = [ > Row( > other_value=10, > ships=[ > Row( > fields=[ > Row(name="field1", value=1), > Row(name="field2", value=2), > ] > ), > Row( > fields=[ > Row(name="field1", value=3), > Row(name="field3", value=4), > ] > ), > ], > ) > ] > df = spark.createDataFrame(data) > df = df.withColumn("ships", sf.explode("ships")).select( > sf.col("other_value"), sf.col("ships.*") > ) > df = df.withColumn("fields", sf.explode("fields")).select("other_value", > "fields.*") > df = df.groupBy("other_value").pivot("name") #ERROR THROWN HERE {code} > The "pivot("name")" call is what causes the error to be thrown. > Having even one fewer explode (by manually transforming the input dataframe to > what it would look like after the first explode) causes this error not to > appear. > Adding a .cache() before the groupBy and pivot also causes this error to > disappear. > I have attached the full stack trace of the error below. > I will try to investigate more after work today and will add anything else I > find. > > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.a
[jira] [Updated] (SPARK-39070) Pivoting a dataframe after two explodes raises an org.codehaus.commons.compiler.CompileException
[ https://issues.apache.org/jira/browse/SPARK-39070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Altenberg updated SPARK-39070: Description: The following code raises an exception starting in Spark version 3.2.0 {code:java} import pyspark.sql.functions as sf from pyspark.sql import Row data = [ Row( other_value=10, ships=[ Row( fields=[ Row(name="field1", value=1), Row(name="field2", value=2), ] ), Row( fields=[ Row(name="field1", value=3), Row(name="field3", value=4), ] ), ], ) ] df = spark.createDataFrame(data) df = df.withColumn("ships", sf.explode("ships")).select( sf.col("other_value"), sf.col("ships.*") ) df = df.withColumn("fields", sf.explode("fields")).select("other_value", "fields.*") df = df.groupBy("other_value").pivot("name") #ERROR THROWN HERE {code} The "pivot("name")" call is what causes the error to be thrown. I have attached the full stack trace of the error below. I will try to investigate more after work today and will add anything else I find. was: The following code raises an exception starting in Spark version 3.2.0 {code:java} import pyspark.sql.functions as sf from pyspark.sql import Row data = [ Row( other_value=10, ships=[ Row( fields=[ Row(name="field1", value=1), Row(name="field2", value=2), ] ), Row( fields=[ Row(name="field1", value=3), Row(name="field3", value=4), ] ), ], ) ] df = spark.createDataFrame(data) df = df.withColumn("ships", sf.explode("ships")).select( sf.col("other_value"), sf.col("ships.*") ) df = df.withColumn("fields", sf.explode("fields")).select("other_value", "fields.*") df = df.groupBy("other_value").pivot("name") #ERROR THROWN HERE {code} The "pivot("name")" call is what causes the error to be thrown. I'll try to attach the full error message. 
> Pivoting a dataframe after two explodes raises an > org.codehaus.commons.compiler.CompileException > > > Key: SPARK-39070 > URL: https://issues.apache.org/jira/browse/SPARK-39070 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.2.1 > Environment: Tested on: > MacOS 12.3.1, python versions 3.8.10, 3.9.12 > Ubuntu 20.04, python version 3.8.10 > Tested spark versions 3.2.1 and 3.2.0 installed through conda and pip >Reporter: Felix Altenberg >Priority: Major > Attachments: spark_pivot_error.txt > > > The following Code raises an exception starting in spark version 3.2.0 > {code:java} > import pyspark.sql.functions as sf > from pyspark.sql import Row > data = [ > Row( > other_value=10, > ships=[ > Row( > fields=[ > Row(name="field1", value=1), > Row(name="field2", value=2), > ] > ), > Row( > fields=[ > Row(name="field1", value=3), > Row(name="field3", value=4), > ] > ), > ], > ) > ] > df = spark.createDataFrame(data) > df = df.withColumn("ships", sf.explode("ships")).select( > sf.col("other_value"), sf.col("ships.*") > ) > df = df.withColumn("fields", sf.explode("fields")).select("other_value", > "fields.*") > df = df.groupBy("other_value").pivot("name") #ERROR THROWN HERE {code} > The "pivot("name")" is what causes the error to be thrown. > I have attached the full stacktrace of the error below. > I will try to investigate more after work today and will add anything else I > find. > > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39070) Pivoting a dataframe after two explodes raises an org.codehaus.commons.compiler.CompileException
[ https://issues.apache.org/jira/browse/SPARK-39070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Altenberg updated SPARK-39070: Attachment: spark_pivot_error.txt > Pivoting a dataframe after two explodes raises an > org.codehaus.commons.compiler.CompileException > > > Key: SPARK-39070 > URL: https://issues.apache.org/jira/browse/SPARK-39070 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.2.1 > Environment: Tested on: > MacOS 12.3.1, python versions 3.8.10, 3.9.12 > Ubuntu 20.04, python version 3.8.10 > Tested spark versions 3.2.1 and 3.2.0 installed through conda and pip >Reporter: Felix Altenberg >Priority: Major > Attachments: spark_pivot_error.txt > > > The following Code raises an exception starting in spark version 3.2.0 > {code:java} > import pyspark.sql.functions as sf > from pyspark.sql import Row > data = [ > Row( > other_value=10, > ships=[ > Row( > fields=[ > Row(name="field1", value=1), > Row(name="field2", value=2), > ] > ), > Row( > fields=[ > Row(name="field1", value=3), > Row(name="field3", value=4), > ] > ), > ], > ) > ] > df = spark.createDataFrame(data) > df = df.withColumn("ships", sf.explode("ships")).select( > sf.col("other_value"), sf.col("ships.*") > ) > df = df.withColumn("fields", sf.explode("fields")).select("other_value", > "fields.*") > df = df.groupBy("other_value").pivot("name") #ERROR THROWN HERE {code} > The "pivot("name")" is what causes the error to be thrown. > I'll try to attach the full error message. > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39070) Pivoting a dataframe after two explodes raises an org.codehaus.commons.compiler.CompileException
Felix Altenberg created SPARK-39070: --- Summary: Pivoting a dataframe after two explodes raises an org.codehaus.commons.compiler.CompileException Key: SPARK-39070 URL: https://issues.apache.org/jira/browse/SPARK-39070 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.1, 3.2.0 Environment: Tested on: MacOS 12.3.1, python versions 3.8.10, 3.9.12 Ubuntu 20.04, python version 3.8.10 Tested spark versions 3.2.1 and 3.2.0 installed through conda and pip Reporter: Felix Altenberg The following Code raises an exception starting in spark version 3.2.0 {code:java} import pyspark.sql.functions as sf from pyspark.sql import Row data = [ Row( other_value=10, ships=[ Row( fields=[ Row(name="field1", value=1), Row(name="field2", value=2), ] ), Row( fields=[ Row(name="field1", value=3), Row(name="field3", value=4), ] ), ], ) ] df = spark.createDataFrame(data) df = df.withColumn("ships", sf.explode("ships")).select( sf.col("other_value"), sf.col("ships.*") ) df = df.withColumn("fields", sf.explode("fields")).select("other_value", "fields.*") df = df.groupBy("other_value").pivot("name") #ERROR THROWN HERE {code} The "pivot("name")" is what causes the error to be thrown. I'll try to attach the full error message. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
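[Editor's illustration] The explode -> explode -> pivot pipeline in the report can be mimicked with plain Python dicts to show the shape of the result the reporter is after; this is only a stand-in (no Spark involved, and unlike DataFrame.pivot no aggregate is applied, so the last value per name wins). A commonly suggested mitigation for codegen CompileExceptions is `spark.sql.codegen.wholeStage=false`, though whether it helps this particular case is unverified.

```python
# Plain-Python stand-in for the reporter's pipeline, using dicts instead of Rows.
data = [{"other_value": 10,
         "ships": [{"fields": [{"name": "field1", "value": 1},
                               {"name": "field2", "value": 2}]},
                   {"fields": [{"name": "field1", "value": 3},
                               {"name": "field3", "value": 4}]}]}]

rows = []
for rec in data:
    for ship in rec["ships"]:         # first explode("ships")
        for field in ship["fields"]:  # second explode("fields")
            rows.append({"other_value": rec["other_value"], **field})

# pivot("name"): one column per distinct field name, grouped by other_value
# (no aggregation here -- the last value seen for a name wins)
pivoted = {}
for r in rows:
    pivoted.setdefault(r["other_value"], {})[r["name"]] = r["value"]
```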
[jira] [Commented] (SPARK-39064) FileNotFoundException while reading from Kafka
[ https://issues.apache.org/jira/browse/SPARK-39064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529922#comment-17529922 ] Sushmita chauhan commented on SPARK-39064: -- Hi, I'm using Spark Structured Streaming and Spark SQL. I read data from Kafka with a windowing operation, union the Kafka and Hive data into a single dataframe, find the latest data by timestamp with a window function and row_number, and then save the dataframe to a Hive table.
{code:scala}
val sparkSession: SparkSession = SparkSession.builder().appName(SPARK_APP_NAME)
  .config("hive.metastore.warehouse.dir", SPARK_HIVE_WAREHOUSE_DIR_SUSPICIOUS_USER)
  // .config("spark.default.parallelism", 70)
  .config("hive.exec.dynamic.partition.mode", "nonstrict")
  .config("hive.support.concurrency", true)
  .config("spark.sql.orc.impl", "native")
  .config("spark.speculation", "false")

query = kakfa_SuspciousUser
  .selectExpr(
    "nullif(TimeStamp, \"\") as TimeStamp",
    "nullif(UserName, \"\") as UserName")
  .groupBy(window(col("TimeStamp"), "30 minute"), col("UserName"))
  .agg(functions.count("UserName").alias("user_UserCount"))
  .select("UserName", "window.end", "user_UserCount")
  .filter(col("user_UserCount").gt(4))
  .writeStream
  .foreachBatch(function = (batchDF: Dataset[Row], batchId: Long) => {
    println("drop table if exists CompromiseUserID_transaction")
    val hive_target_Table = hiveCache.sparkSession.table("User_temp_table")
    println("before hive")
    /* WRITE FINAL DATASET IN SuspiciousIP HIVE TABLE */
    hive_target_Table.write.mode(SaveMode.Append).format("hive").saveAsTable("parichaynew.CompromiseUserIDTable")
  })
  .outputMode("update").trigger(Trigger.ProcessingTime(10, TimeUnit.SECONDS))
  .option("checkpointLocation", CHECK_POINT)
  .start()

query.awaitTermination()
{code}
This is my code; please help me find the issue. The streaming job runs continuously for about 5 hours; after 5 hours, the issue starts with the usual *FileNotFoundException* that happens with HDFS. 
Error reading delta file /checkpointDir/state/0/0/1.delta of HDFSStateStoreProvider[id = (op=0, part=0), dir = /checkpointDir/state/0/0]: /checkpointDir/state/0/0/1.delta does not exist at > FileNotFoundException while reading from Kafka > -- > > Key: SPARK-39064 > URL: https://issues.apache.org/jira/browse/SPARK-39064 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.8 >Reporter: Sushmita chauhan >Priority: Critical > Fix For: 2.4.8 > > Original Estimate: 72h > Remaining Estimate: 72h > > We are running a stateful structured streaming job which reads from Kafka and > writes to HDFS. And we are hitting this exception: > > > 17/12/08 05:20:12 ERROR FileFormatWriter: Aborting job null. > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 > (TID 4, hcube1-1n03.eng.hortonworks.com, executor 1): > java.lang.IllegalStateException: Error reading delta file > /checkpointDir/state/0/0/1.delta of HDFSStateStoreProvider[id = (op=0, > part=0), dir = /checkpointDir/state/0/0]: /checkpointDir/state/0/0/1.delta > does not exist at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$updateFromDeltaFile(HDFSBackedStateStoreProvider.scala:410) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:362) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:359) > at scala.Option.getOrElse(Option.scala:121) at > 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:359) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:358) > at scala.Option.getOrElse(Option.scala:121) at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap(HDFSBackedStateStoreProvider.scala:358) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$H
[jira] [Created] (SPARK-39069) Simplify another conditionals case in predicate
Yuming Wang created SPARK-39069: --- Summary: Simplify another conditionals case in predicate Key: SPARK-39069 URL: https://issues.apache.org/jira/browse/SPARK-39069 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: Yuming Wang {code:scala} sql( """ |CREATE TABLE t1 ( | id DECIMAL(18,0), | event_dt DATE, | cmpgn_run_dt DATE) |USING parquet |PARTITIONED BY (cmpgn_run_dt) """.stripMargin) sql( """ |select count(*) |from t1 |where CMPGN_RUN_DT >= date_sub(EVENT_DT,2) and CMPGN_RUN_DT <= EVENT_DT |and EVENT_DT ='2022-04-05' |; """.stripMargin).explain(true) {code} Expected: {noformat} == Optimized Logical Plan == Aggregate [count(1) AS count(1)#4L] +- Project +- Filter (((isnotnull(CMPGN_RUN_DT#3) AND (CMPGN_RUN_DT#3 >= 2022-04-03)) AND (CMPGN_RUN_DT#3 <= 2022-04-05)) AND (EVENT_DT#2 = 2022-04-05)) +- Relation default.t1[id#1,event_dt#2,cmpgn_run_dt#3] parquet == Physical Plan == *(2) HashAggregate(keys=[], functions=[count(1)], output=[count(1)#4L]) +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [id=#31] +- *(1) HashAggregate(keys=[], functions=[partial_count(1)], output=[count#7L]) +- *(1) Project +- *(1) Filter (EVENT_DT#2 = 2022-04-05) +- *(1) ColumnarToRow +- FileScan parquet default.t1[event_dt#2,cmpgn_run_dt#3] Batched: true, DataFilters: [(event_dt#2 = 2022-04-05)], Format: Parquet, Location: InMemoryFileIndex[], PartitionFilters: [isnotnull(cmpgn_run_dt#3), (cmpgn_run_dt#3 >= 2022-04-03), (cmpgn_run_dt#3 <= 2022-04-05)], PushedFilters: [EqualTo(event_dt,2022-04-05)], ReadSchema: struct, UsedIndexes: [] {noformat} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
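[Editor's note] The simplification this ticket asks for boils down to constant-folding the date arithmetic: with EVENT_DT fixed by the equality predicate, date_sub(EVENT_DT, 2) is a constant, so the predicate on the partition column CMPGN_RUN_DT collapses to a closed range usable for partition pruning. A minimal sketch of the folded bounds, with the dates taken from the example query:

```python
from datetime import date, timedelta

# EVENT_DT is pinned to '2022-04-05' by the equality filter, so both bounds
# on CMPGN_RUN_DT become constants, matching the optimized plan's
# PartitionFilters: cmpgn_run_dt >= 2022-04-03 and cmpgn_run_dt <= 2022-04-05.
event_dt = date(2022, 4, 5)
lower = event_dt - timedelta(days=2)  # date_sub(EVENT_DT, 2)
upper = event_dt                      # CMPGN_RUN_DT <= EVENT_DT
```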
[jira] [Commented] (SPARK-39068) Make thriftserver and sparksql-cli support in-memory catalog
[ https://issues.apache.org/jira/browse/SPARK-39068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529915#comment-17529915 ] Apache Spark commented on SPARK-39068: -- User 'yaooqinn' has created a pull request for this issue: https://github.com/apache/spark/pull/36407 > Make thriftserver and sparksql-cli support in-memory catalog > > > Key: SPARK-39068 > URL: https://issues.apache.org/jira/browse/SPARK-39068 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Kent Yao >Assignee: Apache Spark >Priority: Major > > # with in-memory, we can start multiple instances in the same folder without > the limitation of derby. > # spark now is multi catalog architecture, the built-in catalog is not > always necessary, in-memory is much lightweight if it is unused -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39068) Make thriftserver and sparksql-cli support in-memory catalog
[ https://issues.apache.org/jira/browse/SPARK-39068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39068: Assignee: (was: Apache Spark) > Make thriftserver and sparksql-cli support in-memory catalog > > > Key: SPARK-39068 > URL: https://issues.apache.org/jira/browse/SPARK-39068 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Kent Yao >Priority: Major > > # with in-memory, we can start multiple instances in the same folder without > the limitation of derby. > # spark now is multi catalog architecture, the built-in catalog is not > always necessary, in-memory is much lightweight if it is unused -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39068) Make thriftserver and sparksql-cli support in-memory catalog
[ https://issues.apache.org/jira/browse/SPARK-39068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39068: Assignee: Apache Spark > Make thriftserver and sparksql-cli support in-memory catalog > > > Key: SPARK-39068 > URL: https://issues.apache.org/jira/browse/SPARK-39068 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Kent Yao >Assignee: Apache Spark >Priority: Major > > # with in-memory, we can start multiple instances in the same folder without > the limitation of derby. > # spark now is multi catalog architecture, the built-in catalog is not > always necessary, in-memory is much lightweight if it is unused -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39068) Make thriftserver and sparksql-cli support in-memory catalog
[ https://issues.apache.org/jira/browse/SPARK-39068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529914#comment-17529914 ] Apache Spark commented on SPARK-39068: -- User 'yaooqinn' has created a pull request for this issue: https://github.com/apache/spark/pull/36407 > Make thriftserver and sparksql-cli support in-memory catalog > > > Key: SPARK-39068 > URL: https://issues.apache.org/jira/browse/SPARK-39068 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Kent Yao >Priority: Major > > # with in-memory, we can start multiple instances in the same folder without > the limitation of derby. > # spark now is multi catalog architecture, the built-in catalog is not > always necessary, in-memory is much lightweight if it is unused -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39068) Make thriftserver and sparksql-cli support in-memory catalog
Kent Yao created SPARK-39068: Summary: Make thriftserver and sparksql-cli support in-memory catalog Key: SPARK-39068 URL: https://issues.apache.org/jira/browse/SPARK-39068 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: Kent Yao # With in-memory, we can start multiple instances in the same folder without the limitation of Derby. # Spark now has a multi-catalog architecture; the built-in catalog is not always necessary, and in-memory is much more lightweight when it is unused -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
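[Editor's sketch] What the requested behavior might look like from the command line, assuming the existing conf spark.sql.catalogImplementation (values hive / in-memory) is what these entry points would honor once the change lands; this is a hypothetical invocation, not a documented feature of the current release:

```shell
# Hypothetical usage: start the CLI / Thrift server against the lightweight
# in-memory catalog instead of the Derby-backed Hive metastore.
bin/spark-sql --conf spark.sql.catalogImplementation=in-memory
sbin/start-thriftserver.sh --conf spark.sql.catalogImplementation=in-memory
```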
[jira] [Commented] (SPARK-39064) FileNotFoundException while reading from Kafka
[ https://issues.apache.org/jira/browse/SPARK-39064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529902#comment-17529902 ] oskarryn commented on SPARK-39064: -- Related https://issues.apache.org/jira/browse/SPARK-26359 If you don't have a tmp delta file to rename like in the issue above, then probably the fastest workaround for you is the deletion of the checkpointDir (or defining a new checkpoint location). If the Spark reader is configured to start from the latest offset, this means probably losing some data. If the reader is configured to use the earliest offset, then some data may be reread and cause issues downstream (duplicates). Would be great if someone could explain what causes this issue and/or how to deal with this error in a better way (using the same checkpointing directory). > FileNotFoundException while reading from Kafka > -- > > Key: SPARK-39064 > URL: https://issues.apache.org/jira/browse/SPARK-39064 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.8 >Reporter: Sushmita chauhan >Priority: Critical > Fix For: 2.4.8 > > Original Estimate: 72h > Remaining Estimate: 72h > > We are running a stateful structured streaming job which reads from Kafka and > writes to HDFS. And we are hitting this exception: > > > 17/12/08 05:20:12 ERROR FileFormatWriter: Aborting job null. 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 > (TID 4, hcube1-1n03.eng.hortonworks.com, executor 1): > java.lang.IllegalStateException: Error reading delta file > /checkpointDir/state/0/0/1.delta of HDFSStateStoreProvider[id = (op=0, > part=0), dir = /checkpointDir/state/0/0]: /checkpointDir/state/0/0/1.delta > does not exist at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$updateFromDeltaFile(HDFSBackedStateStoreProvider.scala:410) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:362) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:359) > at scala.Option.getOrElse(Option.scala:121) at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:359) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:358) > at scala.Option.getOrElse(Option.scala:121) at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap(HDFSBackedStateStoreProvider.scala:358) > at > 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:360) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:359) > at scala.Option.getOrElse(Option.scala:121 > > > > Of course, the file doesn't exist in HDFS. And in the {{state/0/0}} directory > there is no file at all. While we have some files in the commits and offsets > folders. I am not sure about the reason of this behavior. It seems to happen > on the second time the job is started, -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
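[Editor's illustration] The path in the stack trace follows the state store's fixed layout: the commits/offsets folders say a batch completed, so on restart the provider demands state/&lt;operator&gt;/&lt;partition&gt;/&lt;batch&gt;.delta under the checkpoint location and fails when it is gone. A small plain-Python helper (illustrative only, not Spark code) makes that expectation explicit:

```python
import os

def required_delta(checkpoint_dir, operator, partition, batch_id):
    # HDFSBackedStateStoreProvider restores version N from
    # <checkpoint>/state/<op>/<part>/<N>.delta (plus any earlier snapshot).
    return os.path.join(checkpoint_dir, "state", str(operator),
                        str(partition), f"{batch_id}.delta")

# The exact file named in the error message above:
missing = required_delta("/checkpointDir", 0, 0, 1)
```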
[jira] [Commented] (SPARK-39067) Upgrade scala-maven-plugin to 4.6.1
[ https://issues.apache.org/jira/browse/SPARK-39067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529898#comment-17529898 ] Apache Spark commented on SPARK-39067: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/36406 > Upgrade scala-maven-plugin to 4.6.1 > --- > > Key: SPARK-39067 > URL: https://issues.apache.org/jira/browse/SPARK-39067 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39067) Upgrade scala-maven-plugin to 4.6.1
[ https://issues.apache.org/jira/browse/SPARK-39067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39067: Assignee: Apache Spark > Upgrade scala-maven-plugin to 4.6.1 > --- > > Key: SPARK-39067 > URL: https://issues.apache.org/jira/browse/SPARK-39067 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39067) Upgrade scala-maven-plugin to 4.6.1
[ https://issues.apache.org/jira/browse/SPARK-39067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39067: Assignee: (was: Apache Spark) > Upgrade scala-maven-plugin to 4.6.1 > --- > > Key: SPARK-39067 > URL: https://issues.apache.org/jira/browse/SPARK-39067 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39067) Upgrade scala-maven-plugin to 4.6.1
Yang Jie created SPARK-39067: Summary: Upgrade scala-maven-plugin to 4.6.1 Key: SPARK-39067 URL: https://issues.apache.org/jira/browse/SPARK-39067 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.4.0 Reporter: Yang Jie -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-26365) spark-submit for k8s cluster doesn't propagate exit code
[ https://issues.apache.org/jira/browse/SPARK-26365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529867#comment-17529867 ] Mayank Asthana edited comment on SPARK-26365 at 4/29/22 8:51 AM: - [~unnamed101] Your change looks good. Can you open a pull request on [https://github.com/apache/spark] for an official review? was (Author: masthana): [~oscar.bonilla] Your change looks good. Can you open a pull request on [https://github.com/apache/spark] for an official review? > spark-submit for k8s cluster doesn't propagate exit code > > > Key: SPARK-26365 > URL: https://issues.apache.org/jira/browse/SPARK-26365 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Spark Core, Spark Submit >Affects Versions: 2.3.2, 2.4.0, 3.0.0, 3.1.0 >Reporter: Oscar Bonilla >Priority: Major > Attachments: spark-2.4.5-raise-exception-k8s-failure.patch, > spark-3.0.0-raise-exception-k8s-failure.patch > > > When launching apps using spark-submit in a kubernetes cluster, if the Spark > applications fails (returns exit code = 1 for example), spark-submit will > still exit gracefully and return exit code = 0. > This is problematic, since there's no way to know if there's been a problem > with the Spark application. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26365) spark-submit for k8s cluster doesn't propagate exit code
[ https://issues.apache.org/jira/browse/SPARK-26365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529867#comment-17529867 ] Mayank Asthana commented on SPARK-26365: [~oscar.bonilla] Your change looks good. Can you open a pull request on [https://github.com/apache/spark] for an official review? > spark-submit for k8s cluster doesn't propagate exit code > > > Key: SPARK-26365 > URL: https://issues.apache.org/jira/browse/SPARK-26365 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Spark Core, Spark Submit >Affects Versions: 2.3.2, 2.4.0, 3.0.0, 3.1.0 >Reporter: Oscar Bonilla >Priority: Major > Attachments: spark-2.4.5-raise-exception-k8s-failure.patch, > spark-3.0.0-raise-exception-k8s-failure.patch > > > When launching apps using spark-submit in a kubernetes cluster, if the Spark > applications fails (returns exit code = 1 for example), spark-submit will > still exit gracefully and return exit code = 0. > This is problematic, since there's no way to know if there's been a problem > with the Spark application. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
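[Editor's illustration] The fix being asked for is ordinary exit-status propagation. In plain Python terms (with sys.exit(1) standing in for a failed Spark driver), a launcher should surface the child's code rather than swallow it and return 0:

```python
import subprocess
import sys

# Child process that fails, standing in for a failed Spark application.
child = subprocess.run([sys.executable, "-c", "import sys; sys.exit(1)"])

# What the reporter expects spark-submit to do: exit with the driver's
# status instead of unconditionally returning 0.
propagated = child.returncode
```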
[jira] [Updated] (SPARK-39066) Run `./dev/test-dependencies.sh --replace-manifest ` use Maven 3.8.5 produce wrong result
[ https://issues.apache.org/jira/browse/SPARK-39066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-39066: - Attachment: spark-deps-hadoop-2-hive-2.3 > Run `./dev/test-dependencies.sh --replace-manifest ` use Maven 3.8.5 produce > wrong result > - > > Key: SPARK-39066 > URL: https://issues.apache.org/jira/browse/SPARK-39066 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Major > Attachments: hadoop2-deps.txt, spark-deps-hadoop-2-hive-2.3 > > > Run `./dev/test-dependencies.sh --replace-manifest ` use maven 3.8.5 produce > wrong result: > {code:java} > diff --git a/dev/deps/spark-deps-hadoop-2-hive-2.3 > b/dev/deps/spark-deps-hadoop-2-hive-2.3 > index b6df3ea5ce..e803aadcfc 100644 > --- a/dev/deps/spark-deps-hadoop-2-hive-2.3 > +++ b/dev/deps/spark-deps-hadoop-2-hive-2.3 > @@ -6,15 +6,14 @@ ST4/4.0.4//ST4-4.0.4.jar > activation/1.1.1//activation-1.1.1.jar > aircompressor/0.21//aircompressor-0.21.jar > algebra_2.12/2.0.1//algebra_2.12-2.0.1.jar > +aliyun-java-sdk-core/4.5.10//aliyun-java-sdk-core-4.5.10.jar > +aliyun-java-sdk-kms/2.11.0//aliyun-java-sdk-kms-2.11.0.jar > +aliyun-java-sdk-ram/3.1.0//aliyun-java-sdk-ram-3.1.0.jar > +aliyun-sdk-oss/3.13.0//aliyun-sdk-oss-3.13.0.jar > annotations/17.0.0//annotations-17.0.0.jar > antlr-runtime/3.5.2//antlr-runtime-3.5.2.jar > antlr4-runtime/4.8//antlr4-runtime-4.8.jar > aopalliance-repackaged/2.6.1//aopalliance-repackaged-2.6.1.jar > -aopalliance/1.0//aopalliance-1.0.jar > -apacheds-i18n/2.0.0-M15//apacheds-i18n-2.0.0-M15.jar > -apacheds-kerberos-codec/2.0.0-M15//apacheds-kerberos-codec-2.0.0-M15.jar > -api-asn1-api/1.0.0-M20//api-asn1-api-1.0.0-M20.jar > -api-util/1.0.0-M20//api-util-1.0.0-M20.jar > arpack/2.2.1//arpack-2.2.1.jar > arpack_combined_all/0.1//arpack_combined_all-0.1.jar > arrow-format/7.0.0//arrow-format-7.0.0.jar > @@ -26,7 +25,10 @@ automaton/1.11-8//automaton-1.11-8.jar > 
avro-ipc/1.11.0//avro-ipc-1.11.0.jar > avro-mapred/1.11.0//avro-mapred-1.11.0.jar > avro/1.11.0//avro-1.11.0.jar > -azure-storage/2.0.0//azure-storage-2.0.0.jar > +aws-java-sdk-bundle/1.11.1026//aws-java-sdk-bundle-1.11.1026.jar > +azure-data-lake-store-sdk/2.3.9//azure-data-lake-store-sdk-2.3.9.jar > +azure-keyvault-core/1.0.0//azure-keyvault-core-1.0.0.jar > +azure-storage/7.0.1//azure-storage-7.0.1.jar > blas/2.2.1//blas-2.2.1.jar > bonecp/0.8.0.RELEASE//bonecp-0.8.0.RELEASE.jar > breeze-macros_2.12/1.2//breeze-macros_2.12-1.2.jar > @@ -34,28 +36,24 @@ breeze_2.12/1.2//breeze_2.12-1.2.jar > cats-kernel_2.12/2.1.1//cats-kernel_2.12-2.1.1.jar > chill-java/0.10.0//chill-java-0.10.0.jar > chill_2.12/0.10.0//chill_2.12-0.10.0.jar > -commons-beanutils/1.9.4//commons-beanutils-1.9.4.jar > commons-cli/1.5.0//commons-cli-1.5.0.jar > commons-codec/1.15//commons-codec-1.15.jar > commons-collections/3.2.2//commons-collections-3.2.2.jar > commons-collections4/4.4//commons-collections4-4.4.jar > commons-compiler/3.0.16//commons-compiler-3.0.16.jar > commons-compress/1.21//commons-compress-1.21.jar > -commons-configuration/1.6//commons-configuration-1.6.jar > commons-crypto/1.1.0//commons-crypto-1.1.0.jar > commons-dbcp/1.4//commons-dbcp-1.4.jar > -commons-digester/1.8//commons-digester-1.8.jar > -commons-httpclient/3.1//commons-httpclient-3.1.jar > commons-io/2.4//commons-io-2.4.jar > commons-lang/2.6//commons-lang-2.6.jar > commons-lang3/3.12.0//commons-lang3-3.12.0.jar > commons-logging/1.1.3//commons-logging-1.1.3.jar > commons-math3/3.6.1//commons-math3-3.6.1.jar > -commons-net/3.1//commons-net-3.1.jar > commons-pool/1.5.4//commons-pool-1.5.4.jar > commons-text/1.9//commons-text-1.9.jar > compress-lzf/1.1//compress-lzf-1.1.jar > core/1.1.2//core-1.1.2.jar > +cos_api-bundle/5.6.19//cos_api-bundle-5.6.19.jar > curator-client/2.7.1//curator-client-2.7.1.jar > curator-framework/2.7.1//curator-framework-2.7.1.jar > curator-recipes/2.7.1//curator-recipes-2.7.1.jar > @@ -69,25 
+67,17 @@ generex/1.0.2//generex-1.0.2.jar > gmetric4j/1.0.10//gmetric4j-1.0.10.jar > gson/2.2.4//gson-2.2.4.jar > guava/14.0.1//guava-14.0.1.jar > -guice-servlet/3.0//guice-servlet-3.0.jar > -guice/3.0//guice-3.0.jar > -hadoop-annotations/2.7.4//hadoop-annotations-2.7.4.jar > -hadoop-auth/2.7.4//hadoop-auth-2.7.4.jar > -hadoop-aws/2.7.4//hadoop-aws-2.7.4.jar > -hadoop-azure/2.7.4//hadoop-azure-2.7.4.jar > -hadoop-client/2.7.4//hadoop-client-2.7.4.jar > -hadoop-common/2.7.4//hadoop-common-2.7.4.jar > -hadoop-hdfs/2.7.4//hadoop-hdfs-2.7.4.jar > -hadoop-mapreduce-client-app/2.7.4//hadoop-mapreduce-client-app-2.7.4.jar > -hadoop-mapreduce-client-common/2.7.4//hadoop-mapreduce-client-common-2.7.4.jar > -hadoop-mapreduce-client-core/2.7.4//hadoop-mapred
[jira] [Updated] (SPARK-39066) Run `./dev/test-dependencies.sh --replace-manifest ` use Maven 3.8.5 produce wrong result
[ https://issues.apache.org/jira/browse/SPARK-39066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-39066: - Attachment: hadoop2-deps.txt > Run `./dev/test-dependencies.sh --replace-manifest ` use Maven 3.8.5 produce > wrong result > - > > Key: SPARK-39066 > URL: https://issues.apache.org/jira/browse/SPARK-39066 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Major > Attachments: hadoop2-deps.txt > > > Run `./dev/test-dependencies.sh --replace-manifest ` use maven 3.8.5 produce > wrong result: > {code:java} > diff --git a/dev/deps/spark-deps-hadoop-2-hive-2.3 > b/dev/deps/spark-deps-hadoop-2-hive-2.3 > index b6df3ea5ce..e803aadcfc 100644 > --- a/dev/deps/spark-deps-hadoop-2-hive-2.3 > +++ b/dev/deps/spark-deps-hadoop-2-hive-2.3 > @@ -6,15 +6,14 @@ ST4/4.0.4//ST4-4.0.4.jar > activation/1.1.1//activation-1.1.1.jar > aircompressor/0.21//aircompressor-0.21.jar > algebra_2.12/2.0.1//algebra_2.12-2.0.1.jar > +aliyun-java-sdk-core/4.5.10//aliyun-java-sdk-core-4.5.10.jar > +aliyun-java-sdk-kms/2.11.0//aliyun-java-sdk-kms-2.11.0.jar > +aliyun-java-sdk-ram/3.1.0//aliyun-java-sdk-ram-3.1.0.jar > +aliyun-sdk-oss/3.13.0//aliyun-sdk-oss-3.13.0.jar > annotations/17.0.0//annotations-17.0.0.jar > antlr-runtime/3.5.2//antlr-runtime-3.5.2.jar > antlr4-runtime/4.8//antlr4-runtime-4.8.jar > aopalliance-repackaged/2.6.1//aopalliance-repackaged-2.6.1.jar > -aopalliance/1.0//aopalliance-1.0.jar > -apacheds-i18n/2.0.0-M15//apacheds-i18n-2.0.0-M15.jar > -apacheds-kerberos-codec/2.0.0-M15//apacheds-kerberos-codec-2.0.0-M15.jar > -api-asn1-api/1.0.0-M20//api-asn1-api-1.0.0-M20.jar > -api-util/1.0.0-M20//api-util-1.0.0-M20.jar > arpack/2.2.1//arpack-2.2.1.jar > arpack_combined_all/0.1//arpack_combined_all-0.1.jar > arrow-format/7.0.0//arrow-format-7.0.0.jar > @@ -26,7 +25,10 @@ automaton/1.11-8//automaton-1.11-8.jar > avro-ipc/1.11.0//avro-ipc-1.11.0.jar > 
avro-mapred/1.11.0//avro-mapred-1.11.0.jar > avro/1.11.0//avro-1.11.0.jar > -azure-storage/2.0.0//azure-storage-2.0.0.jar > +aws-java-sdk-bundle/1.11.1026//aws-java-sdk-bundle-1.11.1026.jar > +azure-data-lake-store-sdk/2.3.9//azure-data-lake-store-sdk-2.3.9.jar > +azure-keyvault-core/1.0.0//azure-keyvault-core-1.0.0.jar > +azure-storage/7.0.1//azure-storage-7.0.1.jar > blas/2.2.1//blas-2.2.1.jar > bonecp/0.8.0.RELEASE//bonecp-0.8.0.RELEASE.jar > breeze-macros_2.12/1.2//breeze-macros_2.12-1.2.jar > @@ -34,28 +36,24 @@ breeze_2.12/1.2//breeze_2.12-1.2.jar > cats-kernel_2.12/2.1.1//cats-kernel_2.12-2.1.1.jar > chill-java/0.10.0//chill-java-0.10.0.jar > chill_2.12/0.10.0//chill_2.12-0.10.0.jar > -commons-beanutils/1.9.4//commons-beanutils-1.9.4.jar > commons-cli/1.5.0//commons-cli-1.5.0.jar > commons-codec/1.15//commons-codec-1.15.jar > commons-collections/3.2.2//commons-collections-3.2.2.jar > commons-collections4/4.4//commons-collections4-4.4.jar > commons-compiler/3.0.16//commons-compiler-3.0.16.jar > commons-compress/1.21//commons-compress-1.21.jar > -commons-configuration/1.6//commons-configuration-1.6.jar > commons-crypto/1.1.0//commons-crypto-1.1.0.jar > commons-dbcp/1.4//commons-dbcp-1.4.jar > -commons-digester/1.8//commons-digester-1.8.jar > -commons-httpclient/3.1//commons-httpclient-3.1.jar > commons-io/2.4//commons-io-2.4.jar > commons-lang/2.6//commons-lang-2.6.jar > commons-lang3/3.12.0//commons-lang3-3.12.0.jar > commons-logging/1.1.3//commons-logging-1.1.3.jar > commons-math3/3.6.1//commons-math3-3.6.1.jar > -commons-net/3.1//commons-net-3.1.jar > commons-pool/1.5.4//commons-pool-1.5.4.jar > commons-text/1.9//commons-text-1.9.jar > compress-lzf/1.1//compress-lzf-1.1.jar > core/1.1.2//core-1.1.2.jar > +cos_api-bundle/5.6.19//cos_api-bundle-5.6.19.jar > curator-client/2.7.1//curator-client-2.7.1.jar > curator-framework/2.7.1//curator-framework-2.7.1.jar > curator-recipes/2.7.1//curator-recipes-2.7.1.jar > @@ -69,25 +67,17 @@ 
generex/1.0.2//generex-1.0.2.jar > gmetric4j/1.0.10//gmetric4j-1.0.10.jar > gson/2.2.4//gson-2.2.4.jar > guava/14.0.1//guava-14.0.1.jar > -guice-servlet/3.0//guice-servlet-3.0.jar > -guice/3.0//guice-3.0.jar > -hadoop-annotations/2.7.4//hadoop-annotations-2.7.4.jar > -hadoop-auth/2.7.4//hadoop-auth-2.7.4.jar > -hadoop-aws/2.7.4//hadoop-aws-2.7.4.jar > -hadoop-azure/2.7.4//hadoop-azure-2.7.4.jar > -hadoop-client/2.7.4//hadoop-client-2.7.4.jar > -hadoop-common/2.7.4//hadoop-common-2.7.4.jar > -hadoop-hdfs/2.7.4//hadoop-hdfs-2.7.4.jar > -hadoop-mapreduce-client-app/2.7.4//hadoop-mapreduce-client-app-2.7.4.jar > -hadoop-mapreduce-client-common/2.7.4//hadoop-mapreduce-client-common-2.7.4.jar > -hadoop-mapreduce-client-core/2.7.4//hadoop-mapreduce-client-core-2.7.4.jar > -hadoop-mapred
[jira] [Created] (SPARK-39066) Running `./dev/test-dependencies.sh --replace-manifest` with Maven 3.8.5 produces a wrong result
Yang Jie created SPARK-39066: Summary: Run `./dev/test-dependencies.sh --replace-manifest ` use Maven 3.8.5 produce wrong result Key: SPARK-39066 URL: https://issues.apache.org/jira/browse/SPARK-39066 Project: Spark Issue Type: Bug Components: Build Affects Versions: 3.4.0 Reporter: Yang Jie Run `./dev/test-dependencies.sh --replace-manifest ` use maven 3.8.5 produce wrong result: {code:java} diff --git a/dev/deps/spark-deps-hadoop-2-hive-2.3 b/dev/deps/spark-deps-hadoop-2-hive-2.3 index b6df3ea5ce..e803aadcfc 100644 --- a/dev/deps/spark-deps-hadoop-2-hive-2.3 +++ b/dev/deps/spark-deps-hadoop-2-hive-2.3 @@ -6,15 +6,14 @@ ST4/4.0.4//ST4-4.0.4.jar activation/1.1.1//activation-1.1.1.jar aircompressor/0.21//aircompressor-0.21.jar algebra_2.12/2.0.1//algebra_2.12-2.0.1.jar +aliyun-java-sdk-core/4.5.10//aliyun-java-sdk-core-4.5.10.jar +aliyun-java-sdk-kms/2.11.0//aliyun-java-sdk-kms-2.11.0.jar +aliyun-java-sdk-ram/3.1.0//aliyun-java-sdk-ram-3.1.0.jar +aliyun-sdk-oss/3.13.0//aliyun-sdk-oss-3.13.0.jar annotations/17.0.0//annotations-17.0.0.jar antlr-runtime/3.5.2//antlr-runtime-3.5.2.jar antlr4-runtime/4.8//antlr4-runtime-4.8.jar aopalliance-repackaged/2.6.1//aopalliance-repackaged-2.6.1.jar -aopalliance/1.0//aopalliance-1.0.jar -apacheds-i18n/2.0.0-M15//apacheds-i18n-2.0.0-M15.jar -apacheds-kerberos-codec/2.0.0-M15//apacheds-kerberos-codec-2.0.0-M15.jar -api-asn1-api/1.0.0-M20//api-asn1-api-1.0.0-M20.jar -api-util/1.0.0-M20//api-util-1.0.0-M20.jar arpack/2.2.1//arpack-2.2.1.jar arpack_combined_all/0.1//arpack_combined_all-0.1.jar arrow-format/7.0.0//arrow-format-7.0.0.jar @@ -26,7 +25,10 @@ automaton/1.11-8//automaton-1.11-8.jar avro-ipc/1.11.0//avro-ipc-1.11.0.jar avro-mapred/1.11.0//avro-mapred-1.11.0.jar avro/1.11.0//avro-1.11.0.jar -azure-storage/2.0.0//azure-storage-2.0.0.jar +aws-java-sdk-bundle/1.11.1026//aws-java-sdk-bundle-1.11.1026.jar +azure-data-lake-store-sdk/2.3.9//azure-data-lake-store-sdk-2.3.9.jar +azure-keyvault-core/1.0.0//azure-keyvault-core-1.0.0.jar 
+azure-storage/7.0.1//azure-storage-7.0.1.jar blas/2.2.1//blas-2.2.1.jar bonecp/0.8.0.RELEASE//bonecp-0.8.0.RELEASE.jar breeze-macros_2.12/1.2//breeze-macros_2.12-1.2.jar @@ -34,28 +36,24 @@ breeze_2.12/1.2//breeze_2.12-1.2.jar cats-kernel_2.12/2.1.1//cats-kernel_2.12-2.1.1.jar chill-java/0.10.0//chill-java-0.10.0.jar chill_2.12/0.10.0//chill_2.12-0.10.0.jar -commons-beanutils/1.9.4//commons-beanutils-1.9.4.jar commons-cli/1.5.0//commons-cli-1.5.0.jar commons-codec/1.15//commons-codec-1.15.jar commons-collections/3.2.2//commons-collections-3.2.2.jar commons-collections4/4.4//commons-collections4-4.4.jar commons-compiler/3.0.16//commons-compiler-3.0.16.jar commons-compress/1.21//commons-compress-1.21.jar -commons-configuration/1.6//commons-configuration-1.6.jar commons-crypto/1.1.0//commons-crypto-1.1.0.jar commons-dbcp/1.4//commons-dbcp-1.4.jar -commons-digester/1.8//commons-digester-1.8.jar -commons-httpclient/3.1//commons-httpclient-3.1.jar commons-io/2.4//commons-io-2.4.jar commons-lang/2.6//commons-lang-2.6.jar commons-lang3/3.12.0//commons-lang3-3.12.0.jar commons-logging/1.1.3//commons-logging-1.1.3.jar commons-math3/3.6.1//commons-math3-3.6.1.jar -commons-net/3.1//commons-net-3.1.jar commons-pool/1.5.4//commons-pool-1.5.4.jar commons-text/1.9//commons-text-1.9.jar compress-lzf/1.1//compress-lzf-1.1.jar core/1.1.2//core-1.1.2.jar +cos_api-bundle/5.6.19//cos_api-bundle-5.6.19.jar curator-client/2.7.1//curator-client-2.7.1.jar curator-framework/2.7.1//curator-framework-2.7.1.jar curator-recipes/2.7.1//curator-recipes-2.7.1.jar @@ -69,25 +67,17 @@ generex/1.0.2//generex-1.0.2.jar gmetric4j/1.0.10//gmetric4j-1.0.10.jar gson/2.2.4//gson-2.2.4.jar guava/14.0.1//guava-14.0.1.jar -guice-servlet/3.0//guice-servlet-3.0.jar -guice/3.0//guice-3.0.jar -hadoop-annotations/2.7.4//hadoop-annotations-2.7.4.jar -hadoop-auth/2.7.4//hadoop-auth-2.7.4.jar -hadoop-aws/2.7.4//hadoop-aws-2.7.4.jar -hadoop-azure/2.7.4//hadoop-azure-2.7.4.jar 
-hadoop-client/2.7.4//hadoop-client-2.7.4.jar -hadoop-common/2.7.4//hadoop-common-2.7.4.jar -hadoop-hdfs/2.7.4//hadoop-hdfs-2.7.4.jar -hadoop-mapreduce-client-app/2.7.4//hadoop-mapreduce-client-app-2.7.4.jar -hadoop-mapreduce-client-common/2.7.4//hadoop-mapreduce-client-common-2.7.4.jar -hadoop-mapreduce-client-core/2.7.4//hadoop-mapreduce-client-core-2.7.4.jar -hadoop-mapreduce-client-jobclient/2.7.4//hadoop-mapreduce-client-jobclient-2.7.4.jar -hadoop-mapreduce-client-shuffle/2.7.4//hadoop-mapreduce-client-shuffle-2.7.4.jar -hadoop-openstack/2.7.4//hadoop-openstack-2.7.4.jar -hadoop-yarn-api/2.7.4//hadoop-yarn-api-2.7.4.jar -hadoop-yarn-client/2.7.4//hadoop-yarn-client-2.7.4.jar -hadoop-yarn-common/2.7.4//hadoop-yarn-common-2.7.4.jar -hadoop-yarn-server-common/2.7.4//hadoop-yarn-server-common-2.7.4.jar +hadoop-aliyun/3.3.2//hadoop-aliyun-3.3.2.jar +had
[jira] [Assigned] (SPARK-39065) DS V2 Limit push-down should avoid out of memory
[ https://issues.apache.org/jira/browse/SPARK-39065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39065: Assignee: Apache Spark > DS V2 Limit push-down should avoid out of memory > > > Key: SPARK-39065 > URL: https://issues.apache.org/jira/browse/SPARK-39065 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Assignee: Apache Spark >Priority: Major > > Currently, Spark DS V2 supports pushing the Limit operator down to the data source, > but this behavior is controlled only by the pushDownList option. > If the limit is very large, the executor will pull the entire result set from the > data source, which can cause an out-of-memory error.
[jira] [Assigned] (SPARK-39065) DS V2 Limit push-down should avoid out of memory
[ https://issues.apache.org/jira/browse/SPARK-39065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39065: Assignee: (was: Apache Spark) > DS V2 Limit push-down should avoid out of memory > > > Key: SPARK-39065 > URL: https://issues.apache.org/jira/browse/SPARK-39065 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Priority: Major > > Currently, Spark DS V2 supports pushing the Limit operator down to the data source, > but this behavior is controlled only by the pushDownList option. > If the limit is very large, the executor will pull the entire result set from the > data source, which can cause an out-of-memory error.
[jira] [Commented] (SPARK-39065) DS V2 Limit push-down should avoid out of memory
[ https://issues.apache.org/jira/browse/SPARK-39065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529860#comment-17529860 ] Apache Spark commented on SPARK-39065: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/36405 > DS V2 Limit push-down should avoid out of memory > > > Key: SPARK-39065 > URL: https://issues.apache.org/jira/browse/SPARK-39065 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Priority: Major > > Currently, Spark DS V2 supports pushing the Limit operator down to the data source, > but this behavior is controlled only by the pushDownList option. > If the limit is very large, the executor will pull the entire result set from the > data source, which can cause an out-of-memory error.
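The OOM risk described in SPARK-39065 is easiest to see with a toy model of a scan. The sketch below is a minimal illustration, not Spark's actual DS V2 API (all names here are hypothetical): applying the limit inside the source scan bounds how many rows ever reach the executor, while the non-pushed variant materializes the full result set first.

```scala
// Hypothetical sketch of why limit push-down bounds executor memory.
// Nothing here is Spark's real API; the iterator stands in for a remote scan.
object LimitPushdownSketch {
  // Pretend this is a very large remote result set.
  def remoteScan(): Iterator[Int] = Iterator.range(0, 1000000)

  // Without push-down: the executor materializes the whole result set,
  // then applies the limit locally, so 1,000,000 rows are buffered.
  def withoutPushdown(limit: Int): Seq[Int] =
    remoteScan().toVector.take(limit)

  // With push-down: the limit is applied while scanning, so at most
  // `limit` rows are ever buffered.
  def withPushdown(limit: Int): Seq[Int] =
    remoteScan().take(limit).toVector

  def main(args: Array[String]): Unit = {
    assert(withPushdown(5) == withoutPushdown(5)) // same answer, less memory
    println(withPushdown(5)) // Vector(0, 1, 2, 3, 4)
  }
}
```

The ticket's point is that the second shape should be the one Spark produces whenever the source accepts the pushed limit, instead of relying solely on the push-down option.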
[jira] [Assigned] (SPARK-38737) Test the error classes: INVALID_FIELD_NAME
[ https://issues.apache.org/jira/browse/SPARK-38737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38737: Assignee: (was: Apache Spark) > Test the error classes: INVALID_FIELD_NAME > -- > > Key: SPARK-38737 > URL: https://issues.apache.org/jira/browse/SPARK-38737 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Minor > Labels: starter > > Add tests for the error class *INVALID_FIELD_NAME* to > QueryCompilationErrorsSuite. The test should cover the exception thrown in > QueryCompilationErrors: > {code:scala} > def invalidFieldName(fieldName: Seq[String], path: Seq[String], context: > Origin): Throwable = { > new AnalysisException( > errorClass = "INVALID_FIELD_NAME", > messageParameters = Array(fieldName.quoted, path.quoted), > origin = context) > } > {code} > For example, here is a test for the error class *UNSUPPORTED_FEATURE*: > https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170 > +The test must check:+ > # the entire error message > # sqlState if it is defined in the error-classes.json file > # the error class
[jira] [Assigned] (SPARK-38737) Test the error classes: INVALID_FIELD_NAME
[ https://issues.apache.org/jira/browse/SPARK-38737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38737: Assignee: Apache Spark > Test the error classes: INVALID_FIELD_NAME > -- > > Key: SPARK-38737 > URL: https://issues.apache.org/jira/browse/SPARK-38737 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Minor > Labels: starter > > Add tests for the error class *INVALID_FIELD_NAME* to > QueryCompilationErrorsSuite. The test should cover the exception thrown in > QueryCompilationErrors: > {code:scala} > def invalidFieldName(fieldName: Seq[String], path: Seq[String], context: > Origin): Throwable = { > new AnalysisException( > errorClass = "INVALID_FIELD_NAME", > messageParameters = Array(fieldName.quoted, path.quoted), > origin = context) > } > {code} > For example, here is a test for the error class *UNSUPPORTED_FEATURE*: > https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170 > +The test must check:+ > # the entire error message > # sqlState if it is defined in the error-classes.json file > # the error class
[jira] [Commented] (SPARK-38737) Test the error classes: INVALID_FIELD_NAME
[ https://issues.apache.org/jira/browse/SPARK-38737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529859#comment-17529859 ] Apache Spark commented on SPARK-38737: -- User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/36404 > Test the error classes: INVALID_FIELD_NAME > -- > > Key: SPARK-38737 > URL: https://issues.apache.org/jira/browse/SPARK-38737 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Minor > Labels: starter > > Add tests for the error class *INVALID_FIELD_NAME* to > QueryCompilationErrorsSuite. The test should cover the exception thrown in > QueryCompilationErrors: > {code:scala} > def invalidFieldName(fieldName: Seq[String], path: Seq[String], context: > Origin): Throwable = { > new AnalysisException( > errorClass = "INVALID_FIELD_NAME", > messageParameters = Array(fieldName.quoted, path.quoted), > origin = context) > } > {code} > For example, here is a test for the error class *UNSUPPORTED_FEATURE*: > https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170 > +The test must check:+ > # the entire error message > # sqlState if it is defined in the error-classes.json file > # the error class
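As a Spark-free illustration of the three checks listed in the ticket, the sketch below uses a stand-in exception type. `FakeAnalysisException`, the message template, and the sqlState value `42000` are placeholders invented for this sketch; the real test would intercept Spark's `AnalysisException` in QueryCompilationErrorsSuite and compare against error-classes.json.

```scala
// Illustrative only: a stand-in for Spark's AnalysisException, used to show
// the shape of a test that checks error class, sqlState, and full message.
object InvalidFieldNameTestSketch {
  final case class FakeAnalysisException(
      errorClass: String,
      messageParameters: Array[String],
      sqlState: String)
    extends Exception(
      s"Field name ${messageParameters(0)} is invalid: " +
        s"${messageParameters(1)} is not a struct.")

  // Stand-in for QueryCompilationErrors.invalidFieldName (simplified types).
  def invalidFieldName(fieldName: String, path: String): Throwable =
    FakeAnalysisException("INVALID_FIELD_NAME", Array(fieldName, path), "42000")

  def main(args: Array[String]): Unit = {
    val e =
      try throw invalidFieldName("`m`.`key`", "`m`")
      catch { case ex: FakeAnalysisException => ex }
    assert(e.errorClass == "INVALID_FIELD_NAME") // check 3: the error class
    assert(e.sqlState == "42000")                // check 2: sqlState (placeholder value)
    assert(e.getMessage ==                       // check 1: the entire message
      "Field name `m`.`key` is invalid: `m` is not a struct.")
    println("all three checks passed")
  }
}
```

The point of asserting on the entire message, not a substring, is that a change to the template in error-classes.json should fail the test.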
[jira] [Created] (SPARK-39065) DS V2 Limit push-down should avoid out of memory
jiaan.geng created SPARK-39065: -- Summary: DS V2 Limit push-down should avoid out of memory Key: SPARK-39065 URL: https://issues.apache.org/jira/browse/SPARK-39065 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: jiaan.geng Currently, Spark DS V2 supports push down Limit operator to data source. But the behavior only controlled by pushDownList option. If the limit is very large, then Executor will pull all the result set from data source. So it will cause the memory issue as you know. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39035) Add tests for options from `to_csv` and `from_csv`.
[ https://issues.apache.org/jira/browse/SPARK-39035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-39035: Assignee: Haejoon Lee > Add tests for options from `to_csv` and `from_csv`. > --- > > Key: SPARK-39035 > URL: https://issues.apache.org/jira/browse/SPARK-39035 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > > There are many supported options for to_csv and from_csv > (https://spark.apache.org/docs/latest/sql-data-sources-csv.html#data-source-option), > but they are currently not tested. > We should add tests for these options to > `sql/core/src/test/scala/org/apache/spark/sql/CsvFunctionsSuite.scala`.
[jira] [Resolved] (SPARK-39035) Add tests for options from `to_csv` and `from_csv`.
[ https://issues.apache.org/jira/browse/SPARK-39035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-39035. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 36401 [https://github.com/apache/spark/pull/36401] > Add tests for options from `to_csv` and `from_csv`. > --- > > Key: SPARK-39035 > URL: https://issues.apache.org/jira/browse/SPARK-39035 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > Fix For: 3.4.0 > > > There are many supported options for to_csv and from_csv > (https://spark.apache.org/docs/latest/sql-data-sources-csv.html#data-source-option), > but they are currently not tested. > We should add tests for these options to > `sql/core/src/test/scala/org/apache/spark/sql/CsvFunctionsSuite.scala`.
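One Spark-free way to sketch the option-coverage idea from SPARK-39035: for each option, craft an input that parses correctly only when the option is honored, then assert on the result. `toyParseCsv` below is a deliberately tiny, hypothetical parser invented for this sketch; the real tests would exercise `from_csv`/`to_csv` with an options map in CsvFunctionsSuite.

```scala
// Illustrative sketch of per-option CSV tests. toyParseCsv is a toy stand-in
// that only honors a `sep` option, defaulting to a comma.
object CsvOptionTestSketch {
  def toyParseCsv(line: String, options: Map[String, String]): Seq[String] = {
    val sep = options.getOrElse("sep", ",")
    // The -1 limit keeps trailing empty fields, as CSV parsers normally do.
    line.split(java.util.regex.Pattern.quote(sep), -1).toSeq
  }

  def main(args: Array[String]): Unit = {
    // Default separator.
    assert(toyParseCsv("a,b,c", Map.empty) == Seq("a", "b", "c"))
    // The `sep` option is honored: pipe-delimited input yields three fields.
    assert(toyParseCsv("a|b|c", Map("sep" -> "|")) == Seq("a", "b", "c"))
    // With the default separator the same input stays one field, which shows
    // this test really does exercise the option rather than the default path.
    assert(toyParseCsv("a|b|c", Map.empty) == Seq("a|b|c"))
    println("option checks passed")
  }
}
```

The key design point carries over to the real suite: each test's input must be chosen so that the assertion fails if the option under test is silently ignored.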