[jira] [Assigned] (SPARK-37938) Use error classes in the parsing errors of partitions
[ https://issues.apache.org/jira/browse/SPARK-37938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-37938:
------------------------------------

    Assignee: (was: Apache Spark)

> Use error classes in the parsing errors of partitions
> -----------------------------------------------------
>
>                 Key: SPARK-37938
>                 URL: https://issues.apache.org/jira/browse/SPARK-37938
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Max Gekk
>            Priority: Major
>
> Migrate the following errors in QueryParsingErrors:
> * emptyPartitionKeyError
> * partitionTransformNotExpectedError
> * descColumnForPartitionUnsupportedError
> * incompletePartitionSpecificationError
> to use error classes. Throw an implementation of SparkThrowable, and write
> a test for every error in QueryParsingErrorsSuite.

--
This message was sent by Atlassian Jira
(v8.20.7#820007)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
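SPARK-37938 asks throw sites to carry a stable error class instead of an ad-hoc message string. A rough, plain-Python illustration of that pattern follows; the registry contents, class names, and the `SparkThrowableLike` type are invented for illustration — Spark's real registry, SparkThrowable interface, and QueryParsingErrors live in the Spark codebase.

```python
# Illustrative sketch of the "error class" pattern: the exception carries a
# stable error-class name plus message parameters, and the human-readable
# text is looked up in one central registry (class names here are invented).

ERROR_CLASSES = {
    "EMPTY_PARTITION_KEY": "Partition key {key} must set value",
    "INCOMPLETE_PARTITION_SPEC": "Partition spec is incomplete: {spec}",
}

class SparkThrowableLike(Exception):
    """Mimics a throwable that exposes its error class and parameters."""
    def __init__(self, error_class, **params):
        self.error_class = error_class
        self.params = params
        super().__init__(ERROR_CLASSES[error_class].format(**params))

def empty_partition_key_error(key):
    """Factory in the style of a QueryParsingErrors helper (hypothetical)."""
    return SparkThrowableLike("EMPTY_PARTITION_KEY", key=key)

err = empty_partition_key_error("ds")
print(err.error_class)  # EMPTY_PARTITION_KEY
print(err)              # Partition key ds must set value
```

A per-error test can then assert on the stable error class rather than on brittle message text, which is the kind of check a suite like QueryParsingErrorsSuite can make.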
[jira] [Assigned] (SPARK-37938) Use error classes in the parsing errors of partitions
[ https://issues.apache.org/jira/browse/SPARK-37938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-37938:
------------------------------------

    Assignee: Apache Spark
[jira] [Commented] (SPARK-37938) Use error classes in the parsing errors of partitions
[ https://issues.apache.org/jira/browse/SPARK-37938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530322#comment-17530322 ]

Apache Spark commented on SPARK-37938:
--------------------------------------

User 'panbingkun' has created a pull request for this issue:
https://github.com/apache/spark/pull/36416
[jira] [Resolved] (SPARK-39048) Refactor `GroupBy._reduce_for_stat_function` on accepted data types
[ https://issues.apache.org/jira/browse/SPARK-39048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takuya Ueshin resolved SPARK-39048.
-----------------------------------
    Assignee: Xinrong Meng
  Resolution: Fixed

Issue resolved by pull request 36382
https://github.com/apache/spark/pull/36382

> Refactor `GroupBy._reduce_for_stat_function` on accepted data types
> -------------------------------------------------------------------
>
>                 Key: SPARK-39048
>                 URL: https://issues.apache.org/jira/browse/SPARK-39048
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 3.4.0
>            Reporter: Xinrong Meng
>            Assignee: Xinrong Meng
>            Priority: Major
>
> `GroupBy._reduce_for_stat_function` is a common helper function leveraged by
> multiple statistical functions of GroupBy objects.
> It defines the parameters `only_numeric` and `bool_as_numeric` to control the
> accepted Spark types.
> To stay consistent with the pandas API, we may also have to introduce
> `str_as_numeric` for `sum`, for example.
> Instead of introducing a parameter designated for each Spark type, the PR
> proposes a parameter `accepted_spark_types` that specifies the accepted types
> of the Spark columns to be aggregated.
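The `accepted_spark_types` idea from the ticket can be sketched in plain Python. This is a stand-in, not the real helper: the function name is borrowed from the ticket, but the actual implementation operates on Spark column types inside pandas-on-Spark's GroupBy internals.

```python
# Sketch of the refactor: instead of one boolean flag per type
# (only_numeric, bool_as_numeric, ...), a single `accepted_spark_types`
# parameter lists which column types an aggregation accepts. Plain Python
# types stand in for Spark data types here.

def reduce_for_stat_function(columns, sfun, accepted_spark_types=None):
    """Apply `sfun` to each column whose element type is accepted.

    columns: dict mapping column name -> list of values (uniform type)
    sfun: the per-column reduction, e.g. sum or min
    accepted_spark_types: tuple of accepted element types, or None for "any"
    """
    result = {}
    for name, values in columns.items():
        if (accepted_spark_types is not None and values
                and not isinstance(values[0], accepted_spark_types)):
            continue  # skip columns whose type the aggregation rejects
        result[name] = sfun(values)
    return result

data = {"age": [30, 40, 50], "name": ["a", "b", "c"]}
# `sum` accepts only numeric columns, so the string column is skipped:
print(reduce_for_stat_function(data, sum, accepted_spark_types=(int, float)))
```

The design point the ticket makes is that new accepted types (e.g. strings for `sum`) then need only a different tuple, not a new keyword parameter.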
[jira] [Assigned] (SPARK-39078) Support UPDATE commands with DEFAULT values
[ https://issues.apache.org/jira/browse/SPARK-39078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-39078:
------------------------------------

    Assignee: Apache Spark

> Support UPDATE commands with DEFAULT values
> -------------------------------------------
>
>                 Key: SPARK-39078
>                 URL: https://issues.apache.org/jira/browse/SPARK-39078
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: Daniel
>            Assignee: Apache Spark
>            Priority: Major
[jira] [Commented] (SPARK-39078) Support UPDATE commands with DEFAULT values
[ https://issues.apache.org/jira/browse/SPARK-39078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530276#comment-17530276 ]

Apache Spark commented on SPARK-39078:
--------------------------------------

User 'dtenedor' has created a pull request for this issue:
https://github.com/apache/spark/pull/36415
[jira] [Assigned] (SPARK-39078) Support UPDATE commands with DEFAULT values
[ https://issues.apache.org/jira/browse/SPARK-39078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-39078:
------------------------------------

    Assignee: (was: Apache Spark)
[jira] [Created] (SPARK-39078) Support UPDATE commands with DEFAULT values
Daniel created SPARK-39078:
---------------------------

             Summary: Support UPDATE commands with DEFAULT values
                 Key: SPARK-39078
                 URL: https://issues.apache.org/jira/browse/SPARK-39078
             Project: Spark
          Issue Type: Sub-task
          Components: SQL
    Affects Versions: 3.4.0
            Reporter: Daniel
[jira] [Assigned] (SPARK-39077) Implement `skipna` of basic statistical functions of DataFrame and Series
[ https://issues.apache.org/jira/browse/SPARK-39077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-39077:
------------------------------------

    Assignee: Apache Spark

> Implement `skipna` of basic statistical functions of DataFrame and Series
> --------------------------------------------------------------------------
>
>                 Key: SPARK-39077
>                 URL: https://issues.apache.org/jira/browse/SPARK-39077
>             Project: Spark
>          Issue Type: Sub-task
>          Components: PySpark
>    Affects Versions: 3.4.0
>            Reporter: Xinrong Meng
>            Assignee: Apache Spark
>            Priority: Major
>
> Implement `skipna` of basic statistical functions of DataFrame and Series
[jira] [Assigned] (SPARK-39077) Implement `skipna` of basic statistical functions of DataFrame and Series
[ https://issues.apache.org/jira/browse/SPARK-39077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-39077:
------------------------------------

    Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-39077) Implement `skipna` of basic statistical functions of DataFrame and Series
[ https://issues.apache.org/jira/browse/SPARK-39077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530275#comment-17530275 ]

Apache Spark commented on SPARK-39077:
--------------------------------------

User 'xinrong-databricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/36414
[jira] [Created] (SPARK-39077) Implement `skipna` of basic statistical functions of DataFrame and Series
Xinrong Meng created SPARK-39077:
--------------------------------

             Summary: Implement `skipna` of basic statistical functions of DataFrame and Series
                 Key: SPARK-39077
                 URL: https://issues.apache.org/jira/browse/SPARK-39077
             Project: Spark
          Issue Type: Sub-task
          Components: PySpark
    Affects Versions: 3.4.0
            Reporter: Xinrong Meng

Implement `skipna` of basic statistical functions of DataFrame and Series
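For context, the pandas-style `skipna` semantics requested by SPARK-39077 can be illustrated with a small self-contained sketch. This is plain Python over a list, not the Spark-backed implementation; `series_sum` is a hypothetical name used only for the example.

```python
# pandas-style `skipna` semantics: with skipna=True, missing values are
# excluded from the reduction; with skipna=False, a single missing value
# makes the whole result missing. None marks a missing value here.

def series_sum(values, skipna=True):
    """Sum a list of numbers where None marks a missing value."""
    if skipna:
        return sum(v for v in values if v is not None)
    if any(v is None for v in values):
        return None  # one missing value poisons the result
    return sum(values)

print(series_sum([1, 2, None, 4]))                # 7
print(series_sum([1, 2, None, 4], skipna=False))  # None
```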
[jira] [Created] (SPARK-39076) Standardize Statistical Functions of pandas API on Spark
Xinrong Meng created SPARK-39076:
--------------------------------

             Summary: Standardize Statistical Functions of pandas API on Spark
                 Key: SPARK-39076
                 URL: https://issues.apache.org/jira/browse/SPARK-39076
             Project: Spark
          Issue Type: Umbrella
          Components: PySpark
    Affects Versions: 3.4.0
            Reporter: Xinrong Meng

Statistical functions are among the most commonly used functions in data
engineering and data analysis. Spark and pandas provide statistical functions
in the contexts of SQL and data science, respectively.

pandas API on Spark implements the pandas API on top of Apache Spark. Although
certain functions (for example, median) may differ semantically because of the
high cost of big data computation, we should still try to reach parity at the
API level.

However, critical parameters of statistical functions, such as `skipna`, are
missing from the basic objects: DataFrame, Series, and Index. There is an even
larger gap between the statistical functions of pandas-on-Spark GroupBy
objects and those of pandas GroupBy objects. In addition, test coverage is far
from complete.

With statistical functions standardized, pandas API coverage will increase as
the missing parameters are implemented, which should further improve user
adoption.
[jira] [Updated] (SPARK-39012) SparkSQL parse partition value does not support all data types
[ https://issues.apache.org/jira/browse/SPARK-39012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rui Wang updated SPARK-39012:
-----------------------------
    Summary: SparkSQL parse partition value does not support all data types
             (was: SparkSQL infer schema does not support all data types)

> SparkSQL parse partition value does not support all data types
> --------------------------------------------------------------
>
>                 Key: SPARK-39012
>                 URL: https://issues.apache.org/jira/browse/SPARK-39012
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Rui Wang
>            Priority: Major
>
> When Spark needs to infer a schema, it must parse strings into typed values.
> Not all data types are supported on this path so far; for example, binary is
> known not to be supported. If a user has a binary column and does not use a
> metastore, SparkSQL can fall back to schema inference and then fail during
> the table scan. This is a bug, since schema inference is supported but some
> types are missing.
> A string might be converted to any type except ARRAY, MAP, STRUCT, etc. Also,
> when converting from a string, a smaller-scale type won't be identified if a
> larger-scale type matches first -- for example, short versus long.
> Based on the Spark SQL data types
> (https://spark.apache.org/docs/latest/sql-ref-datatypes.html), we can support
> the following types:
> BINARY
> BOOLEAN
> And there are two types that I am not sure SparkSQL supports on this path:
> YearMonthIntervalType
> DayTimeIntervalType
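The "parse a partition value string into a typed value" step the ticket discusses can be sketched as a chain of trial casts. This is illustrative only: the function name and the candidate order are assumptions for the example, not Spark's actual code path.

```python
# Candidate parsers are tried from narrowest to widest so that, e.g., "42"
# stays an integer rather than being promoted to a float; values no parser
# accepts fall through to plain strings. This mirrors the ordering concern
# in the ticket (a larger-scale type must not shadow a smaller-scale one).

def parse_partition_value(raw):
    """Try candidate types in order; return the first successful parse."""
    if raw.lower() in ("true", "false"):  # BOOLEAN support
        return raw.lower() == "true"
    for cast in (int, float):             # narrow before wide
        try:
            return cast(raw)
        except ValueError:
            pass
    return raw                            # fall back to string

print(parse_partition_value("true"))  # True
print(parse_partition_value("42"))    # 42
print(parse_partition_value("4.2"))   # 4.2
print(parse_partition_value("blob"))  # blob
```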
[jira] [Commented] (SPARK-39075) IncompatibleSchemaException when selecting data from table stored from a DataFrame in Avro format with BYTE/SHORT
[ https://issues.apache.org/jira/browse/SPARK-39075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530185#comment-17530185 ]

Erik Krogen commented on SPARK-39075:
-------------------------------------

I'm not sure of the history here, but this might be intentional. In AvroSerializer, going from a Catalyst {{ShortType}} to an Avro INT is an upcast, and therefore safe (a short will always fit inside of an int). But in AvroDeserializer, going from an Avro INT to a Catalyst {{ShortType}} is a downcast, so it's not a safe conversion. We can see that, similarly, we don't support double-to-float or long-to-int (though we also don't support float-to-double or int-to-long, for that matter).

I personally wouldn't have a concern with adding the int-to-short/byte conversion as long as overflow detection is handled gracefully.

cc [~cloud_fan] [~Gengliang.Wang]

> IncompatibleSchemaException when selecting data from table stored from a
> DataFrame in Avro format with BYTE/SHORT
> -------------------------------------------------------------------------
>
>                 Key: SPARK-39075
>                 URL: https://issues.apache.org/jira/browse/SPARK-39075
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.2.1
>            Reporter: xsys
>            Priority: Major
>
> h3. Describe the bug
> We are trying to save a table constructed through a DataFrame with the {{Avro}} data format. The table contains {{ByteType}} or {{ShortType}} as part of the schema.
> When we {{INSERT}} some valid values (e.g. {{-128}}) and {{SELECT}} from the table, we expect it to give back the inserted value. However, we instead get an {{IncompatibleSchemaException}} from the {{AvroDeserializer}}. This appears to be caused by a missing case statement handling the {{(INT, ShortType)}} and {{(INT, ByteType)}} cases in [{{AvroDeserializer newWriter}}|https://github.com/apache/spark/blob/4f25b3f71238a00508a356591553f2dfa89f8290/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala#L321].
> h3. To Reproduce
> On Spark 3.2.1 (commit {{4f25b3f712}}), using {{spark-shell}} with the Avro package:
> {code:java}
> ./bin/spark-shell --packages org.apache.spark:spark-avro_2.12:3.2.1{code}
> Execute the following:
> {code:java}
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types._
> val schema = new StructType().add(StructField("c1", ShortType, true))
> val rdd = sc.parallelize(Seq(Row("-128".toShort)))
> val df = spark.createDataFrame(rdd, schema)
> df.write.mode("overwrite").format("avro").saveAsTable("t0")
> spark.sql("select * from t0;").show(false){code}
> Resulting error:
> {code:java}
> 22/04/27 18:04:14 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 32)
> org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot convert Avro type {"type":"record","name":"topLevelRecord","fields":[{"name":"c1","type":["int","null"]}]} to SQL type STRUCT<`c1`: SMALLINT>.
>   at org.apache.spark.sql.avro.AvroDeserializer.liftedTree1$1(AvroDeserializer.scala:102)
>   at org.apache.spark.sql.avro.AvroDeserializer.<init>(AvroDeserializer.scala:74)
>   at org.apache.spark.sql.avro.AvroFileFormat$$anon$1.<init>(AvroFileFormat.scala:143)
>   at org.apache.spark.sql.avro.AvroFileFormat.$anonfun$buildReader$1(AvroFileFormat.scala:136)
>   at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:148)
>   at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:133)
>   at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
>   at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:187)
>   at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:104)
>   at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
>   at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
>   at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:349)
>   at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
>   at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
>   at org.apache.spark.sch
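The "overflow detection handled gracefully" requirement from the comment above can be sketched as a bounds check before narrowing. This is a plain-Python stand-in with an invented function name; the real change would be a Scala case added in AvroDeserializer's newWriter.

```python
# An Avro INT can only be narrowed to a Catalyst-style SHORT (or BYTE) if
# the value actually fits; instead of silently wrapping, the conversion
# raises on overflow.

SHORT_MIN, SHORT_MAX = -(2**15), 2**15 - 1

def int_to_short(value):
    """Narrow a 32-bit int to a 16-bit short, raising instead of wrapping."""
    if not (SHORT_MIN <= value <= SHORT_MAX):
        raise OverflowError(f"{value} does not fit in a 16-bit short")
    return value

print(int_to_short(-128))  # -128: fits, passes through unchanged
try:
    int_to_short(70000)    # would overflow a short
except OverflowError as e:
    print(e)
```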
[jira] [Updated] (SPARK-39075) IncompatibleSchemaException when selecting data from table stored from a DataFrame in Avro format with BYTE/SHORT
[ https://issues.apache.org/jira/browse/SPARK-39075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Erik Krogen updated SPARK-39075:
--------------------------------
[jira] [Updated] (SPARK-39075) IncompatibleSchemaException when selecting data from table stored from a DataFrame in Avro format with BYTE/SHORT
[ https://issues.apache.org/jira/browse/SPARK-39075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

xsys updated SPARK-39075:
-------------------------
[jira] [Updated] (SPARK-39075) IncompatibleSchemaException when selecting data from table stored from a DataFrame in Avro format with BYTE/SHORT
[ https://issues.apache.org/jira/browse/SPARK-39075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] xsys updated SPARK-39075: - Description: h3. Describe the bug We are trying to save a table constructed through a DataFrame with the {{Avro}} data format. The table contains {{ByteType}} or {{ShortType}} as part of the schema. When we {{INSERT}} some valid values (e.g. {{{}-128{}}}) and {{SELECT}} from the table, we expect it to give back the inserted value. However, we instead get an {{IncompatibleSchemaException}} from the {{{}AvroDeserializer{}}}. This appears to be caused by a missing case statement handling the {{(INT, ShortType)}} and {{(INT, ByteType)}} cases in [{{AvroDeserializer newWriter}}|https://github.com/apache/spark/blob/4f25b3f71238a00508a356591553f2dfa89f8290/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala#L321]. h3. To Reproduce On Spark 3.2.1 (commit {{{}4f25b3f712{}}}), using {{spark-shell}} with the Avro package: {code:java} ./bin/spark-shell --packages org.apache.spark:spark-avro_2.12:3.2.1{code} Execute the following: {code:java} import org.apache.spark.sql.Row import org.apache.spark.sql.types._ val schema = new StructType().add(StructField("c1", ShortType, true)) val rdd = sc.parallelize(Seq(Row("-128".toShort))) val df = spark.createDataFrame(rdd, schema) df.write.mode("overwrite").format("avro").saveAsTable("t0") spark.sql("select * from t0;").show(false){code} Resulting error: {code:java} 22/04/27 18:04:14 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 32) org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot convert Avro type {"type":"record","name":"topLevelRecord","fields":[ {"name":"c1","type":["int","null"]} ]} to SQL type STRUCT<`c1`: SMALLINT>. 
at org.apache.spark.sql.avro.AvroDeserializer.liftedTree1$1(AvroDeserializer.scala:102) at org.apache.spark.sql.avro.AvroDeserializer.(AvroDeserializer.scala:74) at org.apache.spark.sql.avro.AvroFileFormat$$anon$1.(AvroFileFormat.scala:143) at org.apache.spark.sql.avro.AvroFileFormat.$anonfun$buildReader$1(AvroFileFormat.scala:136) at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:148) at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:133) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:187) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:104) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759) at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:349) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:131) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462) at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: org.apache.spark.sql.avro.IncompatibleSchemaException: Cannot convert Avro field 'c1' to SQL field 'c1' because schema is incompatible (avroType = "int", sqlType = SMALLINT) at org.apache.spark.sql.avro.AvroDeserializer.newWriter(AvroDeserializer.scala:321) at org.apache.spark.sql.avro.AvroDeserializer.getRecordWriter(AvroDeserializer.scala:356) at org.apache.spark.sql.avro.AvroDeserializer.liftedTree1$1(AvroDeserializer.scala:84) ... 26 more {code} h3. Expected behavior & Possible Solution We expect the {{SELECT}} to return {{-128}} successfully. Other formats, such as Parquet, behave consistently with this expectation. In the [{{AvroSerializer newConverter}}|https://github.com/apache/spark/blob/4f25b3f71238a00508a356591553f2dfa89f8290/external/avro/src/main/scala/org/apache/spark/sql/avro/A
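A sketch of a possible fix follows, mirroring the existing integer case in {{AvroDeserializer.newWriter}}. This is illustrative only: the exact shape of the surrounding match and the {{setByte}}/{{setShort}} updater method names are assumptions, not taken verbatim from the Spark source.

{code:java}
// Hypothetical additional cases for the (avroType.getType, catalystType)
// match in AvroDeserializer.newWriter. AvroSerializer writes ByteType and
// ShortType values as Avro INT, so the deserializer would narrow the Int
// back down on read instead of throwing IncompatibleSchemaException.
case (INT, ByteType) => (updater, ordinal, value) =>
  updater.setByte(ordinal, value.asInstanceOf[Int].toByte)

case (INT, ShortType) => (updater, ordinal, value) =>
  updater.setShort(ordinal, value.asInstanceOf[Int].toShort)
{code}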
[jira] [Created] (SPARK-39075) IncompatibleSchemaException when selecting data from table stored from a DataFrame in Avro format with BYTE/SHORT
xsys created SPARK-39075: Summary: IncompatibleSchemaException when selecting data from table stored from a DataFrame in Avro format with BYTE/SHORT Key: SPARK-39075 URL: https://issues.apache.org/jira/browse/SPARK-39075 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.1 Reporter: xsys h3. Describe the bug We are trying to save a table constructed through a DataFrame with the {{Avro}} data format. The table contains {{ByteType}} or {{ShortType}} as part of the schema. When we {{INSERT}} a valid value (e.g. {{-128}}) and {{SELECT}} from the table, we expect to get back the inserted value. However, we instead get an {{IncompatibleSchemaException}} from the {{AvroDeserializer}}. This appears to be caused by a missing case statement handling the {{(INT, ShortType)}} and {{(INT, ByteType)}} cases in [{{AvroDeserializer newWriter}}|https://github.com/apache/spark/blob/4f25b3f71238a00508a356591553f2dfa89f8290/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala#L321]. h3. To Reproduce The reproduction steps and full stack trace are identical to those in the updated description above.
[jira] [Commented] (SPARK-25355) Support --proxy-user for Spark on K8s
[ https://issues.apache.org/jira/browse/SPARK-25355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530083#comment-17530083 ] pralabhkumar commented on SPARK-25355: -- [~hyukjin.kwon] Could you please help us, or redirect us to someone who can, with the above two comments? > Support --proxy-user for Spark on K8s > - > > Key: SPARK-25355 > URL: https://issues.apache.org/jira/browse/SPARK-25355 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes, Spark Core >Affects Versions: 3.1.0 >Reporter: Stavros Kontopoulos >Assignee: Pedro Rossi >Priority: Major > Fix For: 3.1.0 > > > SPARK-23257 adds kerberized hdfs support for Spark on K8s. A major addition > needed is the support for proxy user. A proxy user is impersonated by a > superuser who executes operations on behalf of the proxy user. More on this: > [https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/Superusers.html] > [https://github.com/spark-notebook/spark-notebook/blob/master/docs/proxyuser_impersonation.md] > This has been implemented for Yarn upstream and Spark on Mesos here: > [https://github.com/mesosphere/spark/pull/26] > [~ifilonenko] creating this issue according to our discussion. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39074) Fail on uploading test files, not when downloading them
[ https://issues.apache.org/jira/browse/SPARK-39074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530066#comment-17530066 ] Apache Spark commented on SPARK-39074: -- User 'EnricoMi' has created a pull request for this issue: https://github.com/apache/spark/pull/36413 > Fail on uploading test files, not when downloading them > --- > > Key: SPARK-39074 > URL: https://issues.apache.org/jira/browse/SPARK-39074 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.4.0 >Reporter: Enrico Minack >Priority: Major > > The CI workflow "Report test results" fails when there are no artifacts to be > downloaded from the triggering workflow. In some situations, the triggering > workflow is not skipped, but all test jobs are skipped because no code > changes are detected. In that situation, no test files are uploaded, which makes the triggered > workflow fail. > There are two possible reasons for no test files being downloaded: > 1. No tests have been executed, or no test files have been generated. > 2. No code has been built and tested, deliberately. > You want to be notified in the first situation so that you can fix the CI. Therefore, CI > should fail when code is built and tests are run but no test result files have been > found. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39074) Fail on uploading test files, not when downloading them
[ https://issues.apache.org/jira/browse/SPARK-39074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39074: Assignee: Apache Spark > Fail on uploading test files, not when downloading them > --- > > Key: SPARK-39074 > URL: https://issues.apache.org/jira/browse/SPARK-39074 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.4.0 >Reporter: Enrico Minack >Assignee: Apache Spark >Priority: Major > > The CI workflow "Report test results" fails when there are no artifacts to be > downloaded from the triggering workflow. In some situations, the triggering > workflow is not skipped, but all test jobs are skipped because no code > changes are detected. In that situation, no test files are uploaded, which makes the triggered > workflow fail. > There are two possible reasons for no test files being downloaded: > 1. No tests have been executed, or no test files have been generated. > 2. No code has been built and tested, deliberately. > You want to be notified in the first situation so that you can fix the CI. Therefore, CI > should fail when code is built and tests are run but no test result files have been > found. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39074) Fail on uploading test files, not when downloading them
[ https://issues.apache.org/jira/browse/SPARK-39074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39074: Assignee: (was: Apache Spark) > Fail on uploading test files, not when downloading them > --- > > Key: SPARK-39074 > URL: https://issues.apache.org/jira/browse/SPARK-39074 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.4.0 >Reporter: Enrico Minack >Priority: Major > > The CI workflow "Report test results" fails when there are no artifacts to be > downloaded from the triggering workflow. In some situations, the triggering > workflow is not skipped, but all test jobs are skipped because no code > changes are detected. In that situation, no test files are uploaded, which makes the triggered > workflow fail. > There are two possible reasons for no test files being downloaded: > 1. No tests have been executed, or no test files have been generated. > 2. No code has been built and tested, deliberately. > You want to be notified in the first situation so that you can fix the CI. Therefore, CI > should fail when code is built and tests are run but no test result files have been > found. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39074) Fail on uploading test files, not when downloading them
Enrico Minack created SPARK-39074: - Summary: Fail on uploading test files, not when downloading them Key: SPARK-39074 URL: https://issues.apache.org/jira/browse/SPARK-39074 Project: Spark Issue Type: Improvement Components: Project Infra Affects Versions: 3.4.0 Reporter: Enrico Minack The CI workflow "Report test results" fails when there are no artifacts to be downloaded from the triggering workflow. In some situations, the triggering workflow is not skipped, but all test jobs are skipped because no code changes are detected. In that situation, no test files are uploaded, which makes the triggered workflow fail. There are two possible reasons for no test files being downloaded: 1. No tests have been executed, or no test files have been generated. 2. No code has been built and tested, deliberately. You want to be notified in the first situation so that you can fix the CI. Therefore, CI should fail when code is built and tests are run but no test result files have been found. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
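The proposed behavior can be summarized as a simple decision rule (an illustrative sketch only; the names are assumptions, not taken from the actual workflow code):

{code:java}
// Fail CI only in reason 1: tests actually ran but produced no result files.
// Reason 2 (a deliberate skip where nothing was built or tested) should pass.
def shouldFailCi(testsWereRun: Boolean, resultFilesFound: Boolean): Boolean =
  testsWereRun && !resultFilesFound
{code}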
[jira] [Commented] (SPARK-39073) Keep rowCount after hive table partition pruning if table only have hive statistics
[ https://issues.apache.org/jira/browse/SPARK-39073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530055#comment-17530055 ] Apache Spark commented on SPARK-39073: -- User 'qiuliang988' has created a pull request for this issue: https://github.com/apache/spark/pull/36412 > Keep rowCount after hive table partition pruning if table only have hive > statistics > --- > > Key: SPARK-39073 > URL: https://issues.apache.org/jira/browse/SPARK-39073 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.1 >Reporter: qiuliang >Priority: Major > > If a partitioned table only has Hive-generated statistics, stored in its > partition properties, HiveTableRelation cannot obtain a rowCount because the > table-level statistics do not exist. We can generate a rowCount from the > statistics of the partitions that remain after pruning. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39073) Keep rowCount after hive table partition pruning if table only have hive statistics
[ https://issues.apache.org/jira/browse/SPARK-39073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39073: Assignee: (was: Apache Spark) > Keep rowCount after hive table partition pruning if table only have hive > statistics > --- > > Key: SPARK-39073 > URL: https://issues.apache.org/jira/browse/SPARK-39073 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.1 >Reporter: qiuliang >Priority: Major > > If a partitioned table only has Hive-generated statistics, stored in its > partition properties, HiveTableRelation cannot obtain a rowCount because the > table-level statistics do not exist. We can generate a rowCount from the > statistics of the partitions that remain after pruning. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39073) Keep rowCount after hive table partition pruning if table only have hive statistics
[ https://issues.apache.org/jira/browse/SPARK-39073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39073: Assignee: Apache Spark > Keep rowCount after hive table partition pruning if table only have hive > statistics > --- > > Key: SPARK-39073 > URL: https://issues.apache.org/jira/browse/SPARK-39073 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.1 >Reporter: qiuliang >Assignee: Apache Spark >Priority: Major > > If a partitioned table only has Hive-generated statistics, stored in its > partition properties, HiveTableRelation cannot obtain a rowCount because the > table-level statistics do not exist. We can generate a rowCount from the > statistics of the partitions that remain after pruning. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39073) Keep rowCount after hive table partition pruning if table only have hive statistics
[ https://issues.apache.org/jira/browse/SPARK-39073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530054#comment-17530054 ] Apache Spark commented on SPARK-39073: -- User 'qiuliang988' has created a pull request for this issue: https://github.com/apache/spark/pull/36412 > Keep rowCount after hive table partition pruning if table only have hive statistics > --- > > Key: SPARK-39073 > URL: https://issues.apache.org/jira/browse/SPARK-39073 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.1 >Reporter: qiuliang >Priority: Major > > If a partitioned table only has Hive-generated statistics, those statistics are stored in the partition properties, so HiveTableRelation cannot obtain a rowCount because table-level statistics do not exist. We can generate the rowCount from the statistics of the pruned partitions. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-25355) Support --proxy-user for Spark on K8s
[ https://issues.apache.org/jira/browse/SPARK-25355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530046#comment-17530046 ] Shrikant edited comment on SPARK-25355 at 4/29/22 3:13 PM: --- [~pedro.rossi] [~dongjoon] When we use the proxy-user parameter, access to Kerberized HDFS is not working for Spark on Kubernetes. It's because the proxy user doesn't have access to any delegation tokens. Was this functionality tested for Spark on Kubernetes when proxy-user support was added as part of this task? was (Author: JIRAUSER280449): [~pedro.rossi] [~dongjoon] When we use the proxy-user parameter, access to Kerberized HDFS is not working for Spark on Kubernetes. It's because the proxy user doesn't have access to any delegation tokens. Was this functionality tested for Spark on Kubernetes when this bug was fixed? > Support --proxy-user for Spark on K8s > - > > Key: SPARK-25355 > URL: https://issues.apache.org/jira/browse/SPARK-25355 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes, Spark Core >Affects Versions: 3.1.0 >Reporter: Stavros Kontopoulos >Assignee: Pedro Rossi >Priority: Major > Fix For: 3.1.0 > > > SPARK-23257 adds kerberized hdfs support for Spark on K8s. A major addition > needed is the support for proxy user. A proxy user is impersonated by a > superuser who executes operations on behalf of the proxy user. More on this: > [https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/Superusers.html] > [https://github.com/spark-notebook/spark-notebook/blob/master/docs/proxyuser_impersonation.md] > This has been implemented for Yarn upstream and Spark on Mesos here: > [https://github.com/mesosphere/spark/pull/26] > [~ifilonenko] creating this issue according to our discussion. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25355) Support --proxy-user for Spark on K8s
[ https://issues.apache.org/jira/browse/SPARK-25355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530046#comment-17530046 ] Shrikant commented on SPARK-25355: -- [~pedro.rossi] [~dongjoon] When we use the proxy-user parameter, access to Kerberized HDFS is not working for Spark on Kubernetes. It's because the proxy user doesn't have access to any delegation tokens. Was this functionality tested for Spark on Kubernetes when this bug was fixed? > Support --proxy-user for Spark on K8s > - > > Key: SPARK-25355 > URL: https://issues.apache.org/jira/browse/SPARK-25355 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes, Spark Core >Affects Versions: 3.1.0 >Reporter: Stavros Kontopoulos >Assignee: Pedro Rossi >Priority: Major > Fix For: 3.1.0 > > > SPARK-23257 adds kerberized hdfs support for Spark on K8s. A major addition > needed is the support for proxy user. A proxy user is impersonated by a > superuser who executes operations on behalf of the proxy user. More on this: > [https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/Superusers.html] > [https://github.com/spark-notebook/spark-notebook/blob/master/docs/proxyuser_impersonation.md] > This has been implemented for Yarn upstream and Spark on Mesos here: > [https://github.com/mesosphere/spark/pull/26] > [~ifilonenko] creating this issue according to our discussion. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39073) Keep rowCount after hive table partition pruning if table only have hive statistics
[ https://issues.apache.org/jira/browse/SPARK-39073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] qiuliang updated SPARK-39073: - Summary: Keep rowCount after hive table partition pruning if table only have hive statistics (was: keep rowCount after hive table partition pruning if table only have hive statistics) > Keep rowCount after hive table partition pruning if table only have hive statistics > --- > > Key: SPARK-39073 > URL: https://issues.apache.org/jira/browse/SPARK-39073 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.1 >Reporter: qiuliang >Priority: Major > > If a partitioned table only has Hive-generated statistics, those statistics are stored in the partition properties, so HiveTableRelation cannot obtain a rowCount because table-level statistics do not exist. We can generate the rowCount from the statistics of the pruned partitions. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39073) keep rowCount after hive table partition pruning if table only have hive statistics
[ https://issues.apache.org/jira/browse/SPARK-39073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] qiuliang updated SPARK-39073: - Summary: keep rowCount after hive table partition pruning if table only have hive statistics (was: keep rowCount after hive table partition pruning if only have hive statistics) > keep rowCount after hive table partition pruning if table only have hive statistics > --- > > Key: SPARK-39073 > URL: https://issues.apache.org/jira/browse/SPARK-39073 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.1 >Reporter: qiuliang >Priority: Major > > If a partitioned table only has Hive-generated statistics, those statistics are stored in the partition properties, so HiveTableRelation cannot obtain a rowCount because table-level statistics do not exist. We can generate the rowCount from the statistics of the pruned partitions. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39073) keep rowCount after hive table partition pruning if only have hive statistics
[ https://issues.apache.org/jira/browse/SPARK-39073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] qiuliang updated SPARK-39073: - Issue Type: Improvement (was: Bug) > keep rowCount after hive table partition pruning if only have hive statistics > - > > Key: SPARK-39073 > URL: https://issues.apache.org/jira/browse/SPARK-39073 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.1 >Reporter: qiuliang >Priority: Major > > If a partitioned table only has Hive-generated statistics, those statistics are stored in the partition properties, so HiveTableRelation cannot obtain a rowCount because table-level statistics do not exist. We can generate the rowCount from the statistics of the pruned partitions. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39073) keep rowCount after hive table partition pruning if only have hive statistics
qiuliang created SPARK-39073: Summary: keep rowCount after hive table partition pruning if only have hive statistics Key: SPARK-39073 URL: https://issues.apache.org/jira/browse/SPARK-39073 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.1 Reporter: qiuliang If a partitioned table only has Hive-generated statistics, those statistics are stored in the partition properties, so HiveTableRelation cannot obtain a rowCount because table-level statistics do not exist. We can generate the rowCount from the statistics of the pruned partitions. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
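The fix the ticket proposes, deriving a table-level rowCount from the per-partition statistics that survive pruning, can be sketched in plain Python. The dict-based partition shape and the "numRows" key below are illustrative stand-ins for Hive's partition-property statistics, not Spark's actual CatalogTablePartition API:

```python
def row_count_after_pruning(pruned_partitions):
    """Sum per-partition Hive row counts into a table-level rowCount.

    Each element models one partition's properties map, where Hive stores
    statistics such as 'numRows'. If any surviving partition is missing
    the statistic, no reliable table-level count can be derived.
    """
    counts = [p.get("numRows") for p in pruned_partitions]
    if any(c is None for c in counts):
        return None  # statistic unknown; caller falls back to size-based estimates
    return sum(counts)
```

With this, a query pruned down to two partitions reporting numRows of 1000 and 500 keeps rowCount = 1500 instead of losing the statistic entirely.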
[jira] [Commented] (SPARK-39072) Fast Fail the remaining push blocks if shuffle stage finalized
[ https://issues.apache.org/jira/browse/SPARK-39072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529992#comment-17529992 ] Apache Spark commented on SPARK-39072: -- User 'wankunde' has created a pull request for this issue: https://github.com/apache/spark/pull/36411 > Fast Fail the remaining push blocks if shuffle stage finalized > -- > > Key: SPARK-39072 > URL: https://issues.apache.org/jira/browse/SPARK-39072 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 3.3.0 >Reporter: Wan Kun >Priority: Minor > > Map task will try to push all map outputs to external shuffle service now. > After the shuffle stage is finalized, the reduce fetch blocks RPC will be > blocked if there are still many map output blocks in flight. > We could stop pushing the remaining blocks if the shuffle stage is finalized. > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39072) Fast Fail the remaining push blocks if shuffle stage finalized
[ https://issues.apache.org/jira/browse/SPARK-39072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39072: Assignee: (was: Apache Spark) > Fast Fail the remaining push blocks if shuffle stage finalized > -- > > Key: SPARK-39072 > URL: https://issues.apache.org/jira/browse/SPARK-39072 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 3.3.0 >Reporter: Wan Kun >Priority: Minor > > Map task will try to push all map outputs to external shuffle service now. > After the shuffle stage is finalized, the reduce fetch blocks RPC will be > blocked if there are still many map output blocks in flight. > We could stop pushing the remaining blocks if the shuffle stage is finalized. > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39072) Fast Fail the remaining push blocks if shuffle stage finalized
[ https://issues.apache.org/jira/browse/SPARK-39072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529990#comment-17529990 ] Apache Spark commented on SPARK-39072: -- User 'wankunde' has created a pull request for this issue: https://github.com/apache/spark/pull/36411 > Fast Fail the remaining push blocks if shuffle stage finalized > -- > > Key: SPARK-39072 > URL: https://issues.apache.org/jira/browse/SPARK-39072 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 3.3.0 >Reporter: Wan Kun >Priority: Minor > > Map task will try to push all map outputs to external shuffle service now. > After the shuffle stage is finalized, the reduce fetch blocks RPC will be > blocked if there are still many map output blocks in flight. > We could stop pushing the remaining blocks if the shuffle stage is finalized. > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39072) Fast Fail the remaining push blocks if shuffle stage finalized
[ https://issues.apache.org/jira/browse/SPARK-39072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39072: Assignee: Apache Spark > Fast Fail the remaining push blocks if shuffle stage finalized > -- > > Key: SPARK-39072 > URL: https://issues.apache.org/jira/browse/SPARK-39072 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 3.3.0 >Reporter: Wan Kun >Assignee: Apache Spark >Priority: Minor > > Map task will try to push all map outputs to external shuffle service now. > After the shuffle stage is finalized, the reduce fetch blocks RPC will be > blocked if there are still many map output blocks in flight. > We could stop pushing the remaining blocks if the shuffle stage is finalized. > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39069) Simplify another conditionals case in predicate
[ https://issues.apache.org/jira/browse/SPARK-39069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39069: Assignee: (was: Apache Spark) > Simplify another conditionals case in predicate > --- > > Key: SPARK-39069 > URL: https://issues.apache.org/jira/browse/SPARK-39069 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > > {code:scala} > sql( > """ > |CREATE TABLE t1 ( > | id DECIMAL(18,0), > | event_dt DATE, > | cmpgn_run_dt DATE) > |USING parquet > |PARTITIONED BY (cmpgn_run_dt) > """.stripMargin) > sql( > """ > |select count(*) > |from t1 > |where CMPGN_RUN_DT >= date_sub(EVENT_DT,2) and CMPGN_RUN_DT <= EVENT_DT > |and EVENT_DT ='2022-04-05' > |; > """.stripMargin).explain(true) > {code} > Expected: > {noformat} > == Optimized Logical Plan == > Aggregate [count(1) AS count(1)#4L] > +- Project >+- Filter (((isnotnull(CMPGN_RUN_DT#3) AND (CMPGN_RUN_DT#3 >= 2022-04-03)) > AND (CMPGN_RUN_DT#3 <= 2022-04-05)) AND (EVENT_DT#2 = 2022-04-05)) > +- Relation default.t1[id#1,event_dt#2,cmpgn_run_dt#3] parquet > == Physical Plan == > *(2) HashAggregate(keys=[], functions=[count(1)], output=[count(1)#4L]) > +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [id=#31] >+- *(1) HashAggregate(keys=[], functions=[partial_count(1)], > output=[count#7L]) > +- *(1) Project > +- *(1) Filter (EVENT_DT#2 = 2022-04-05) > +- *(1) ColumnarToRow >+- FileScan parquet default.t1[event_dt#2,cmpgn_run_dt#3] > Batched: true, DataFilters: [(event_dt#2 = 2022-04-05)], Format: Parquet, > Location: InMemoryFileIndex[], PartitionFilters: [isnotnull(cmpgn_run_dt#3), > (cmpgn_run_dt#3 >= 2022-04-03), (cmpgn_run_dt#3 <= 2022-04-05)], > PushedFilters: [EqualTo(event_dt,2022-04-05)], ReadSchema: > struct, UsedIndexes: [] > {noformat} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, 
e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39069) Simplify another conditionals case in predicate
[ https://issues.apache.org/jira/browse/SPARK-39069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39069: Assignee: Apache Spark > Simplify another conditionals case in predicate > --- > > Key: SPARK-39069 > URL: https://issues.apache.org/jira/browse/SPARK-39069 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > > {code:scala} > sql( > """ > |CREATE TABLE t1 ( > | id DECIMAL(18,0), > | event_dt DATE, > | cmpgn_run_dt DATE) > |USING parquet > |PARTITIONED BY (cmpgn_run_dt) > """.stripMargin) > sql( > """ > |select count(*) > |from t1 > |where CMPGN_RUN_DT >= date_sub(EVENT_DT,2) and CMPGN_RUN_DT <= EVENT_DT > |and EVENT_DT ='2022-04-05' > |; > """.stripMargin).explain(true) > {code} > Expected: > {noformat} > == Optimized Logical Plan == > Aggregate [count(1) AS count(1)#4L] > +- Project >+- Filter (((isnotnull(CMPGN_RUN_DT#3) AND (CMPGN_RUN_DT#3 >= 2022-04-03)) > AND (CMPGN_RUN_DT#3 <= 2022-04-05)) AND (EVENT_DT#2 = 2022-04-05)) > +- Relation default.t1[id#1,event_dt#2,cmpgn_run_dt#3] parquet > == Physical Plan == > *(2) HashAggregate(keys=[], functions=[count(1)], output=[count(1)#4L]) > +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [id=#31] >+- *(1) HashAggregate(keys=[], functions=[partial_count(1)], > output=[count#7L]) > +- *(1) Project > +- *(1) Filter (EVENT_DT#2 = 2022-04-05) > +- *(1) ColumnarToRow >+- FileScan parquet default.t1[event_dt#2,cmpgn_run_dt#3] > Batched: true, DataFilters: [(event_dt#2 = 2022-04-05)], Format: Parquet, > Location: InMemoryFileIndex[], PartitionFilters: [isnotnull(cmpgn_run_dt#3), > (cmpgn_run_dt#3 >= 2022-04-03), (cmpgn_run_dt#3 <= 2022-04-05)], > PushedFilters: [EqualTo(event_dt,2022-04-05)], ReadSchema: > struct, UsedIndexes: [] > {noformat} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For 
additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39069) Simplify another conditionals case in predicate
[ https://issues.apache.org/jira/browse/SPARK-39069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529984#comment-17529984 ] Apache Spark commented on SPARK-39069: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/36410 > Simplify another conditionals case in predicate > --- > > Key: SPARK-39069 > URL: https://issues.apache.org/jira/browse/SPARK-39069 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > > {code:scala} > sql( > """ > |CREATE TABLE t1 ( > | id DECIMAL(18,0), > | event_dt DATE, > | cmpgn_run_dt DATE) > |USING parquet > |PARTITIONED BY (cmpgn_run_dt) > """.stripMargin) > sql( > """ > |select count(*) > |from t1 > |where CMPGN_RUN_DT >= date_sub(EVENT_DT,2) and CMPGN_RUN_DT <= EVENT_DT > |and EVENT_DT ='2022-04-05' > |; > """.stripMargin).explain(true) > {code} > Expected: > {noformat} > == Optimized Logical Plan == > Aggregate [count(1) AS count(1)#4L] > +- Project >+- Filter (((isnotnull(CMPGN_RUN_DT#3) AND (CMPGN_RUN_DT#3 >= 2022-04-03)) > AND (CMPGN_RUN_DT#3 <= 2022-04-05)) AND (EVENT_DT#2 = 2022-04-05)) > +- Relation default.t1[id#1,event_dt#2,cmpgn_run_dt#3] parquet > == Physical Plan == > *(2) HashAggregate(keys=[], functions=[count(1)], output=[count(1)#4L]) > +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [id=#31] >+- *(1) HashAggregate(keys=[], functions=[partial_count(1)], > output=[count#7L]) > +- *(1) Project > +- *(1) Filter (EVENT_DT#2 = 2022-04-05) > +- *(1) ColumnarToRow >+- FileScan parquet default.t1[event_dt#2,cmpgn_run_dt#3] > Batched: true, DataFilters: [(event_dt#2 = 2022-04-05)], Format: Parquet, > Location: InMemoryFileIndex[], PartitionFilters: [isnotnull(cmpgn_run_dt#3), > (cmpgn_run_dt#3 >= 2022-04-03), (cmpgn_run_dt#3 <= 2022-04-05)], > PushedFilters: [EqualTo(event_dt,2022-04-05)], ReadSchema: > struct, UsedIndexes: [] > {noformat} -- This message was sent by 
Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
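The simplification SPARK-39069 asks for can be checked by hand: once EVENT_DT is pinned by the equality predicate to the literal '2022-04-05', the two EVENT_DT-dependent bounds on CMPGN_RUN_DT collapse to constants, which is what lets them show up as PartitionFilters in the expected plan. A small Python sketch of that constant folding (illustrative only, not Catalyst code):

```python
from datetime import date, timedelta

def fold_partition_bounds(event_dt):
    # CMPGN_RUN_DT >= date_sub(EVENT_DT, 2) AND CMPGN_RUN_DT <= EVENT_DT,
    # with EVENT_DT fixed by an equality predicate, becomes a constant range.
    lower = event_dt - timedelta(days=2)  # date_sub(EVENT_DT, 2)
    upper = event_dt
    return lower, upper
```

For EVENT_DT = 2022-04-05 this yields the bounds 2022-04-03 and 2022-04-05 that appear in the optimized plan's PartitionFilters.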
[jira] [Created] (SPARK-39072) Fast Fail the remaining push blocks if shuffle stage finalized
Wan Kun created SPARK-39072: --- Summary: Fast Fail the remaining push blocks if shuffle stage finalized Key: SPARK-39072 URL: https://issues.apache.org/jira/browse/SPARK-39072 Project: Spark Issue Type: Improvement Components: Shuffle Affects Versions: 3.3.0 Reporter: Wan Kun Map tasks currently try to push all map outputs to the external shuffle service. After the shuffle stage is finalized, the reduce-side fetch-blocks RPC can be blocked if there are still many map output blocks in flight. We can stop pushing the remaining blocks once the shuffle stage is finalized. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
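The behavior SPARK-39072 proposes, stop pushing once the stage is finalized rather than letting doomed pushes queue up, can be sketched as follows. The class and method names are hypothetical, not Spark's actual ShuffleBlockPusher API:

```python
import threading

class BlockPusher:
    """Pushes map output blocks until the shuffle stage is finalized."""

    def __init__(self):
        self._finalized = threading.Event()

    def finalize(self):
        # Called when the shuffle stage is finalized; later pushes can no
        # longer be merged by the shuffle service, so they are wasted work.
        self._finalized.set()

    def push_all(self, blocks, push_fn):
        pushed = 0
        for block in blocks:
            if self._finalized.is_set():
                break  # fast-fail: skip the remaining blocks
            push_fn(block)
            pushed += 1
        return pushed
```

If finalization happens mid-push, the loop drops the remaining blocks instead of issuing RPCs that would only delay reduce-side fetches.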
[jira] [Commented] (SPARK-38729) Test the error class: FAILED_SET_ORIGINAL_PERMISSION_BACK
[ https://issues.apache.org/jira/browse/SPARK-38729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529961#comment-17529961 ] Apache Spark commented on SPARK-38729: -- User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/36409 > Test the error class: FAILED_SET_ORIGINAL_PERMISSION_BACK > - > > Key: SPARK-38729 > URL: https://issues.apache.org/jira/browse/SPARK-38729 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Minor > Labels: starter > > Add at least one test for the error class > *FAILED_SET_ORIGINAL_PERMISSION_BACK* to QueryExecutionErrorsSuite. The test > should cover the exception throw in QueryExecutionErrors: > {code:scala} > def failToSetOriginalPermissionBackError( > permission: FsPermission, > path: Path, > e: Throwable): Throwable = { > new SparkSecurityException(errorClass = > "FAILED_SET_ORIGINAL_PERMISSION_BACK", > Array(permission.toString, path.toString, e.getMessage)) > } > {code} > For example, here is a test for the error class *UNSUPPORTED_FEATURE*: > https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170 > +The test must have a check of:+ > # the entire error message > # sqlState if it is defined in the error-classes.json file > # the error class -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38729) Test the error class: FAILED_SET_ORIGINAL_PERMISSION_BACK
[ https://issues.apache.org/jira/browse/SPARK-38729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529960#comment-17529960 ] Apache Spark commented on SPARK-38729: -- User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/36409 > Test the error class: FAILED_SET_ORIGINAL_PERMISSION_BACK > - > > Key: SPARK-38729 > URL: https://issues.apache.org/jira/browse/SPARK-38729 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Minor > Labels: starter > > Add at least one test for the error class > *FAILED_SET_ORIGINAL_PERMISSION_BACK* to QueryExecutionErrorsSuite. The test > should cover the exception throw in QueryExecutionErrors: > {code:scala} > def failToSetOriginalPermissionBackError( > permission: FsPermission, > path: Path, > e: Throwable): Throwable = { > new SparkSecurityException(errorClass = > "FAILED_SET_ORIGINAL_PERMISSION_BACK", > Array(permission.toString, path.toString, e.getMessage)) > } > {code} > For example, here is a test for the error class *UNSUPPORTED_FEATURE*: > https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170 > +The test must have a check of:+ > # the entire error message > # sqlState if it is defined in the error-classes.json file > # the error class -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38729) Test the error class: FAILED_SET_ORIGINAL_PERMISSION_BACK
[ https://issues.apache.org/jira/browse/SPARK-38729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38729: Assignee: (was: Apache Spark) > Test the error class: FAILED_SET_ORIGINAL_PERMISSION_BACK > - > > Key: SPARK-38729 > URL: https://issues.apache.org/jira/browse/SPARK-38729 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Minor > Labels: starter > > Add at least one test for the error class > *FAILED_SET_ORIGINAL_PERMISSION_BACK* to QueryExecutionErrorsSuite. The test > should cover the exception throw in QueryExecutionErrors: > {code:scala} > def failToSetOriginalPermissionBackError( > permission: FsPermission, > path: Path, > e: Throwable): Throwable = { > new SparkSecurityException(errorClass = > "FAILED_SET_ORIGINAL_PERMISSION_BACK", > Array(permission.toString, path.toString, e.getMessage)) > } > {code} > For example, here is a test for the error class *UNSUPPORTED_FEATURE*: > https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170 > +The test must have a check of:+ > # the entire error message > # sqlState if it is defined in the error-classes.json file > # the error class -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-38729) Test the error class: FAILED_SET_ORIGINAL_PERMISSION_BACK
[ https://issues.apache.org/jira/browse/SPARK-38729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38729: Assignee: Apache Spark > Test the error class: FAILED_SET_ORIGINAL_PERMISSION_BACK > - > > Key: SPARK-38729 > URL: https://issues.apache.org/jira/browse/SPARK-38729 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Minor > Labels: starter > > Add at least one test for the error class > *FAILED_SET_ORIGINAL_PERMISSION_BACK* to QueryExecutionErrorsSuite. The test > should cover the exception throw in QueryExecutionErrors: > {code:scala} > def failToSetOriginalPermissionBackError( > permission: FsPermission, > path: Path, > e: Throwable): Throwable = { > new SparkSecurityException(errorClass = > "FAILED_SET_ORIGINAL_PERMISSION_BACK", > Array(permission.toString, path.toString, e.getMessage)) > } > {code} > For example, here is a test for the error class *UNSUPPORTED_FEATURE*: > https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170 > +The test must have a check of:+ > # the entire error message > # sqlState if it is defined in the error-classes.json file > # the error class -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
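The ticket spells out three checks every such test must make: the error class, the sqlState (when defined in error-classes.json), and the entire message. A generic plain-Python helper mirroring that shape (the dict layout and the sample message are illustrative, not Spark's SparkThrowable API):

```python
def check_error(error, error_class, sql_state, message):
    """Assert the three properties SPARK-38729 requires a test to cover."""
    assert error["errorClass"] == error_class
    assert error["sqlState"] == sql_state  # only meaningful when defined in error-classes.json
    assert error["message"] == message     # the entire message, not a substring
    return True
```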
[jira] [Updated] (SPARK-39070) Pivoting a dataframe after two explodes raises an org.codehaus.commons.compiler.CompileException
[ https://issues.apache.org/jira/browse/SPARK-39070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Altenberg updated SPARK-39070: Component/s: PySpark > Pivoting a dataframe after two explodes raises an > org.codehaus.commons.compiler.CompileException > > > Key: SPARK-39070 > URL: https://issues.apache.org/jira/browse/SPARK-39070 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.2.0, 3.2.1 > Environment: Tested on: > MacOS 12.3.1, python versions 3.8.10, 3.9.12 > Ubuntu 20.04, python version 3.8.10 > Tested spark versions 3.1.2, 3.1.3, 3.2.1 and 3.2.0 installed through conda > and pip >Reporter: Felix Altenberg >Priority: Major > Attachments: spark_pivot_error.txt > > > The following Code raises an exception starting in spark version 3.2.0 > {code:java} > import pyspark.sql.functions as sf > from pyspark.sql import Row > data = [ > Row( > other_value=10, > ships=[ > Row( > fields=[ > Row(name="field1", value=1), > Row(name="field2", value=2), > ] > ), > Row( > fields=[ > Row(name="field1", value=3), > Row(name="field3", value=4), > ] > ), > ], > ) > ] > df = spark.createDataFrame(data) > df = df.withColumn("ships", sf.explode("ships")).select( > sf.col("other_value"), sf.col("ships.*") > ) > df = df.withColumn("fields", sf.explode("fields")).select("other_value", > "fields.*") > df = df.groupBy("other_value").pivot("name") #ERROR THROWN HERE {code} > The "pivot("name")" is what causes the error to be thrown. > Having even one less explode (by manually transforming the input dataframe to > what it would look like after the first explode) causes this error not to > appear. > Adding a .cache() before the groupBy and pivot also causes this error to > dissapear. > Strangely, this error does not appear when running the above code in > Databricks using Databricks Runtime 10.4 which is also using spark version > 3.2.1 > I have attached the full stacktrace of the error below. 
> I will try to investigate more after work today and will add anything else I > find. > The exception raised is > {noformat} > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 61, Column 78: A method named "numElements" is not declared in any enclosing > class nor any supertype, nor through a static import{noformat} > but seems to be caused by this exception > {noformat} > Caused by: java.lang.ClassCastException: > org.apache.spark.sql.catalyst.expressions.UnsafeArrayData cannot be cast to > org.apache.spark.sql.catalyst.InternalRow{noformat} > The following stackoverflow question seems to pertain to a very similar error > [https://stackoverflow.com/questions/70356375/pyspark-pivot-unsafearraydata-cannot-be-cast-to-internalrow] -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39070) Pivoting a dataframe after two explodes raises an org.codehaus.commons.compiler.CompileException
[ https://issues.apache.org/jira/browse/SPARK-39070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Altenberg updated SPARK-39070: Description: The following Code raises an exception starting in spark version 3.2.0 {code:java} import pyspark.sql.functions as sf from pyspark.sql import Row data = [ Row( other_value=10, ships=[ Row( fields=[ Row(name="field1", value=1), Row(name="field2", value=2), ] ), Row( fields=[ Row(name="field1", value=3), Row(name="field3", value=4), ] ), ], ) ] df = spark.createDataFrame(data) df = df.withColumn("ships", sf.explode("ships")).select( sf.col("other_value"), sf.col("ships.*") ) df = df.withColumn("fields", sf.explode("fields")).select("other_value", "fields.*") df = df.groupBy("other_value").pivot("name") #ERROR THROWN HERE {code} The "pivot("name")" is what causes the error to be thrown. Having even one less explode (by manually transforming the input dataframe to what it would look like after the first explode) causes this error not to appear. Adding a .cache() before the groupBy and pivot also causes this error to dissapear. Strangely, this error does not appear when running the above code in Databricks using Databricks Runtime 10.4 which is also using spark version 3.2.1 I have attached the full stacktrace of the error below. I will try to investigate more after work today and will add anything else I find. 
The exception raised is {noformat} org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 61, Column 78: A method named "numElements" is not declared in any enclosing class nor any supertype, nor through a static import{noformat} but seems to be caused by this exception {noformat} Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.UnsafeArrayData cannot be cast to org.apache.spark.sql.catalyst.InternalRow{noformat} The following Stack Overflow question seems to pertain to a very similar error [https://stackoverflow.com/questions/70356375/pyspark-pivot-unsafearraydata-cannot-be-cast-to-internalrow] was: The following code raises an exception starting in Spark version 3.2.0 {code:java} import pyspark.sql.functions as sf from pyspark.sql import Row data = [ Row( other_value=10, ships=[ Row( fields=[ Row(name="field1", value=1), Row(name="field2", value=2), ] ), Row( fields=[ Row(name="field1", value=3), Row(name="field3", value=4), ] ), ], ) ] df = spark.createDataFrame(data) df = df.withColumn("ships", sf.explode("ships")).select( sf.col("other_value"), sf.col("ships.*") ) df = df.withColumn("fields", sf.explode("fields")).select("other_value", "fields.*") df = df.groupBy("other_value").pivot("name") #ERROR THROWN HERE {code} The "pivot("name")" call is what causes the error to be thrown. Having even one fewer explode (by manually transforming the input dataframe to what it would look like after the first explode) causes this error not to appear. Adding a .cache() before the groupBy and pivot also causes this error to disappear. Strangely, this error does not appear when running the above code in Databricks using Databricks Runtime 10.4, which also uses Spark version 3.2.1. I have attached the full stack trace of the error below. I will try to investigate more after work today and will add anything else I find. 
The following stackoverflow question seems to pertain to a very similar error [https://stackoverflow.com/questions/70356375/pyspark-pivot-unsafearraydata-cannot-be-cast-to-internalrow] > Pivoting a dataframe after two explodes raises an > org.codehaus.commons.compiler.CompileException > > > Key: SPARK-39070 > URL: https://issues.apache.org/jira/browse/SPARK-39070 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.2.1 > Environment: Tested on: > MacOS 12.3.1, python versions 3.8.10, 3.9.12 > Ubuntu 20.04, python version 3.8.10 > Tested spark versions 3.1.2, 3.1.3, 3.2.1 and 3.2.0 installed through conda > and pip >Reporter: Felix Altenberg >Priority: Major > Attachments: spark_pivot_error.txt > > > The following Code raises an exception starting in spark version 3.2.0 > {code:java} > import pyspark.sql.functions as sf > from pyspark.sql import Row > data = [ > Row( > other_value=10, > ships=[ > Ro
[jira] [Updated] (SPARK-39070) Pivoting a dataframe after two explodes raises an org.codehaus.commons.compiler.CompileException
[ https://issues.apache.org/jira/browse/SPARK-39070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Altenberg updated SPARK-39070: Description: The following code raises an exception starting in Spark version 3.2.0 {code:java} import pyspark.sql.functions as sf from pyspark.sql import Row data = [ Row( other_value=10, ships=[ Row( fields=[ Row(name="field1", value=1), Row(name="field2", value=2), ] ), Row( fields=[ Row(name="field1", value=3), Row(name="field3", value=4), ] ), ], ) ] df = spark.createDataFrame(data) df = df.withColumn("ships", sf.explode("ships")).select( sf.col("other_value"), sf.col("ships.*") ) df = df.withColumn("fields", sf.explode("fields")).select("other_value", "fields.*") df = df.groupBy("other_value").pivot("name") #ERROR THROWN HERE {code} The "pivot("name")" call is what causes the error to be thrown. Having even one fewer explode (by manually transforming the input dataframe to what it would look like after the first explode) causes this error not to appear. Adding a .cache() before the groupBy and pivot also causes this error to disappear. Strangely, this error does not appear when running the above code in Databricks using Databricks Runtime 10.4, which also uses Spark version 3.2.1. I have attached the full stack trace of the error below. I will try to investigate more after work today and will add anything else I find. 
The following Stack Overflow question seems to pertain to a very similar error [https://stackoverflow.com/questions/70356375/pyspark-pivot-unsafearraydata-cannot-be-cast-to-internalrow] was: The following code raises an exception starting in Spark version 3.2.0 {code:java} import pyspark.sql.functions as sf from pyspark.sql import Row data = [ Row( other_value=10, ships=[ Row( fields=[ Row(name="field1", value=1), Row(name="field2", value=2), ] ), Row( fields=[ Row(name="field1", value=3), Row(name="field3", value=4), ] ), ], ) ] df = spark.createDataFrame(data) df = df.withColumn("ships", sf.explode("ships")).select( sf.col("other_value"), sf.col("ships.*") ) df = df.withColumn("fields", sf.explode("fields")).select("other_value", "fields.*") df = df.groupBy("other_value").pivot("name") #ERROR THROWN HERE {code} The "pivot("name")" call is what causes the error to be thrown. Having even one fewer explode (by manually transforming the input dataframe to what it would look like after the first explode) causes this error not to appear. Adding a .cache() before the groupBy and pivot also causes this error to disappear. Strangely, this error does not appear when running the above code in Databricks using Databricks Runtime 10.4, which also uses Spark version 3.2.1. I have attached the full stack trace of the error below. I will try to investigate more after work today and will add anything else I find. 
> Pivoting a dataframe after two explodes raises an > org.codehaus.commons.compiler.CompileException > > > Key: SPARK-39070 > URL: https://issues.apache.org/jira/browse/SPARK-39070 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.2.1 > Environment: Tested on: > MacOS 12.3.1, python versions 3.8.10, 3.9.12 > Ubuntu 20.04, python version 3.8.10 > Tested spark versions 3.1.2, 3.1.3, 3.2.1 and 3.2.0 installed through conda > and pip >Reporter: Felix Altenberg >Priority: Major > Attachments: spark_pivot_error.txt > > > The following Code raises an exception starting in spark version 3.2.0 > {code:java} > import pyspark.sql.functions as sf > from pyspark.sql import Row > data = [ > Row( > other_value=10, > ships=[ > Row( > fields=[ > Row(name="field1", value=1), > Row(name="field2", value=2), > ] > ), > Row( > fields=[ > Row(name="field1", value=3), > Row(name="field3", value=4), > ] > ), > ], > ) > ] > df = spark.createDataFrame(data) > df = df.withColumn("ships", sf.explode("ships")).select( > sf.col("other_value"), sf.col("ships.*") > ) > df = df.withColumn("fields", sf.explode("fields")).select("other_value", > "fields.*") > df = df.groupBy("other_value").pivot("na
[jira] [Updated] (SPARK-39070) Pivoting a dataframe after two explodes raises an org.codehaus.commons.compiler.CompileException
[ https://issues.apache.org/jira/browse/SPARK-39070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Altenberg updated SPARK-39070: Description: The following code raises an exception starting in Spark version 3.2.0 {code:java} import pyspark.sql.functions as sf from pyspark.sql import Row data = [ Row( other_value=10, ships=[ Row( fields=[ Row(name="field1", value=1), Row(name="field2", value=2), ] ), Row( fields=[ Row(name="field1", value=3), Row(name="field3", value=4), ] ), ], ) ] df = spark.createDataFrame(data) df = df.withColumn("ships", sf.explode("ships")).select( sf.col("other_value"), sf.col("ships.*") ) df = df.withColumn("fields", sf.explode("fields")).select("other_value", "fields.*") df = df.groupBy("other_value").pivot("name") #ERROR THROWN HERE {code} The "pivot("name")" call is what causes the error to be thrown. Having even one fewer explode (by manually transforming the input dataframe to what it would look like after the first explode) causes this error not to appear. Adding a .cache() before the groupBy and pivot also causes this error to disappear. Strangely, this error does not appear when running the above code in Databricks using Databricks Runtime 10.4, which also uses Spark version 3.2.1. I have attached the full stack trace of the error below. I will try to investigate more after work today and will add anything else I find. 
was: The following code raises an exception starting in Spark version 3.2.0 {code:java} import pyspark.sql.functions as sf from pyspark.sql import Row data = [ Row( other_value=10, ships=[ Row( fields=[ Row(name="field1", value=1), Row(name="field2", value=2), ] ), Row( fields=[ Row(name="field1", value=3), Row(name="field3", value=4), ] ), ], ) ] df = spark.createDataFrame(data) df = df.withColumn("ships", sf.explode("ships")).select( sf.col("other_value"), sf.col("ships.*") ) df = df.withColumn("fields", sf.explode("fields")).select("other_value", "fields.*") df = df.groupBy("other_value").pivot("name") #ERROR THROWN HERE {code} The "pivot("name")" call is what causes the error to be thrown. Having even one fewer explode (by manually transforming the input dataframe to what it would look like after the first explode) causes this error not to appear. Adding a .cache() before the groupBy and pivot also causes this error to disappear. I have attached the full stack trace of the error below. I will try to investigate more after work today and will add anything else I find. 
> Pivoting a dataframe after two explodes raises an > org.codehaus.commons.compiler.CompileException > > > Key: SPARK-39070 > URL: https://issues.apache.org/jira/browse/SPARK-39070 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.2.1 > Environment: Tested on: > MacOS 12.3.1, python versions 3.8.10, 3.9.12 > Ubuntu 20.04, python version 3.8.10 > Tested spark versions 3.1.2, 3.1.3, 3.2.1 and 3.2.0 installed through conda > and pip >Reporter: Felix Altenberg >Priority: Major > Attachments: spark_pivot_error.txt > > > The following Code raises an exception starting in spark version 3.2.0 > {code:java} > import pyspark.sql.functions as sf > from pyspark.sql import Row > data = [ > Row( > other_value=10, > ships=[ > Row( > fields=[ > Row(name="field1", value=1), > Row(name="field2", value=2), > ] > ), > Row( > fields=[ > Row(name="field1", value=3), > Row(name="field3", value=4), > ] > ), > ], > ) > ] > df = spark.createDataFrame(data) > df = df.withColumn("ships", sf.explode("ships")).select( > sf.col("other_value"), sf.col("ships.*") > ) > df = df.withColumn("fields", sf.explode("fields")).select("other_value", > "fields.*") > df = df.groupBy("other_value").pivot("name") #ERROR THROWN HERE {code} > The "pivot("name")" is what causes the error to be thrown. > Having even one less explode (by manually transforming the input dataframe to > what it would look like after the first explode) causes this error not to > appear. > Adding a .cache() before the groupBy and pivot also causes this error
[jira] [Commented] (SPARK-39071) Add unwrap_udt function for unwrapping UserDefinedType columns
[ https://issues.apache.org/jira/browse/SPARK-39071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529933#comment-17529933 ] Apache Spark commented on SPARK-39071: -- User 'WeichenXu123' has created a pull request for this issue: https://github.com/apache/spark/pull/36408 > Add unwrap_udt function for unwrapping UserDefinedType columns > -- > > Key: SPARK-39071 > URL: https://issues.apache.org/jira/browse/SPARK-39071 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.4.0 >Reporter: Weichen Xu >Priority: Major > > Add unwrap_udt function for unwrapping UserDefinedType columns. > This is useful in the open-source project https://github.com/mengxr/pyspark-xgboost -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39071) Add unwrap_udt function for unwrapping UserDefinedType columns
[ https://issues.apache.org/jira/browse/SPARK-39071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39071: Assignee: (was: Apache Spark) > Add unwrap_udt function for unwrapping UserDefinedType columns > -- > > Key: SPARK-39071 > URL: https://issues.apache.org/jira/browse/SPARK-39071 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.4.0 >Reporter: Weichen Xu >Priority: Major > > Add unwrap_udt function for unwrapping UserDefinedType columns. > This is useful in the open-source project https://github.com/mengxr/pyspark-xgboost -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39071) Add unwrap_udt function for unwrapping UserDefinedType columns
[ https://issues.apache.org/jira/browse/SPARK-39071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39071: Assignee: Apache Spark > Add unwrap_udt function for unwrapping UserDefinedType columns > -- > > Key: SPARK-39071 > URL: https://issues.apache.org/jira/browse/SPARK-39071 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.4.0 >Reporter: Weichen Xu >Assignee: Apache Spark >Priority: Major > > Add unwrap_udt function for unwrapping UserDefinedType columns. > This is useful in the open-source project https://github.com/mengxr/pyspark-xgboost -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39071) Add unwrap_udt function for unwrapping UserDefinedType columns
[ https://issues.apache.org/jira/browse/SPARK-39071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529932#comment-17529932 ] Apache Spark commented on SPARK-39071: -- User 'WeichenXu123' has created a pull request for this issue: https://github.com/apache/spark/pull/36408 > Add unwrap_udt function for unwrapping UserDefinedType columns > -- > > Key: SPARK-39071 > URL: https://issues.apache.org/jira/browse/SPARK-39071 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.4.0 >Reporter: Weichen Xu >Priority: Major > > Add unwrap_udt function for unwrapping UserDefinedType columns. > This is useful in the open-source project https://github.com/mengxr/pyspark-xgboost -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39071) Add unwrap_udt function for unwrapping UserDefinedType columns
Weichen Xu created SPARK-39071: -- Summary: Add unwrap_udt function for unwrapping UserDefinedType columns Key: SPARK-39071 URL: https://issues.apache.org/jira/browse/SPARK-39071 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.4.0 Reporter: Weichen Xu Add unwrap_udt function for unwrapping UserDefinedType columns. This is useful in the open-source project https://github.com/mengxr/pyspark-xgboost -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
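As a conceptual illustration only: a UserDefinedType persists objects in an internal struct form, and "unwrapping" such a column means exposing that internal representation instead of the deserialized object, which is what an external ML integration typically wants to consume. The sketch below uses hypothetical toy names, not Spark's actual UDT machinery or the unwrap_udt API proposed in this ticket:

```python
# Stdlib-only toy model of what "unwrapping" a UserDefinedType (UDT) means.
# A UDT stores objects in an internal struct form; "unwrap" returns that
# internal struct directly instead of the deserialized Python object.
# All names here are hypothetical, NOT Spark's actual classes or functions.

class ToyVectorUDT:
    """Toy UDT: converts between a Python tuple and an internal struct."""
    def serialize(self, obj):
        return {"type": "dense", "values": list(obj)}

    def deserialize(self, struct):
        return tuple(struct["values"])

def unwrap_udt(udt, obj):
    # Expose the raw internal representation rather than the wrapped object,
    # so downstream code can consume the struct fields directly.
    return udt.serialize(obj)

print(unwrap_udt(ToyVectorUDT(), (1.0, 2.0)))  # {'type': 'dense', 'values': [1.0, 2.0]}
```

The round trip (serialize then deserialize) is what Spark normally does transparently; unwrapping skips the second half.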
[jira] [Updated] (SPARK-39064) FileNotFoundException while reading from Kafka
[ https://issues.apache.org/jira/browse/SPARK-39064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sushmita chauhan updated SPARK-39064: - Description: We are running a stateful structured streaming job which reads from Kafka and writes to HDFS. And we are hitting this exception: 17/12/08 05:20:12 ERROR FileFormatWriter: Aborting job null. org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 4, hcube1-1n03.eng.hortonworks.com, executor 1): java.lang.IllegalStateException: Error reading delta file /checkpointDir/state/0/0/1.delta of HDFSStateStoreProvider[id = (op=0, part=0), dir = /checkpointDir/state/0/0]: /checkpointDir/state/0/0/1.delta does not exist at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$updateFromDeltaFile(HDFSBackedStateStoreProvider.scala:410) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:362) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:359) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:359) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:358) at scala.Option.getOrElse(Option.scala:121) at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap(HDFSBackedStateStoreProvider.scala:358) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:360) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:359) at scala.Option.getOrElse(Option.scala:121 Of course, the file doesn't exist in HDFS, and in the {{state/0/0}} directory there is no file at all, while we do have some files in the commits and offsets folders. I am not sure about the reason for this behavior. It seems to happen the second time the job is started. If any more information is needed, I would be happy to provide it. I would also appreciate some input on how best to resolve this issue. was: We are running a stateful structured streaming job which reads from Kafka and writes to HDFS, and we are hitting this exception: 17/12/08 05:20:12 ERROR FileFormatWriter: Aborting job null. 
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 4, hcube1-1n03.eng.hortonworks.com, executor 1): java.lang.IllegalStateException: Error reading delta file /checkpointDir/state/0/0/1.delta of HDFSStateStoreProvider[id = (op=0, part=0), dir = /checkpointDir/state/0/0]: /checkpointDir/state/0/0/1.delta does not exist at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$updateFromDeltaFile(HDFSBackedStateStoreProvider.scala:410) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:362) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:359) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:359) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBacked
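The failure mode in the report above (a committed batch whose state delta files are missing) can be modeled with a toy sketch. This is an assumption-level illustration of the recovery contract suggested by the stack trace — restoring state version N requires delta files 1.delta through N.delta under state/&lt;operator&gt;/&lt;partition&gt; — and `load_state` is a hypothetical stand-in, not Spark's implementation:

```python
# Toy model (NOT Spark's code) of the state-store recovery contract implied
# by the stack trace: replaying state version N requires delta files
# 1.delta .. N.delta under state/<operator>/<partition>. If the commit log
# says a batch finished but the state directory is empty, recovery fails
# with exactly this kind of "delta file ... does not exist" error.
import os
import tempfile

def load_state(version, state_dir):
    """Hypothetical sketch: verify every delta file up to `version` exists."""
    for v in range(1, version + 1):
        delta = os.path.join(state_dir, f"{v}.delta")
        if not os.path.exists(delta):
            raise FileNotFoundError(
                f"Error reading delta file {delta}: {delta} does not exist")

empty_checkpoint = tempfile.mkdtemp()  # stands in for the empty state/0/0
try:
    load_state(1, empty_checkpoint)
except FileNotFoundError as e:
    print(e)  # mirrors the IllegalStateException message in the report
```

This is why the problem tends to surface on the second start: the first run leaves commits/offsets entries behind, and the restarted query then tries to reload state versions whose delta files were never (or no longer) present.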
[jira] [Updated] (SPARK-39070) Pivoting a dataframe after two explodes raises an org.codehaus.commons.compiler.CompileException
[ https://issues.apache.org/jira/browse/SPARK-39070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Altenberg updated SPARK-39070: Environment: Tested on: MacOS 12.3.1, python versions 3.8.10, 3.9.12 Ubuntu 20.04, python version 3.8.10 Tested spark versions 3.1.2, 3.1.3, 3.2.1 and 3.2.0 installed through conda and pip was: Tested on: MacOS 12.3.1, python versions 3.8.10, 3.9.12 Ubuntu 20.04, python version 3.8.10 Tested spark versions 3.2.1 and 3.2.0 installed through conda and pip > Pivoting a dataframe after two explodes raises an > org.codehaus.commons.compiler.CompileException > > > Key: SPARK-39070 > URL: https://issues.apache.org/jira/browse/SPARK-39070 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.2.1 > Environment: Tested on: > MacOS 12.3.1, python versions 3.8.10, 3.9.12 > Ubuntu 20.04, python version 3.8.10 > Tested spark versions 3.1.2, 3.1.3, 3.2.1 and 3.2.0 installed through conda > and pip >Reporter: Felix Altenberg >Priority: Major > Attachments: spark_pivot_error.txt > > > The following Code raises an exception starting in spark version 3.2.0 > {code:java} > import pyspark.sql.functions as sf > from pyspark.sql import Row > data = [ > Row( > other_value=10, > ships=[ > Row( > fields=[ > Row(name="field1", value=1), > Row(name="field2", value=2), > ] > ), > Row( > fields=[ > Row(name="field1", value=3), > Row(name="field3", value=4), > ] > ), > ], > ) > ] > df = spark.createDataFrame(data) > df = df.withColumn("ships", sf.explode("ships")).select( > sf.col("other_value"), sf.col("ships.*") > ) > df = df.withColumn("fields", sf.explode("fields")).select("other_value", > "fields.*") > df = df.groupBy("other_value").pivot("name") #ERROR THROWN HERE {code} > The "pivot("name")" is what causes the error to be thrown. > Having even one less explode (by manually transforming the input dataframe to > what it would look like after the first explode) causes this error not to > appear. 
> Adding a .cache() before the groupBy and pivot also causes this error to > disappear. > I have attached the full stack trace of the error below. > I will try to investigate more after work today and will add anything else I > find. > > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
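For readers trying to follow the shape of the repro, the double-explode-then-pivot transformation can be sketched without Spark. The following is a stdlib-only Python illustration of what the PySpark snippet computes on its sample data; the aggregation choice (first value per field name) is an assumption, since the repro fails before an aggregation is ever specified:

```python
# Stdlib-only sketch (no PySpark) of the SPARK-39070 repro: explode ships,
# explode fields, then pivot the field names into columns per other_value.
from collections import defaultdict

data = [
    {"other_value": 10,
     "ships": [
         {"fields": [{"name": "field1", "value": 1},
                     {"name": "field2", "value": 2}]},
         {"fields": [{"name": "field1", "value": 3},
                     {"name": "field3", "value": 4}]},
     ]}
]

# The two explodes flatten the nested arrays into one row per field entry.
rows = [
    {"other_value": rec["other_value"], "name": f["name"], "value": f["value"]}
    for rec in data
    for ship in rec["ships"]
    for f in ship["fields"]
]

# Pivot: one output row per other_value, one column per distinct name
# (keeping the first value seen for each name -- an assumed aggregation).
pivoted = defaultdict(dict)
for r in rows:
    pivoted[r["other_value"]].setdefault(r["name"], r["value"])

print(dict(pivoted))  # {10: {'field1': 1, 'field2': 2, 'field3': 4}}
```

Nothing here is exotic, which supports the reporter's reading that the failure is in Spark's generated code for the exploded arrays rather than in the logical transformation itself.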
[jira] [Updated] (SPARK-39070) Pivoting a dataframe after two explodes raises an org.codehaus.commons.compiler.CompileException
[ https://issues.apache.org/jira/browse/SPARK-39070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Altenberg updated SPARK-39070: Description: The following code raises an exception starting in Spark version 3.2.0 {code:java} import pyspark.sql.functions as sf from pyspark.sql import Row data = [ Row( other_value=10, ships=[ Row( fields=[ Row(name="field1", value=1), Row(name="field2", value=2), ] ), Row( fields=[ Row(name="field1", value=3), Row(name="field3", value=4), ] ), ], ) ] df = spark.createDataFrame(data) df = df.withColumn("ships", sf.explode("ships")).select( sf.col("other_value"), sf.col("ships.*") ) df = df.withColumn("fields", sf.explode("fields")).select("other_value", "fields.*") df = df.groupBy("other_value").pivot("name") #ERROR THROWN HERE {code} The "pivot("name")" call is what causes the error to be thrown. Having even one fewer explode (by manually transforming the input dataframe to what it would look like after the first explode) causes this error not to appear. Adding a .cache() before the groupBy and pivot also causes this error to disappear. I have attached the full stack trace of the error below. I will try to investigate more after work today and will add anything else I find. was: The following code raises an exception starting in Spark version 3.2.0 {code:java} import pyspark.sql.functions as sf from pyspark.sql import Row data = [ Row( other_value=10, ships=[ Row( fields=[ Row(name="field1", value=1), Row(name="field2", value=2), ] ), Row( fields=[ Row(name="field1", value=3), Row(name="field3", value=4), ] ), ], ) ] df = spark.createDataFrame(data) df = df.withColumn("ships", sf.explode("ships")).select( sf.col("other_value"), sf.col("ships.*") ) df = df.withColumn("fields", sf.explode("fields")).select("other_value", "fields.*") df = df.groupBy("other_value").pivot("name") #ERROR THROWN HERE {code} The "pivot("name")" call is what causes the error to be thrown. I have attached the full stack trace of the error below. 
I will try to investigate more after work today and will add anything else I find. > Pivoting a dataframe after two explodes raises an > org.codehaus.commons.compiler.CompileException > > > Key: SPARK-39070 > URL: https://issues.apache.org/jira/browse/SPARK-39070 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.2.1 > Environment: Tested on: > MacOS 12.3.1, python versions 3.8.10, 3.9.12 > Ubuntu 20.04, python version 3.8.10 > Tested spark versions 3.2.1 and 3.2.0 installed through conda and pip >Reporter: Felix Altenberg >Priority: Major > Attachments: spark_pivot_error.txt > > > The following code raises an exception starting in Spark version 3.2.0 > {code:java} > import pyspark.sql.functions as sf > from pyspark.sql import Row > data = [ > Row( > other_value=10, > ships=[ > Row( > fields=[ > Row(name="field1", value=1), > Row(name="field2", value=2), > ] > ), > Row( > fields=[ > Row(name="field1", value=3), > Row(name="field3", value=4), > ] > ), > ], > ) > ] > df = spark.createDataFrame(data) > df = df.withColumn("ships", sf.explode("ships")).select( > sf.col("other_value"), sf.col("ships.*") > ) > df = df.withColumn("fields", sf.explode("fields")).select("other_value", > "fields.*") > df = df.groupBy("other_value").pivot("name") #ERROR THROWN HERE {code} > The "pivot("name")" call is what causes the error to be thrown. > Having even one fewer explode (by manually transforming the input dataframe to > what it would look like after the first explode) causes this error not to > appear. > Adding a .cache() before the groupBy and pivot also causes this error to > disappear. > I have attached the full stack trace of the error below. > I will try to investigate more after work today and will add anything else I > find. > > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.a
[jira] [Updated] (SPARK-39070) Pivoting a dataframe after two explodes raises an org.codehaus.commons.compiler.CompileException
[ https://issues.apache.org/jira/browse/SPARK-39070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Altenberg updated SPARK-39070: Description: The following code raises an exception starting in Spark version 3.2.0 {code:java} import pyspark.sql.functions as sf from pyspark.sql import Row data = [ Row( other_value=10, ships=[ Row( fields=[ Row(name="field1", value=1), Row(name="field2", value=2), ] ), Row( fields=[ Row(name="field1", value=3), Row(name="field3", value=4), ] ), ], ) ] df = spark.createDataFrame(data) df = df.withColumn("ships", sf.explode("ships")).select( sf.col("other_value"), sf.col("ships.*") ) df = df.withColumn("fields", sf.explode("fields")).select("other_value", "fields.*") df = df.groupBy("other_value").pivot("name") #ERROR THROWN HERE {code} The "pivot("name")" call is what causes the error to be thrown. I have attached the full stack trace of the error below. I will try to investigate more after work today and will add anything else I find. was: The following code raises an exception starting in Spark version 3.2.0 {code:java} import pyspark.sql.functions as sf from pyspark.sql import Row data = [ Row( other_value=10, ships=[ Row( fields=[ Row(name="field1", value=1), Row(name="field2", value=2), ] ), Row( fields=[ Row(name="field1", value=3), Row(name="field3", value=4), ] ), ], ) ] df = spark.createDataFrame(data) df = df.withColumn("ships", sf.explode("ships")).select( sf.col("other_value"), sf.col("ships.*") ) df = df.withColumn("fields", sf.explode("fields")).select("other_value", "fields.*") df = df.groupBy("other_value").pivot("name") #ERROR THROWN HERE {code} The "pivot("name")" call is what causes the error to be thrown. I'll try to attach the full error message. 
> Pivoting a dataframe after two explodes raises an > org.codehaus.commons.compiler.CompileException > > > Key: SPARK-39070 > URL: https://issues.apache.org/jira/browse/SPARK-39070 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.2.1 > Environment: Tested on: > MacOS 12.3.1, python versions 3.8.10, 3.9.12 > Ubuntu 20.04, python version 3.8.10 > Tested spark versions 3.2.1 and 3.2.0 installed through conda and pip >Reporter: Felix Altenberg >Priority: Major > Attachments: spark_pivot_error.txt > > > The following Code raises an exception starting in spark version 3.2.0 > {code:java} > import pyspark.sql.functions as sf > from pyspark.sql import Row > data = [ > Row( > other_value=10, > ships=[ > Row( > fields=[ > Row(name="field1", value=1), > Row(name="field2", value=2), > ] > ), > Row( > fields=[ > Row(name="field1", value=3), > Row(name="field3", value=4), > ] > ), > ], > ) > ] > df = spark.createDataFrame(data) > df = df.withColumn("ships", sf.explode("ships")).select( > sf.col("other_value"), sf.col("ships.*") > ) > df = df.withColumn("fields", sf.explode("fields")).select("other_value", > "fields.*") > df = df.groupBy("other_value").pivot("name") #ERROR THROWN HERE {code} > The "pivot("name")" is what causes the error to be thrown. > I have attached the full stacktrace of the error below. > I will try to investigate more after work today and will add anything else I > find. > > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39070) Pivoting a dataframe after two explodes raises an org.codehaus.commons.compiler.CompileException
[ https://issues.apache.org/jira/browse/SPARK-39070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Altenberg updated SPARK-39070: Attachment: spark_pivot_error.txt > Pivoting a dataframe after two explodes raises an > org.codehaus.commons.compiler.CompileException > > > Key: SPARK-39070 > URL: https://issues.apache.org/jira/browse/SPARK-39070 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.2.1 > Environment: Tested on: > MacOS 12.3.1, python versions 3.8.10, 3.9.12 > Ubuntu 20.04, python version 3.8.10 > Tested spark versions 3.2.1 and 3.2.0 installed through conda and pip >Reporter: Felix Altenberg >Priority: Major > Attachments: spark_pivot_error.txt > > > The following Code raises an exception starting in spark version 3.2.0 > {code:java} > import pyspark.sql.functions as sf > from pyspark.sql import Row > data = [ > Row( > other_value=10, > ships=[ > Row( > fields=[ > Row(name="field1", value=1), > Row(name="field2", value=2), > ] > ), > Row( > fields=[ > Row(name="field1", value=3), > Row(name="field3", value=4), > ] > ), > ], > ) > ] > df = spark.createDataFrame(data) > df = df.withColumn("ships", sf.explode("ships")).select( > sf.col("other_value"), sf.col("ships.*") > ) > df = df.withColumn("fields", sf.explode("fields")).select("other_value", > "fields.*") > df = df.groupBy("other_value").pivot("name") #ERROR THROWN HERE {code} > The "pivot("name")" is what causes the error to be thrown. > I'll try to attach the full error message. > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39070) Pivoting a dataframe after two explodes raises an org.codehaus.commons.compiler.CompileException
Felix Altenberg created SPARK-39070: --- Summary: Pivoting a dataframe after two explodes raises an org.codehaus.commons.compiler.CompileException Key: SPARK-39070 URL: https://issues.apache.org/jira/browse/SPARK-39070 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.1, 3.2.0 Environment: Tested on: MacOS 12.3.1, python versions 3.8.10, 3.9.12 Ubuntu 20.04, python version 3.8.10 Tested spark versions 3.2.1 and 3.2.0 installed through conda and pip Reporter: Felix Altenberg The following Code raises an exception starting in spark version 3.2.0 {code:java} import pyspark.sql.functions as sf from pyspark.sql import Row data = [ Row( other_value=10, ships=[ Row( fields=[ Row(name="field1", value=1), Row(name="field2", value=2), ] ), Row( fields=[ Row(name="field1", value=3), Row(name="field3", value=4), ] ), ], ) ] df = spark.createDataFrame(data) df = df.withColumn("ships", sf.explode("ships")).select( sf.col("other_value"), sf.col("ships.*") ) df = df.withColumn("fields", sf.explode("fields")).select("other_value", "fields.*") df = df.groupBy("other_value").pivot("name") #ERROR THROWN HERE {code} The "pivot("name")" is what causes the error to be thrown. I'll try to attach the full error message. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
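[Editor's illustration] The explode -> explode -> pivot pipeline in the report can be mimicked with plain Python dicts to show the shape of the result the reporter is after; this is only a stand-in (no Spark involved, and unlike DataFrame.pivot no aggregate is applied, so the last value per name wins). A commonly suggested mitigation for codegen CompileExceptions is `spark.sql.codegen.wholeStage=false`, though whether it helps this particular case is unverified.

```python
# Plain-Python stand-in for the reporter's pipeline, using dicts instead of Rows.
data = [{"other_value": 10,
         "ships": [{"fields": [{"name": "field1", "value": 1},
                               {"name": "field2", "value": 2}]},
                   {"fields": [{"name": "field1", "value": 3},
                               {"name": "field3", "value": 4}]}]}]

rows = []
for rec in data:
    for ship in rec["ships"]:         # first explode("ships")
        for field in ship["fields"]:  # second explode("fields")
            rows.append({"other_value": rec["other_value"], **field})

# pivot("name"): one column per distinct field name, grouped by other_value
# (no aggregation here -- the last value seen for a name wins)
pivoted = {}
for r in rows:
    pivoted.setdefault(r["other_value"], {})[r["name"]] = r["value"]
```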
[jira] [Commented] (SPARK-39064) FileNotFoundException while reading from Kafka
[ https://issues.apache.org/jira/browse/SPARK-39064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529922#comment-17529922 ] Sushmita chauhan commented on SPARK-39064: -- Hi, I'm using Spark Structured Streaming and Spark SQL. I read data from Kafka with a windowing operation, union the Kafka and Hive data into a single dataframe, find the latest data by timestamp with a window function and row_number, and then save the dataframe to a Hive table.
{code:scala}
val sparkSession: SparkSession = SparkSession.builder().appName(SPARK_APP_NAME)
  .config("hive.metastore.warehouse.dir", SPARK_HIVE_WAREHOUSE_DIR_SUSPICIOUS_USER)
  // .config("spark.default.parallelism", 70)
  .config("hive.exec.dynamic.partition.mode", "nonstrict")
  .config("hive.support.concurrency", true)
  .config("spark.sql.orc.impl", "native")
  .config("spark.speculation", "false")

query = kakfa_SuspciousUser
  .selectExpr(
    "nullif(TimeStamp, \"\") as TimeStamp",
    "nullif(UserName, \"\") as UserName")
  .groupBy(window(col("TimeStamp"), "30 minute"), col("UserName"))
  .agg(functions.count("UserName").alias("user_UserCount"))
  .select("UserName", "window.end", "user_UserCount")
  .filter(col("user_UserCount").gt(4))
  .writeStream
  .foreachBatch(function = (batchDF: Dataset[Row], batchId: Long) => {
    println("drop table if exists CompromiseUserID_transaction")
    val hive_target_Table = hiveCache.sparkSession.table("User_temp_table")
    println("before hive")
    /* WRITE FINAL DATASET IN SuspiciousIP HIVE TABLE */
    hive_target_Table.write.mode(SaveMode.Append).format("hive").saveAsTable("parichaynew.CompromiseUserIDTable")
  })
  .outputMode("update").trigger(Trigger.ProcessingTime(10, TimeUnit.SECONDS))
  .option("checkpointLocation", CHECK_POINT)
  .start()

query.awaitTermination()
{code}
This is my code; please help me find the issue. The streaming job runs continuously for about 5 hours; after 5 hours, the issue starts with the usual *FileNotFoundException* that happens with HDFS. 
Error reading delta file /checkpointDir/state/0/0/1.delta of HDFSStateStoreProvider[id = (op=0, part=0), dir = /checkpointDir/state/0/0]: /checkpointDir/state/0/0/1.delta does not exist at > FileNotFoundException while reading from Kafka > -- > > Key: SPARK-39064 > URL: https://issues.apache.org/jira/browse/SPARK-39064 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.8 >Reporter: Sushmita chauhan >Priority: Critical > Fix For: 2.4.8 > > Original Estimate: 72h > Remaining Estimate: 72h > > We are running a stateful structured streaming job which reads from Kafka and > writes to HDFS. And we are hitting this exception: > > > 17/12/08 05:20:12 ERROR FileFormatWriter: Aborting job null. > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 > (TID 4, hcube1-1n03.eng.hortonworks.com, executor 1): > java.lang.IllegalStateException: Error reading delta file > /checkpointDir/state/0/0/1.delta of HDFSStateStoreProvider[id = (op=0, > part=0), dir = /checkpointDir/state/0/0]: /checkpointDir/state/0/0/1.delta > does not exist at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$updateFromDeltaFile(HDFSBackedStateStoreProvider.scala:410) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:362) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:359) > at scala.Option.getOrElse(Option.scala:121) at > 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:359) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:358) > at scala.Option.getOrElse(Option.scala:121) at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap(HDFSBackedStateStoreProvider.scala:358) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$H
[jira] [Created] (SPARK-39069) Simplify another conditionals case in predicate
Yuming Wang created SPARK-39069: --- Summary: Simplify another conditionals case in predicate Key: SPARK-39069 URL: https://issues.apache.org/jira/browse/SPARK-39069 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: Yuming Wang {code:scala} sql( """ |CREATE TABLE t1 ( | id DECIMAL(18,0), | event_dt DATE, | cmpgn_run_dt DATE) |USING parquet |PARTITIONED BY (cmpgn_run_dt) """.stripMargin) sql( """ |select count(*) |from t1 |where CMPGN_RUN_DT >= date_sub(EVENT_DT,2) and CMPGN_RUN_DT <= EVENT_DT |and EVENT_DT ='2022-04-05' |; """.stripMargin).explain(true) {code} Expected: {noformat} == Optimized Logical Plan == Aggregate [count(1) AS count(1)#4L] +- Project +- Filter (((isnotnull(CMPGN_RUN_DT#3) AND (CMPGN_RUN_DT#3 >= 2022-04-03)) AND (CMPGN_RUN_DT#3 <= 2022-04-05)) AND (EVENT_DT#2 = 2022-04-05)) +- Relation default.t1[id#1,event_dt#2,cmpgn_run_dt#3] parquet == Physical Plan == *(2) HashAggregate(keys=[], functions=[count(1)], output=[count(1)#4L]) +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [id=#31] +- *(1) HashAggregate(keys=[], functions=[partial_count(1)], output=[count#7L]) +- *(1) Project +- *(1) Filter (EVENT_DT#2 = 2022-04-05) +- *(1) ColumnarToRow +- FileScan parquet default.t1[event_dt#2,cmpgn_run_dt#3] Batched: true, DataFilters: [(event_dt#2 = 2022-04-05)], Format: Parquet, Location: InMemoryFileIndex[], PartitionFilters: [isnotnull(cmpgn_run_dt#3), (cmpgn_run_dt#3 >= 2022-04-03), (cmpgn_run_dt#3 <= 2022-04-05)], PushedFilters: [EqualTo(event_dt,2022-04-05)], ReadSchema: struct, UsedIndexes: [] {noformat} -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
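[Editor's note] The simplification this ticket asks for boils down to constant-folding the date arithmetic: with EVENT_DT fixed by the equality predicate, date_sub(EVENT_DT, 2) is a constant, so the predicate on the partition column CMPGN_RUN_DT collapses to a closed range usable for partition pruning. A minimal sketch of the folded bounds, with the dates taken from the example query:

```python
from datetime import date, timedelta

# EVENT_DT is pinned to '2022-04-05' by the equality filter, so both bounds
# on CMPGN_RUN_DT become constants, matching the optimized plan's
# PartitionFilters: cmpgn_run_dt >= 2022-04-03 and cmpgn_run_dt <= 2022-04-05.
event_dt = date(2022, 4, 5)
lower = event_dt - timedelta(days=2)  # date_sub(EVENT_DT, 2)
upper = event_dt                      # CMPGN_RUN_DT <= EVENT_DT
```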
[jira] [Commented] (SPARK-39068) Make thriftserver and sparksql-cli support in-memory catalog
[ https://issues.apache.org/jira/browse/SPARK-39068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529915#comment-17529915 ] Apache Spark commented on SPARK-39068: -- User 'yaooqinn' has created a pull request for this issue: https://github.com/apache/spark/pull/36407 > Make thriftserver and sparksql-cli support in-memory catalog > > > Key: SPARK-39068 > URL: https://issues.apache.org/jira/browse/SPARK-39068 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Kent Yao >Assignee: Apache Spark >Priority: Major > > # with in-memory, we can start multiple instances in the same folder without > the limitation of derby. > # spark now is multi catalog architecture, the built-in catalog is not > always necessary, in-memory is much lightweight if it is unused -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39068) Make thriftserver and sparksql-cli support in-memory catalog
[ https://issues.apache.org/jira/browse/SPARK-39068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39068: Assignee: (was: Apache Spark) > Make thriftserver and sparksql-cli support in-memory catalog > > > Key: SPARK-39068 > URL: https://issues.apache.org/jira/browse/SPARK-39068 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Kent Yao >Priority: Major > > # with in-memory, we can start multiple instances in the same folder without > the limitation of derby. > # spark now is multi catalog architecture, the built-in catalog is not > always necessary, in-memory is much lightweight if it is unused -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39068) Make thriftserver and sparksql-cli support in-memory catalog
[ https://issues.apache.org/jira/browse/SPARK-39068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39068: Assignee: Apache Spark > Make thriftserver and sparksql-cli support in-memory catalog > > > Key: SPARK-39068 > URL: https://issues.apache.org/jira/browse/SPARK-39068 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Kent Yao >Assignee: Apache Spark >Priority: Major > > # with in-memory, we can start multiple instances in the same folder without > the limitation of derby. > # spark now is multi catalog architecture, the built-in catalog is not > always necessary, in-memory is much lightweight if it is unused -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39068) Make thriftserver and sparksql-cli support in-memory catalog
[ https://issues.apache.org/jira/browse/SPARK-39068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529914#comment-17529914 ] Apache Spark commented on SPARK-39068: -- User 'yaooqinn' has created a pull request for this issue: https://github.com/apache/spark/pull/36407 > Make thriftserver and sparksql-cli support in-memory catalog > > > Key: SPARK-39068 > URL: https://issues.apache.org/jira/browse/SPARK-39068 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Kent Yao >Priority: Major > > # with in-memory, we can start multiple instances in the same folder without > the limitation of derby. > # spark now is multi catalog architecture, the built-in catalog is not > always necessary, in-memory is much lightweight if it is unused -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39068) Make thriftserver and sparksql-cli support in-memory catalog
Kent Yao created SPARK-39068: Summary: Make thriftserver and sparksql-cli support in-memory catalog Key: SPARK-39068 URL: https://issues.apache.org/jira/browse/SPARK-39068 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: Kent Yao # With in-memory, we can start multiple instances in the same folder without the limitation of Derby. # Spark now has a multi-catalog architecture; the built-in catalog is not always necessary, and in-memory is much more lightweight when it is unused -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
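[Editor's sketch] What the requested behavior might look like from the command line, assuming the existing conf spark.sql.catalogImplementation (values hive / in-memory) is what these entry points would honor once the change lands; this is a hypothetical invocation, not a documented feature of the current release:

```shell
# Hypothetical usage: start the CLI / Thrift server against the lightweight
# in-memory catalog instead of the Derby-backed Hive metastore.
bin/spark-sql --conf spark.sql.catalogImplementation=in-memory
sbin/start-thriftserver.sh --conf spark.sql.catalogImplementation=in-memory
```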
[jira] [Commented] (SPARK-39064) FileNotFoundException while reading from Kafka
[ https://issues.apache.org/jira/browse/SPARK-39064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529902#comment-17529902 ] oskarryn commented on SPARK-39064: -- Related https://issues.apache.org/jira/browse/SPARK-26359 If you don't have a tmp delta file to rename like in the issue above, then probably the fastest workaround for you is the deletion of the checkpointDir (or defining a new checkpoint location). If the Spark reader is configured to start from the latest offset, this means probably losing some data. If the reader is configured to use the earliest offset, then some data may be reread and cause issues downstream (duplicates). Would be great if someone could explain what causes this issue and/or how to deal with this error in a better way (using the same checkpointing directory). > FileNotFoundException while reading from Kafka > -- > > Key: SPARK-39064 > URL: https://issues.apache.org/jira/browse/SPARK-39064 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.8 >Reporter: Sushmita chauhan >Priority: Critical > Fix For: 2.4.8 > > Original Estimate: 72h > Remaining Estimate: 72h > > We are running a stateful structured streaming job which reads from Kafka and > writes to HDFS. And we are hitting this exception: > > > 17/12/08 05:20:12 ERROR FileFormatWriter: Aborting job null. 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 > (TID 4, hcube1-1n03.eng.hortonworks.com, executor 1): > java.lang.IllegalStateException: Error reading delta file > /checkpointDir/state/0/0/1.delta of HDFSStateStoreProvider[id = (op=0, > part=0), dir = /checkpointDir/state/0/0]: /checkpointDir/state/0/0/1.delta > does not exist at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$updateFromDeltaFile(HDFSBackedStateStoreProvider.scala:410) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:362) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:359) > at scala.Option.getOrElse(Option.scala:121) at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:359) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:358) > at scala.Option.getOrElse(Option.scala:121) at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap(HDFSBackedStateStoreProvider.scala:358) > at > 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:360) > at > org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1$$anonfun$6.apply(HDFSBackedStateStoreProvider.scala:359) > at scala.Option.getOrElse(Option.scala:121 > > > > Of course, the file doesn't exist in HDFS. And in the {{state/0/0}} directory > there is no file at all. While we have some files in the commits and offsets > folders. I am not sure about the reason of this behavior. It seems to happen > on the second time the job is started, -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
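[Editor's illustration] The path in the stack trace follows the state store's fixed layout: the commits/offsets folders say a batch completed, so on restart the provider demands state/&lt;operator&gt;/&lt;partition&gt;/&lt;batch&gt;.delta under the checkpoint location and fails when it is gone. A small plain-Python helper (illustrative only, not Spark code) makes that expectation explicit:

```python
import os

def required_delta(checkpoint_dir, operator, partition, batch_id):
    # HDFSBackedStateStoreProvider restores version N from
    # <checkpoint>/state/<op>/<part>/<N>.delta (plus any earlier snapshot).
    return os.path.join(checkpoint_dir, "state", str(operator),
                        str(partition), f"{batch_id}.delta")

# The exact file named in the error message above:
missing = required_delta("/checkpointDir", 0, 0, 1)
```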
[jira] [Commented] (SPARK-39067) Upgrade scala-maven-plugin to 4.6.1
[ https://issues.apache.org/jira/browse/SPARK-39067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529898#comment-17529898 ] Apache Spark commented on SPARK-39067: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/36406 > Upgrade scala-maven-plugin to 4.6.1 > --- > > Key: SPARK-39067 > URL: https://issues.apache.org/jira/browse/SPARK-39067 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39067) Upgrade scala-maven-plugin to 4.6.1
[ https://issues.apache.org/jira/browse/SPARK-39067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39067: Assignee: Apache Spark > Upgrade scala-maven-plugin to 4.6.1 > --- > > Key: SPARK-39067 > URL: https://issues.apache.org/jira/browse/SPARK-39067 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39067) Upgrade scala-maven-plugin to 4.6.1
[ https://issues.apache.org/jira/browse/SPARK-39067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39067: Assignee: (was: Apache Spark) > Upgrade scala-maven-plugin to 4.6.1 > --- > > Key: SPARK-39067 > URL: https://issues.apache.org/jira/browse/SPARK-39067 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39067) Upgrade scala-maven-plugin to 4.6.1
Yang Jie created SPARK-39067: Summary: Upgrade scala-maven-plugin to 4.6.1 Key: SPARK-39067 URL: https://issues.apache.org/jira/browse/SPARK-39067 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.4.0 Reporter: Yang Jie -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-26365) spark-submit for k8s cluster doesn't propagate exit code
[ https://issues.apache.org/jira/browse/SPARK-26365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529867#comment-17529867 ] Mayank Asthana edited comment on SPARK-26365 at 4/29/22 8:51 AM: - [~unnamed101] Your change looks good. Can you open a pull request on [https://github.com/apache/spark] for an official review? was (Author: masthana): [~oscar.bonilla] Your change looks good. Can you open a pull request on [https://github.com/apache/spark] for an official review? > spark-submit for k8s cluster doesn't propagate exit code > > > Key: SPARK-26365 > URL: https://issues.apache.org/jira/browse/SPARK-26365 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Spark Core, Spark Submit >Affects Versions: 2.3.2, 2.4.0, 3.0.0, 3.1.0 >Reporter: Oscar Bonilla >Priority: Major > Attachments: spark-2.4.5-raise-exception-k8s-failure.patch, > spark-3.0.0-raise-exception-k8s-failure.patch > > > When launching apps using spark-submit in a kubernetes cluster, if the Spark > applications fails (returns exit code = 1 for example), spark-submit will > still exit gracefully and return exit code = 0. > This is problematic, since there's no way to know if there's been a problem > with the Spark application. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26365) spark-submit for k8s cluster doesn't propagate exit code
[ https://issues.apache.org/jira/browse/SPARK-26365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529867#comment-17529867 ] Mayank Asthana commented on SPARK-26365: [~oscar.bonilla] Your change looks good. Can you open a pull request on [https://github.com/apache/spark] for an official review? > spark-submit for k8s cluster doesn't propagate exit code > > > Key: SPARK-26365 > URL: https://issues.apache.org/jira/browse/SPARK-26365 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Spark Core, Spark Submit >Affects Versions: 2.3.2, 2.4.0, 3.0.0, 3.1.0 >Reporter: Oscar Bonilla >Priority: Major > Attachments: spark-2.4.5-raise-exception-k8s-failure.patch, > spark-3.0.0-raise-exception-k8s-failure.patch > > > When launching apps using spark-submit in a kubernetes cluster, if the Spark > applications fails (returns exit code = 1 for example), spark-submit will > still exit gracefully and return exit code = 0. > This is problematic, since there's no way to know if there's been a problem > with the Spark application. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
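[Editor's illustration] The fix being asked for is ordinary exit-status propagation. In plain Python terms (with sys.exit(1) standing in for a failed Spark driver), a launcher should surface the child's code rather than swallow it and return 0:

```python
import subprocess
import sys

# Child process that fails, standing in for a failed Spark application.
child = subprocess.run([sys.executable, "-c", "import sys; sys.exit(1)"])

# What the reporter expects spark-submit to do: exit with the driver's
# status instead of unconditionally returning 0.
propagated = child.returncode
```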
[jira] [Updated] (SPARK-39066) Run `./dev/test-dependencies.sh --replace-manifest ` use Maven 3.8.5 produce wrong result
[ https://issues.apache.org/jira/browse/SPARK-39066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-39066: - Attachment: spark-deps-hadoop-2-hive-2.3 > Run `./dev/test-dependencies.sh --replace-manifest ` use Maven 3.8.5 produce > wrong result > - > > Key: SPARK-39066 > URL: https://issues.apache.org/jira/browse/SPARK-39066 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Major > Attachments: hadoop2-deps.txt, spark-deps-hadoop-2-hive-2.3 > > > Run `./dev/test-dependencies.sh --replace-manifest ` use maven 3.8.5 produce > wrong result: > {code:java} > diff --git a/dev/deps/spark-deps-hadoop-2-hive-2.3 > b/dev/deps/spark-deps-hadoop-2-hive-2.3 > index b6df3ea5ce..e803aadcfc 100644 > --- a/dev/deps/spark-deps-hadoop-2-hive-2.3 > +++ b/dev/deps/spark-deps-hadoop-2-hive-2.3 > @@ -6,15 +6,14 @@ ST4/4.0.4//ST4-4.0.4.jar > activation/1.1.1//activation-1.1.1.jar > aircompressor/0.21//aircompressor-0.21.jar > algebra_2.12/2.0.1//algebra_2.12-2.0.1.jar > +aliyun-java-sdk-core/4.5.10//aliyun-java-sdk-core-4.5.10.jar > +aliyun-java-sdk-kms/2.11.0//aliyun-java-sdk-kms-2.11.0.jar > +aliyun-java-sdk-ram/3.1.0//aliyun-java-sdk-ram-3.1.0.jar > +aliyun-sdk-oss/3.13.0//aliyun-sdk-oss-3.13.0.jar > annotations/17.0.0//annotations-17.0.0.jar > antlr-runtime/3.5.2//antlr-runtime-3.5.2.jar > antlr4-runtime/4.8//antlr4-runtime-4.8.jar > aopalliance-repackaged/2.6.1//aopalliance-repackaged-2.6.1.jar > -aopalliance/1.0//aopalliance-1.0.jar > -apacheds-i18n/2.0.0-M15//apacheds-i18n-2.0.0-M15.jar > -apacheds-kerberos-codec/2.0.0-M15//apacheds-kerberos-codec-2.0.0-M15.jar > -api-asn1-api/1.0.0-M20//api-asn1-api-1.0.0-M20.jar > -api-util/1.0.0-M20//api-util-1.0.0-M20.jar > arpack/2.2.1//arpack-2.2.1.jar > arpack_combined_all/0.1//arpack_combined_all-0.1.jar > arrow-format/7.0.0//arrow-format-7.0.0.jar > @@ -26,7 +25,10 @@ automaton/1.11-8//automaton-1.11-8.jar > 
avro-ipc/1.11.0//avro-ipc-1.11.0.jar > avro-mapred/1.11.0//avro-mapred-1.11.0.jar > avro/1.11.0//avro-1.11.0.jar > -azure-storage/2.0.0//azure-storage-2.0.0.jar > +aws-java-sdk-bundle/1.11.1026//aws-java-sdk-bundle-1.11.1026.jar > +azure-data-lake-store-sdk/2.3.9//azure-data-lake-store-sdk-2.3.9.jar > +azure-keyvault-core/1.0.0//azure-keyvault-core-1.0.0.jar > +azure-storage/7.0.1//azure-storage-7.0.1.jar > blas/2.2.1//blas-2.2.1.jar > bonecp/0.8.0.RELEASE//bonecp-0.8.0.RELEASE.jar > breeze-macros_2.12/1.2//breeze-macros_2.12-1.2.jar > @@ -34,28 +36,24 @@ breeze_2.12/1.2//breeze_2.12-1.2.jar > cats-kernel_2.12/2.1.1//cats-kernel_2.12-2.1.1.jar > chill-java/0.10.0//chill-java-0.10.0.jar > chill_2.12/0.10.0//chill_2.12-0.10.0.jar > -commons-beanutils/1.9.4//commons-beanutils-1.9.4.jar > commons-cli/1.5.0//commons-cli-1.5.0.jar > commons-codec/1.15//commons-codec-1.15.jar > commons-collections/3.2.2//commons-collections-3.2.2.jar > commons-collections4/4.4//commons-collections4-4.4.jar > commons-compiler/3.0.16//commons-compiler-3.0.16.jar > commons-compress/1.21//commons-compress-1.21.jar > -commons-configuration/1.6//commons-configuration-1.6.jar > commons-crypto/1.1.0//commons-crypto-1.1.0.jar > commons-dbcp/1.4//commons-dbcp-1.4.jar > -commons-digester/1.8//commons-digester-1.8.jar > -commons-httpclient/3.1//commons-httpclient-3.1.jar > commons-io/2.4//commons-io-2.4.jar > commons-lang/2.6//commons-lang-2.6.jar > commons-lang3/3.12.0//commons-lang3-3.12.0.jar > commons-logging/1.1.3//commons-logging-1.1.3.jar > commons-math3/3.6.1//commons-math3-3.6.1.jar > -commons-net/3.1//commons-net-3.1.jar > commons-pool/1.5.4//commons-pool-1.5.4.jar > commons-text/1.9//commons-text-1.9.jar > compress-lzf/1.1//compress-lzf-1.1.jar > core/1.1.2//core-1.1.2.jar > +cos_api-bundle/5.6.19//cos_api-bundle-5.6.19.jar > curator-client/2.7.1//curator-client-2.7.1.jar > curator-framework/2.7.1//curator-framework-2.7.1.jar > curator-recipes/2.7.1//curator-recipes-2.7.1.jar > @@ -69,25 
+67,17 @@ generex/1.0.2//generex-1.0.2.jar > gmetric4j/1.0.10//gmetric4j-1.0.10.jar > gson/2.2.4//gson-2.2.4.jar > guava/14.0.1//guava-14.0.1.jar > -guice-servlet/3.0//guice-servlet-3.0.jar > -guice/3.0//guice-3.0.jar > -hadoop-annotations/2.7.4//hadoop-annotations-2.7.4.jar > -hadoop-auth/2.7.4//hadoop-auth-2.7.4.jar > -hadoop-aws/2.7.4//hadoop-aws-2.7.4.jar > -hadoop-azure/2.7.4//hadoop-azure-2.7.4.jar > -hadoop-client/2.7.4//hadoop-client-2.7.4.jar > -hadoop-common/2.7.4//hadoop-common-2.7.4.jar > -hadoop-hdfs/2.7.4//hadoop-hdfs-2.7.4.jar > -hadoop-mapreduce-client-app/2.7.4//hadoop-mapreduce-client-app-2.7.4.jar > -hadoop-mapreduce-client-common/2.7.4//hadoop-mapreduce-client-common-2.7.4.jar > -hadoop-mapreduce-client-core/2.7.4//hadoop-mapred
[jira] [Updated] (SPARK-39066) Run `./dev/test-dependencies.sh --replace-manifest ` use Maven 3.8.5 produce wrong result
[ https://issues.apache.org/jira/browse/SPARK-39066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-39066: - Attachment: hadoop2-deps.txt > Run `./dev/test-dependencies.sh --replace-manifest ` use Maven 3.8.5 produce > wrong result > - > > Key: SPARK-39066 > URL: https://issues.apache.org/jira/browse/SPARK-39066 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Major > Attachments: hadoop2-deps.txt > > > Run `./dev/test-dependencies.sh --replace-manifest ` use maven 3.8.5 produce > wrong result: > {code:java} > diff --git a/dev/deps/spark-deps-hadoop-2-hive-2.3 > b/dev/deps/spark-deps-hadoop-2-hive-2.3 > index b6df3ea5ce..e803aadcfc 100644 > --- a/dev/deps/spark-deps-hadoop-2-hive-2.3 > +++ b/dev/deps/spark-deps-hadoop-2-hive-2.3 > @@ -6,15 +6,14 @@ ST4/4.0.4//ST4-4.0.4.jar > activation/1.1.1//activation-1.1.1.jar > aircompressor/0.21//aircompressor-0.21.jar > algebra_2.12/2.0.1//algebra_2.12-2.0.1.jar > +aliyun-java-sdk-core/4.5.10//aliyun-java-sdk-core-4.5.10.jar > +aliyun-java-sdk-kms/2.11.0//aliyun-java-sdk-kms-2.11.0.jar > +aliyun-java-sdk-ram/3.1.0//aliyun-java-sdk-ram-3.1.0.jar > +aliyun-sdk-oss/3.13.0//aliyun-sdk-oss-3.13.0.jar > annotations/17.0.0//annotations-17.0.0.jar > antlr-runtime/3.5.2//antlr-runtime-3.5.2.jar > antlr4-runtime/4.8//antlr4-runtime-4.8.jar > aopalliance-repackaged/2.6.1//aopalliance-repackaged-2.6.1.jar > -aopalliance/1.0//aopalliance-1.0.jar > -apacheds-i18n/2.0.0-M15//apacheds-i18n-2.0.0-M15.jar > -apacheds-kerberos-codec/2.0.0-M15//apacheds-kerberos-codec-2.0.0-M15.jar > -api-asn1-api/1.0.0-M20//api-asn1-api-1.0.0-M20.jar > -api-util/1.0.0-M20//api-util-1.0.0-M20.jar > arpack/2.2.1//arpack-2.2.1.jar > arpack_combined_all/0.1//arpack_combined_all-0.1.jar > arrow-format/7.0.0//arrow-format-7.0.0.jar > @@ -26,7 +25,10 @@ automaton/1.11-8//automaton-1.11-8.jar > avro-ipc/1.11.0//avro-ipc-1.11.0.jar > 
avro-mapred/1.11.0//avro-mapred-1.11.0.jar > avro/1.11.0//avro-1.11.0.jar > -azure-storage/2.0.0//azure-storage-2.0.0.jar > +aws-java-sdk-bundle/1.11.1026//aws-java-sdk-bundle-1.11.1026.jar > +azure-data-lake-store-sdk/2.3.9//azure-data-lake-store-sdk-2.3.9.jar > +azure-keyvault-core/1.0.0//azure-keyvault-core-1.0.0.jar > +azure-storage/7.0.1//azure-storage-7.0.1.jar > blas/2.2.1//blas-2.2.1.jar > bonecp/0.8.0.RELEASE//bonecp-0.8.0.RELEASE.jar > breeze-macros_2.12/1.2//breeze-macros_2.12-1.2.jar > @@ -34,28 +36,24 @@ breeze_2.12/1.2//breeze_2.12-1.2.jar > cats-kernel_2.12/2.1.1//cats-kernel_2.12-2.1.1.jar > chill-java/0.10.0//chill-java-0.10.0.jar > chill_2.12/0.10.0//chill_2.12-0.10.0.jar > -commons-beanutils/1.9.4//commons-beanutils-1.9.4.jar > commons-cli/1.5.0//commons-cli-1.5.0.jar > commons-codec/1.15//commons-codec-1.15.jar > commons-collections/3.2.2//commons-collections-3.2.2.jar > commons-collections4/4.4//commons-collections4-4.4.jar > commons-compiler/3.0.16//commons-compiler-3.0.16.jar > commons-compress/1.21//commons-compress-1.21.jar > -commons-configuration/1.6//commons-configuration-1.6.jar > commons-crypto/1.1.0//commons-crypto-1.1.0.jar > commons-dbcp/1.4//commons-dbcp-1.4.jar > -commons-digester/1.8//commons-digester-1.8.jar > -commons-httpclient/3.1//commons-httpclient-3.1.jar > commons-io/2.4//commons-io-2.4.jar > commons-lang/2.6//commons-lang-2.6.jar > commons-lang3/3.12.0//commons-lang3-3.12.0.jar > commons-logging/1.1.3//commons-logging-1.1.3.jar > commons-math3/3.6.1//commons-math3-3.6.1.jar > -commons-net/3.1//commons-net-3.1.jar > commons-pool/1.5.4//commons-pool-1.5.4.jar > commons-text/1.9//commons-text-1.9.jar > compress-lzf/1.1//compress-lzf-1.1.jar > core/1.1.2//core-1.1.2.jar > +cos_api-bundle/5.6.19//cos_api-bundle-5.6.19.jar > curator-client/2.7.1//curator-client-2.7.1.jar > curator-framework/2.7.1//curator-framework-2.7.1.jar > curator-recipes/2.7.1//curator-recipes-2.7.1.jar > @@ -69,25 +67,17 @@ 
generex/1.0.2//generex-1.0.2.jar > gmetric4j/1.0.10//gmetric4j-1.0.10.jar > gson/2.2.4//gson-2.2.4.jar > guava/14.0.1//guava-14.0.1.jar > -guice-servlet/3.0//guice-servlet-3.0.jar > -guice/3.0//guice-3.0.jar > -hadoop-annotations/2.7.4//hadoop-annotations-2.7.4.jar > -hadoop-auth/2.7.4//hadoop-auth-2.7.4.jar > -hadoop-aws/2.7.4//hadoop-aws-2.7.4.jar > -hadoop-azure/2.7.4//hadoop-azure-2.7.4.jar > -hadoop-client/2.7.4//hadoop-client-2.7.4.jar > -hadoop-common/2.7.4//hadoop-common-2.7.4.jar > -hadoop-hdfs/2.7.4//hadoop-hdfs-2.7.4.jar > -hadoop-mapreduce-client-app/2.7.4//hadoop-mapreduce-client-app-2.7.4.jar > -hadoop-mapreduce-client-common/2.7.4//hadoop-mapreduce-client-common-2.7.4.jar > -hadoop-mapreduce-client-core/2.7.4//hadoop-mapreduce-client-core-2.7.4.jar > -hadoop-mapred
[jira] [Created] (SPARK-39066) Running `./dev/test-dependencies.sh --replace-manifest` with Maven 3.8.5 produces a wrong result
Yang Jie created SPARK-39066: Summary: Run `./dev/test-dependencies.sh --replace-manifest ` use Maven 3.8.5 produce wrong result Key: SPARK-39066 URL: https://issues.apache.org/jira/browse/SPARK-39066 Project: Spark Issue Type: Bug Components: Build Affects Versions: 3.4.0 Reporter: Yang Jie Run `./dev/test-dependencies.sh --replace-manifest ` use maven 3.8.5 produce wrong result: {code:java} diff --git a/dev/deps/spark-deps-hadoop-2-hive-2.3 b/dev/deps/spark-deps-hadoop-2-hive-2.3 index b6df3ea5ce..e803aadcfc 100644 --- a/dev/deps/spark-deps-hadoop-2-hive-2.3 +++ b/dev/deps/spark-deps-hadoop-2-hive-2.3 @@ -6,15 +6,14 @@ ST4/4.0.4//ST4-4.0.4.jar activation/1.1.1//activation-1.1.1.jar aircompressor/0.21//aircompressor-0.21.jar algebra_2.12/2.0.1//algebra_2.12-2.0.1.jar +aliyun-java-sdk-core/4.5.10//aliyun-java-sdk-core-4.5.10.jar +aliyun-java-sdk-kms/2.11.0//aliyun-java-sdk-kms-2.11.0.jar +aliyun-java-sdk-ram/3.1.0//aliyun-java-sdk-ram-3.1.0.jar +aliyun-sdk-oss/3.13.0//aliyun-sdk-oss-3.13.0.jar annotations/17.0.0//annotations-17.0.0.jar antlr-runtime/3.5.2//antlr-runtime-3.5.2.jar antlr4-runtime/4.8//antlr4-runtime-4.8.jar aopalliance-repackaged/2.6.1//aopalliance-repackaged-2.6.1.jar -aopalliance/1.0//aopalliance-1.0.jar -apacheds-i18n/2.0.0-M15//apacheds-i18n-2.0.0-M15.jar -apacheds-kerberos-codec/2.0.0-M15//apacheds-kerberos-codec-2.0.0-M15.jar -api-asn1-api/1.0.0-M20//api-asn1-api-1.0.0-M20.jar -api-util/1.0.0-M20//api-util-1.0.0-M20.jar arpack/2.2.1//arpack-2.2.1.jar arpack_combined_all/0.1//arpack_combined_all-0.1.jar arrow-format/7.0.0//arrow-format-7.0.0.jar @@ -26,7 +25,10 @@ automaton/1.11-8//automaton-1.11-8.jar avro-ipc/1.11.0//avro-ipc-1.11.0.jar avro-mapred/1.11.0//avro-mapred-1.11.0.jar avro/1.11.0//avro-1.11.0.jar -azure-storage/2.0.0//azure-storage-2.0.0.jar +aws-java-sdk-bundle/1.11.1026//aws-java-sdk-bundle-1.11.1026.jar +azure-data-lake-store-sdk/2.3.9//azure-data-lake-store-sdk-2.3.9.jar +azure-keyvault-core/1.0.0//azure-keyvault-core-1.0.0.jar 
+azure-storage/7.0.1//azure-storage-7.0.1.jar blas/2.2.1//blas-2.2.1.jar bonecp/0.8.0.RELEASE//bonecp-0.8.0.RELEASE.jar breeze-macros_2.12/1.2//breeze-macros_2.12-1.2.jar @@ -34,28 +36,24 @@ breeze_2.12/1.2//breeze_2.12-1.2.jar cats-kernel_2.12/2.1.1//cats-kernel_2.12-2.1.1.jar chill-java/0.10.0//chill-java-0.10.0.jar chill_2.12/0.10.0//chill_2.12-0.10.0.jar -commons-beanutils/1.9.4//commons-beanutils-1.9.4.jar commons-cli/1.5.0//commons-cli-1.5.0.jar commons-codec/1.15//commons-codec-1.15.jar commons-collections/3.2.2//commons-collections-3.2.2.jar commons-collections4/4.4//commons-collections4-4.4.jar commons-compiler/3.0.16//commons-compiler-3.0.16.jar commons-compress/1.21//commons-compress-1.21.jar -commons-configuration/1.6//commons-configuration-1.6.jar commons-crypto/1.1.0//commons-crypto-1.1.0.jar commons-dbcp/1.4//commons-dbcp-1.4.jar -commons-digester/1.8//commons-digester-1.8.jar -commons-httpclient/3.1//commons-httpclient-3.1.jar commons-io/2.4//commons-io-2.4.jar commons-lang/2.6//commons-lang-2.6.jar commons-lang3/3.12.0//commons-lang3-3.12.0.jar commons-logging/1.1.3//commons-logging-1.1.3.jar commons-math3/3.6.1//commons-math3-3.6.1.jar -commons-net/3.1//commons-net-3.1.jar commons-pool/1.5.4//commons-pool-1.5.4.jar commons-text/1.9//commons-text-1.9.jar compress-lzf/1.1//compress-lzf-1.1.jar core/1.1.2//core-1.1.2.jar +cos_api-bundle/5.6.19//cos_api-bundle-5.6.19.jar curator-client/2.7.1//curator-client-2.7.1.jar curator-framework/2.7.1//curator-framework-2.7.1.jar curator-recipes/2.7.1//curator-recipes-2.7.1.jar @@ -69,25 +67,17 @@ generex/1.0.2//generex-1.0.2.jar gmetric4j/1.0.10//gmetric4j-1.0.10.jar gson/2.2.4//gson-2.2.4.jar guava/14.0.1//guava-14.0.1.jar -guice-servlet/3.0//guice-servlet-3.0.jar -guice/3.0//guice-3.0.jar -hadoop-annotations/2.7.4//hadoop-annotations-2.7.4.jar -hadoop-auth/2.7.4//hadoop-auth-2.7.4.jar -hadoop-aws/2.7.4//hadoop-aws-2.7.4.jar -hadoop-azure/2.7.4//hadoop-azure-2.7.4.jar 
-hadoop-client/2.7.4//hadoop-client-2.7.4.jar -hadoop-common/2.7.4//hadoop-common-2.7.4.jar -hadoop-hdfs/2.7.4//hadoop-hdfs-2.7.4.jar -hadoop-mapreduce-client-app/2.7.4//hadoop-mapreduce-client-app-2.7.4.jar -hadoop-mapreduce-client-common/2.7.4//hadoop-mapreduce-client-common-2.7.4.jar -hadoop-mapreduce-client-core/2.7.4//hadoop-mapreduce-client-core-2.7.4.jar -hadoop-mapreduce-client-jobclient/2.7.4//hadoop-mapreduce-client-jobclient-2.7.4.jar -hadoop-mapreduce-client-shuffle/2.7.4//hadoop-mapreduce-client-shuffle-2.7.4.jar -hadoop-openstack/2.7.4//hadoop-openstack-2.7.4.jar -hadoop-yarn-api/2.7.4//hadoop-yarn-api-2.7.4.jar -hadoop-yarn-client/2.7.4//hadoop-yarn-client-2.7.4.jar -hadoop-yarn-common/2.7.4//hadoop-yarn-common-2.7.4.jar -hadoop-yarn-server-common/2.7.4//hadoop-yarn-server-common-2.7.4.jar +hadoop-aliyun/3.3.2//hadoop-aliyun-3.3.2.jar +had
[jira] [Assigned] (SPARK-39065) DS V2 Limit push-down should avoid out of memory
[ https://issues.apache.org/jira/browse/SPARK-39065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39065: Assignee: Apache Spark > DS V2 Limit push-down should avoid out of memory > > > Key: SPARK-39065 > URL: https://issues.apache.org/jira/browse/SPARK-39065 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Assignee: Apache Spark >Priority: Major > > Currently, Spark DS V2 supports pushing the Limit operator down to the data source, > but this behavior is controlled only by the pushDownList option. > If the limit is very large, the executor will pull the entire result set from the > data source, which can cause an out-of-memory error.
[jira] [Assigned] (SPARK-39065) DS V2 Limit push-down should avoid out of memory
[ https://issues.apache.org/jira/browse/SPARK-39065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-39065: Assignee: (was: Apache Spark) > DS V2 Limit push-down should avoid out of memory > > > Key: SPARK-39065 > URL: https://issues.apache.org/jira/browse/SPARK-39065 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Priority: Major > > Currently, Spark DS V2 supports pushing the Limit operator down to the data source, > but this behavior is controlled only by the pushDownList option. > If the limit is very large, the executor will pull the entire result set from the > data source, which can cause an out-of-memory error.
[jira] [Commented] (SPARK-39065) DS V2 Limit push-down should avoid out of memory
[ https://issues.apache.org/jira/browse/SPARK-39065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529860#comment-17529860 ] Apache Spark commented on SPARK-39065: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/36405 > DS V2 Limit push-down should avoid out of memory > > > Key: SPARK-39065 > URL: https://issues.apache.org/jira/browse/SPARK-39065 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Priority: Major > > Currently, Spark DS V2 supports pushing the Limit operator down to the data source, > but this behavior is controlled only by the pushDownList option. > If the limit is very large, the executor will pull the entire result set from the > data source, which can cause an out-of-memory error.
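The OOM risk described in SPARK-39065 is easiest to see with a toy model of a scan. The sketch below is a minimal illustration, not Spark's actual DS V2 API (all names here are hypothetical): applying the limit inside the source scan bounds how many rows ever reach the executor, while the non-pushed variant materializes the full result set first.

```scala
// Hypothetical sketch of why limit push-down bounds executor memory.
// Nothing here is Spark's real API; the iterator stands in for a remote scan.
object LimitPushdownSketch {
  // Pretend this is a very large remote result set.
  def remoteScan(): Iterator[Int] = Iterator.range(0, 1000000)

  // Without push-down: the executor materializes the whole result set,
  // then applies the limit locally, so 1,000,000 rows are buffered.
  def withoutPushdown(limit: Int): Seq[Int] =
    remoteScan().toVector.take(limit)

  // With push-down: the limit is applied while scanning, so at most
  // `limit` rows are ever buffered.
  def withPushdown(limit: Int): Seq[Int] =
    remoteScan().take(limit).toVector

  def main(args: Array[String]): Unit = {
    assert(withPushdown(5) == withoutPushdown(5)) // same answer, less memory
    println(withPushdown(5)) // Vector(0, 1, 2, 3, 4)
  }
}
```

The ticket's point is that the second shape should be the one Spark produces whenever the source accepts the pushed limit, instead of relying solely on the push-down option.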
[jira] [Assigned] (SPARK-38737) Test the error classes: INVALID_FIELD_NAME
[ https://issues.apache.org/jira/browse/SPARK-38737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38737: Assignee: (was: Apache Spark) > Test the error classes: INVALID_FIELD_NAME > -- > > Key: SPARK-38737 > URL: https://issues.apache.org/jira/browse/SPARK-38737 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Minor > Labels: starter > > Add tests for the error class *INVALID_FIELD_NAME* to > QueryCompilationErrorsSuite. The test should cover the exception thrown in > QueryCompilationErrors: > {code:scala} > def invalidFieldName(fieldName: Seq[String], path: Seq[String], context: > Origin): Throwable = { > new AnalysisException( > errorClass = "INVALID_FIELD_NAME", > messageParameters = Array(fieldName.quoted, path.quoted), > origin = context) > } > {code} > For example, here is a test for the error class *UNSUPPORTED_FEATURE*: > https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170 > +The test must check:+ > # the entire error message > # sqlState if it is defined in the error-classes.json file > # the error class
[jira] [Assigned] (SPARK-38737) Test the error classes: INVALID_FIELD_NAME
[ https://issues.apache.org/jira/browse/SPARK-38737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-38737: Assignee: Apache Spark > Test the error classes: INVALID_FIELD_NAME > -- > > Key: SPARK-38737 > URL: https://issues.apache.org/jira/browse/SPARK-38737 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Minor > Labels: starter > > Add tests for the error class *INVALID_FIELD_NAME* to > QueryCompilationErrorsSuite. The test should cover the exception thrown in > QueryCompilationErrors: > {code:scala} > def invalidFieldName(fieldName: Seq[String], path: Seq[String], context: > Origin): Throwable = { > new AnalysisException( > errorClass = "INVALID_FIELD_NAME", > messageParameters = Array(fieldName.quoted, path.quoted), > origin = context) > } > {code} > For example, here is a test for the error class *UNSUPPORTED_FEATURE*: > https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170 > +The test must check:+ > # the entire error message > # sqlState if it is defined in the error-classes.json file > # the error class
[jira] [Commented] (SPARK-38737) Test the error classes: INVALID_FIELD_NAME
[ https://issues.apache.org/jira/browse/SPARK-38737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529859#comment-17529859 ] Apache Spark commented on SPARK-38737: -- User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/36404 > Test the error classes: INVALID_FIELD_NAME > -- > > Key: SPARK-38737 > URL: https://issues.apache.org/jira/browse/SPARK-38737 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Minor > Labels: starter > > Add tests for the error class *INVALID_FIELD_NAME* to > QueryCompilationErrorsSuite. The test should cover the exception thrown in > QueryCompilationErrors: > {code:scala} > def invalidFieldName(fieldName: Seq[String], path: Seq[String], context: > Origin): Throwable = { > new AnalysisException( > errorClass = "INVALID_FIELD_NAME", > messageParameters = Array(fieldName.quoted, path.quoted), > origin = context) > } > {code} > For example, here is a test for the error class *UNSUPPORTED_FEATURE*: > https://github.com/apache/spark/blob/34e3029a43d2a8241f70f2343be8285cb7f231b9/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryCompilationErrorsSuite.scala#L151-L170 > +The test must check:+ > # the entire error message > # sqlState if it is defined in the error-classes.json file > # the error class
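As a Spark-free illustration of the three checks listed in the ticket, the sketch below uses a stand-in exception type. `FakeAnalysisException`, the message template, and the sqlState value `42000` are placeholders invented for this sketch; the real test would intercept Spark's `AnalysisException` in QueryCompilationErrorsSuite and compare against error-classes.json.

```scala
// Illustrative only: a stand-in for Spark's AnalysisException, used to show
// the shape of a test that checks error class, sqlState, and full message.
object InvalidFieldNameTestSketch {
  final case class FakeAnalysisException(
      errorClass: String,
      messageParameters: Array[String],
      sqlState: String)
    extends Exception(
      s"Field name ${messageParameters(0)} is invalid: " +
        s"${messageParameters(1)} is not a struct.")

  // Stand-in for QueryCompilationErrors.invalidFieldName (simplified types).
  def invalidFieldName(fieldName: String, path: String): Throwable =
    FakeAnalysisException("INVALID_FIELD_NAME", Array(fieldName, path), "42000")

  def main(args: Array[String]): Unit = {
    val e =
      try throw invalidFieldName("`m`.`key`", "`m`")
      catch { case ex: FakeAnalysisException => ex }
    assert(e.errorClass == "INVALID_FIELD_NAME") // check 3: the error class
    assert(e.sqlState == "42000")                // check 2: sqlState (placeholder value)
    assert(e.getMessage ==                       // check 1: the entire message
      "Field name `m`.`key` is invalid: `m` is not a struct.")
    println("all three checks passed")
  }
}
```

The point of asserting on the entire message, not a substring, is that a change to the template in error-classes.json should fail the test.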
[jira] [Created] (SPARK-39065) DS V2 Limit push-down should avoid out of memory
jiaan.geng created SPARK-39065: -- Summary: DS V2 Limit push-down should avoid out of memory Key: SPARK-39065 URL: https://issues.apache.org/jira/browse/SPARK-39065 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: jiaan.geng Currently, Spark DS V2 supports push down Limit operator to data source. But the behavior only controlled by pushDownList option. If the limit is very large, then Executor will pull all the result set from data source. So it will cause the memory issue as you know. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-39035) Add tests for options from `to_csv` and `from_csv`.
[ https://issues.apache.org/jira/browse/SPARK-39035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-39035: Assignee: Haejoon Lee > Add tests for options from `to_csv` and `from_csv`. > --- > > Key: SPARK-39035 > URL: https://issues.apache.org/jira/browse/SPARK-39035 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > > There are many supported options for to_csv and from_csv > (https://spark.apache.org/docs/latest/sql-data-sources-csv.html#data-source-option), > but they are currently not tested. > We should add tests for these options to > `sql/core/src/test/scala/org/apache/spark/sql/CsvFunctionsSuite.scala`.
[jira] [Resolved] (SPARK-39035) Add tests for options from `to_csv` and `from_csv`.
[ https://issues.apache.org/jira/browse/SPARK-39035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-39035. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 36401 [https://github.com/apache/spark/pull/36401] > Add tests for options from `to_csv` and `from_csv`. > --- > > Key: SPARK-39035 > URL: https://issues.apache.org/jira/browse/SPARK-39035 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > Fix For: 3.4.0 > > > There are many supported options for to_csv and from_csv > (https://spark.apache.org/docs/latest/sql-data-sources-csv.html#data-source-option), > but they are currently not tested. > We should add tests for these options to > `sql/core/src/test/scala/org/apache/spark/sql/CsvFunctionsSuite.scala`.
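One Spark-free way to sketch the option-coverage idea from SPARK-39035: for each option, craft an input that parses correctly only when the option is honored, then assert on the result. `toyParseCsv` below is a deliberately tiny, hypothetical parser invented for this sketch; the real tests would exercise `from_csv`/`to_csv` with an options map in CsvFunctionsSuite.

```scala
// Illustrative sketch of per-option CSV tests. toyParseCsv is a toy stand-in
// that only honors a `sep` option, defaulting to a comma.
object CsvOptionTestSketch {
  def toyParseCsv(line: String, options: Map[String, String]): Seq[String] = {
    val sep = options.getOrElse("sep", ",")
    // The -1 limit keeps trailing empty fields, as CSV parsers normally do.
    line.split(java.util.regex.Pattern.quote(sep), -1).toSeq
  }

  def main(args: Array[String]): Unit = {
    // Default separator.
    assert(toyParseCsv("a,b,c", Map.empty) == Seq("a", "b", "c"))
    // The `sep` option is honored: pipe-delimited input yields three fields.
    assert(toyParseCsv("a|b|c", Map("sep" -> "|")) == Seq("a", "b", "c"))
    // With the default separator the same input stays one field, which shows
    // this test really does exercise the option rather than the default path.
    assert(toyParseCsv("a|b|c", Map.empty) == Seq("a|b|c"))
    println("option checks passed")
  }
}
```

The key design point carries over to the real suite: each test's input must be chosen so that the assertion fails if the option under test is silently ignored.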