[jira] [Assigned] (SPARK-44072) Update the incorrect sql example of insert table documentation
[ https://issues.apache.org/jira/browse/SPARK-44072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-44072: --- Assignee: Yang Zhang > Update the incorrect sql example of insert table documentation > -- > > Key: SPARK-44072 > URL: https://issues.apache.org/jira/browse/SPARK-44072 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 3.3.3, 3.4.1, 3.5.0 >Reporter: Yang Zhang >Assignee: Yang Zhang >Priority: Major > Fix For: 3.5.0 > > > Latest docs of insert table has an incorrect sql example about 'Insert Using > a Typed Date Literal for a Partition Column Value'. > It should be > {code:java} > INSERT OVERWRITE students PARTITION (birthday = date'2019-01-02') > VALUES('Jason Wang', '908 Bird St, Saratoga'); {code} > Doc link: > https://spark.apache.org/docs/latest/sql-ref-syntax-dml-insert-table.html#insert-using-a-typed-date-literal-for-a-partition-column-value-1 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44072) Update the incorrect sql example of insert table documentation
[ https://issues.apache.org/jira/browse/SPARK-44072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-44072: Fix Version/s: (was: 3.4.1) (was: 3.3.3) > Update the incorrect sql example of insert table documentation > -- > > Key: SPARK-44072 > URL: https://issues.apache.org/jira/browse/SPARK-44072 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 3.3.3, 3.4.1, 3.5.0 >Reporter: Yang Zhang >Priority: Major > Fix For: 3.5.0 > > > Latest docs of insert table has an incorrect sql example about 'Insert Using > a Typed Date Literal for a Partition Column Value'. > It should be > {code:java} > INSERT OVERWRITE students PARTITION (birthday = date'2019-01-02') > VALUES('Jason Wang', '908 Bird St, Saratoga'); {code} > Doc link: > https://spark.apache.org/docs/latest/sql-ref-syntax-dml-insert-table.html#insert-using-a-typed-date-literal-for-a-partition-column-value-1 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44072) Update the incorrect sql example of insert table documentation
[ https://issues.apache.org/jira/browse/SPARK-44072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-44072: Fix Version/s: 3.3.3 3.4.1 > Update the incorrect sql example of insert table documentation > -- > > Key: SPARK-44072 > URL: https://issues.apache.org/jira/browse/SPARK-44072 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 3.3.3, 3.4.1, 3.5.0 >Reporter: Yang Zhang >Priority: Major > Fix For: 3.3.3, 3.4.1, 3.5.0 > > > Latest docs of insert table has an incorrect sql example about 'Insert Using > a Typed Date Literal for a Partition Column Value'. > It should be > {code:java} > INSERT OVERWRITE students PARTITION (birthday = date'2019-01-02') > VALUES('Jason Wang', '908 Bird St, Saratoga'); {code} > Doc link: > https://spark.apache.org/docs/latest/sql-ref-syntax-dml-insert-table.html#insert-using-a-typed-date-literal-for-a-partition-column-value-1 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44077) Session Configs were not getting honored in RDDs
[ https://issues.apache.org/jira/browse/SPARK-44077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kapil Singh updated SPARK-44077: Description: When calling SQLConf.get on executors, the configs are read from the local properties on the TaskContext. The local properties are populated driver-side when scheduling the job, using the properties found in sparkContext.localProperties. For RDD actions, local properties were not getting populated. > Session Configs were not getting honored in RDDs > > > Key: SPARK-44077 > URL: https://issues.apache.org/jira/browse/SPARK-44077 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Kapil Singh >Priority: Major > > When calling SQLConf.get on executors, the configs are read from the local > properties on the TaskContext. The local properties are populated driver-side > when scheduling the job, using the properties found in > sparkContext.localProperties. For RDD actions, local properties were not > getting populated. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
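A minimal sketch of the symptom described above, assuming the spark-shell's predefined {{spark}} session; the time-zone config is just an arbitrary example of a SQL session config:

{code:java}
import org.apache.spark.sql.internal.SQLConf

spark.conf.set("spark.sql.session.timeZone", "UTC")

// Dataset actions ship session configs to executors through the TaskContext
// local properties, so SQLConf.get on the executor sees the session value.
spark.range(1).foreach { _ =>
  println(SQLConf.get.sessionLocalTimeZone) // "UTC"
}

// A bare RDD action did not populate those local properties, so SQLConf.get
// on the executor silently fell back to the defaults -- the bug above.
spark.sparkContext.range(0, 1).foreach { _ =>
  println(SQLConf.get.sessionLocalTimeZone) // JVM default, not "UTC"
}
{code}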
[jira] [Created] (SPARK-44077) Session Configs were not getting honored in RDDs
Kapil Singh created SPARK-44077: --- Summary: Session Configs were not getting honored in RDDs Key: SPARK-44077 URL: https://issues.apache.org/jira/browse/SPARK-44077 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.4.0 Reporter: Kapil Singh -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44040) Incorrect result after count distinct
[ https://issues.apache.org/jira/browse/SPARK-44040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-44040: --- Assignee: Yuming Wang > Incorrect result after count distinct > - > > Key: SPARK-44040 > URL: https://issues.apache.org/jira/browse/SPARK-44040 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.2, 3.4.0 >Reporter: Aleksandr Aleksandrov >Assignee: Yuming Wang >Priority: Critical > > When I try to call count after the distinct function for a Decimal null field, > Spark returns an incorrect result starting from Spark 3.4.0. > A minimal example to reproduce: > import org.apache.spark.sql.types._ > import org.apache.spark.sql.{Column, DataFrame, Dataset, Row, SparkSession} > import org.apache.spark.sql.types.{StringType, StructField, StructType} > val schema = StructType( Array( > StructField("money", DecimalType(38,6), true), > StructField("reference_id", StringType, true) > )) > val payDf = spark.createDataFrame(sc.emptyRDD[Row], schema) > val aggDf = payDf.agg(sum("money").as("money")).withColumn("name", lit("df1")) > val aggDf1 = payDf.agg(sum("money").as("money")).withColumn("name", > lit("df2")) > val unionDF: DataFrame = aggDf.union(aggDf1) > unionDF.select("money").distinct.show // return correct result > unionDF.select("money").distinct.count // return 2 instead of 1 > unionDF.select("money").distinct.count == 1 // return false > This block of code returns an assertion error and after that an incorrect > count (in Spark 3.2.1 everything works fine and I get the correct result = 1): > *scala> unionDF.select("money").distinct.show // return correct result* > java.lang.AssertionError: assertion failed: > Decimal$DecimalIsFractional > while compiling: > during phase: globalPhase=terminal, enteringPhase=jvm > library version: version 2.12.17 > compiler version: version 2.12.17 > reconstructed args: -classpath > /Users/aleksandrov/.ivy2/jars/org.apache.spark_spark-connect_2.12-3.4.0.jar:/Users/aleksandrov/.ivy2/jars/io.delta_delta-core_2.12-2.4.0.jar:/Users/aleksandrov/.ivy2/jars/io.delta_delta-storage-2.4.0.jar:/Users/aleksandrov/.ivy2/jars/org.spark-project.spark_unused-1.0.0.jar:/Users/aleksandrov/.ivy2/jars/org.antlr_antlr4-runtime-4.9.3.jar > -Yrepl-class-based -Yrepl-outdir > /private/var/folders/qj/_dn4xbp14jn37qmdk7ylyfwcgr/T/spark-f37bb154-75f3-4db7-aea8-3c4363377bd8/repl-350f37a1-1df1-4816-bd62-97929c60a6c1 > last tree to typer: TypeTree(class Byte) > tree position: line 6 of > tree tpe: Byte > symbol: (final abstract) class Byte in package scala > symbol definition: final abstract class Byte extends (a ClassSymbol) > symbol package: scala > symbol owners: class Byte > call site: constructor $eval in object $eval in package $line19 > == Source file context for tree position == > 3 > 4 object $eval { > 5 lazy val $result = > $line19.$read.INSTANCE.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.res0 > 6 lazy val $print: _root_.java.lang.String = { > 7 $line19.$read.INSTANCE.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw > 8 > 9 "" > at > scala.reflect.internal.SymbolTable.throwAssertionError(SymbolTable.scala:185) > at scala.reflect.internal.Symbols$Symbol.completeInfo(Symbols.scala:1525) > at scala.reflect.internal.Symbols$Symbol.info(Symbols.scala:1514) > at scala.reflect.internal.Symbols$Symbol.flatOwnerInfo(Symbols.scala:2353) > at > scala.reflect.internal.Symbols$ClassSymbol.companionModule0(Symbols.scala:3346) > at > scala.reflect.internal.Symbols$ClassSymbol.companionModule(Symbols.scala:3348) > at > 
scala.reflect.internal.Symbols$ModuleClassSymbol.sourceModule(Symbols.scala:3487) > at > scala.reflect.internal.Symbols.$anonfun$forEachRelevantSymbols$1$adapted(Symbols.scala:3802) > at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) > at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38) > at scala.reflect.internal.Symbols.markFlagsCompleted(Symbols.scala:3799) > at scala.reflect.internal.Symbols.markFlagsCompleted$(Symbols.scala:3805) > at scala.reflect.internal.SymbolTable.markFlagsCompleted(SymbolTable.scala:28) > at > scala.reflect.internal.pickling.UnPickler$Scan.finishSym$1(UnPickler.scala:324) > at > scala.reflect.internal.pickling.UnPickler$Scan.readSymbol(UnPickler.scala:342) > at > scala.reflect.internal.pickling.UnPickler$Scan.readSymbolRef(UnPickler.scala:645) > at > scala.reflect.internal.pickling.UnPickler$Scan.readType(UnPickler.scala:413) > at > scala.reflect.internal.pickling.UnPickler$Scan.$anonfun$readSymbol$10(UnPickler.scala:357) > at scala.reflect.internal.pickling.UnPickler$Scan.at(UnPickler.scala:188) > at >
[jira] [Commented] (SPARK-44075) Make 'transformStatCorr' lazy
[ https://issues.apache.org/jira/browse/SPARK-44075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17733308#comment-17733308 ] Snoot.io commented on SPARK-44075: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/41621 > Make 'transformStatCorr' lazy > - > > Key: SPARK-44075 > URL: https://issues.apache.org/jira/browse/SPARK-44075 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.5.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43928) Add bit operations to Scala and Python
[ https://issues.apache.org/jira/browse/SPARK-43928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17733307#comment-17733307 ] Snoot.io commented on SPARK-43928: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/41608 > Add bit operations to Scala and Python > -- > > Key: SPARK-43928 > URL: https://issues.apache.org/jira/browse/SPARK-43928 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark, SQL >Affects Versions: 3.5.0 >Reporter: Ruifeng Zheng >Priority: Major > > Add following functions: > * bit_and > * bit_count > * bit_get > * bit_or > * bit_xor > * getbit > to: > * Scala API > * Python API > * Spark Connect Scala Client > * Spark Connect Python Client -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
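The SQL built-ins behind these wrappers already exist, so the intended semantics can be sanity-checked from the spark-shell (the literals are arbitrary); a quick sketch:

{code:java}
// bit_count(5) = 2 (binary 101); bit_get/getbit read the bit at a position.
spark.sql("SELECT bit_count(5), bit_get(5, 0), getbit(5, 1)").show()

// Aggregates over column c: 3 AND 5 = 1, 3 OR 5 = 7, 3 XOR 5 = 6.
spark.sql("SELECT bit_and(c), bit_or(c), bit_xor(c) FROM VALUES (3), (5) AS t(c)").show()
{code}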
[jira] [Resolved] (SPARK-44040) Incorrect result after count distinct
[ https://issues.apache.org/jira/browse/SPARK-44040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-44040. - Fix Version/s: 3.3.3 3.5.0 3.4.1 Resolution: Fixed Issue resolved by pull request 41576 [https://github.com/apache/spark/pull/41576] > Incorrect result after count distinct > - > > Key: SPARK-44040 > URL: https://issues.apache.org/jira/browse/SPARK-44040 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.2, 3.4.0 >Reporter: Aleksandr Aleksandrov >Assignee: Yuming Wang >Priority: Critical > Fix For: 3.3.3, 3.5.0, 3.4.1 > > > When I try to call count after the distinct function for a Decimal null field, > Spark returns an incorrect result starting from Spark 3.4.0. > A minimal example to reproduce: > import org.apache.spark.sql.types._ > import org.apache.spark.sql.{Column, DataFrame, Dataset, Row, SparkSession} > import org.apache.spark.sql.types.{StringType, StructField, StructType} > val schema = StructType( Array( > StructField("money", DecimalType(38,6), true), > StructField("reference_id", StringType, true) > )) > val payDf = spark.createDataFrame(sc.emptyRDD[Row], schema) > val aggDf = payDf.agg(sum("money").as("money")).withColumn("name", lit("df1")) > val aggDf1 = payDf.agg(sum("money").as("money")).withColumn("name", > lit("df2")) > val unionDF: DataFrame = aggDf.union(aggDf1) > unionDF.select("money").distinct.show // return correct result > unionDF.select("money").distinct.count // return 2 instead of 1 > unionDF.select("money").distinct.count == 1 // return false > This block of code returns an assertion error and after that an incorrect > count (in Spark 3.2.1 everything works fine and I get the correct result = 1): > *scala> unionDF.select("money").distinct.show // return correct result* > java.lang.AssertionError: assertion failed: > Decimal$DecimalIsFractional > while compiling: > during phase: globalPhase=terminal, enteringPhase=jvm > library version: version 2.12.17 > compiler version: version 2.12.17 > reconstructed args: -classpath > /Users/aleksandrov/.ivy2/jars/org.apache.spark_spark-connect_2.12-3.4.0.jar:/Users/aleksandrov/.ivy2/jars/io.delta_delta-core_2.12-2.4.0.jar:/Users/aleksandrov/.ivy2/jars/io.delta_delta-storage-2.4.0.jar:/Users/aleksandrov/.ivy2/jars/org.spark-project.spark_unused-1.0.0.jar:/Users/aleksandrov/.ivy2/jars/org.antlr_antlr4-runtime-4.9.3.jar > -Yrepl-class-based -Yrepl-outdir > /private/var/folders/qj/_dn4xbp14jn37qmdk7ylyfwcgr/T/spark-f37bb154-75f3-4db7-aea8-3c4363377bd8/repl-350f37a1-1df1-4816-bd62-97929c60a6c1 > last tree to typer: TypeTree(class Byte) > tree position: line 6 of > tree tpe: Byte > symbol: (final abstract) class Byte in package scala > symbol definition: final abstract class Byte extends (a ClassSymbol) > symbol package: scala > symbol owners: class Byte > call site: constructor $eval in object $eval in package $line19 > == Source file context for tree position == > 3 > 4 object $eval { > 5 lazy val $result = > $line19.$read.INSTANCE.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.res0 > 6 lazy val $print: _root_.java.lang.String = { > 7 $line19.$read.INSTANCE.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw.$iw > 8 > 9 "" > at > scala.reflect.internal.SymbolTable.throwAssertionError(SymbolTable.scala:185) > at scala.reflect.internal.Symbols$Symbol.completeInfo(Symbols.scala:1525) > at scala.reflect.internal.Symbols$Symbol.info(Symbols.scala:1514) > at scala.reflect.internal.Symbols$Symbol.flatOwnerInfo(Symbols.scala:2353) > at > 
scala.reflect.internal.Symbols$ClassSymbol.companionModule0(Symbols.scala:3346) > at > scala.reflect.internal.Symbols$ClassSymbol.companionModule(Symbols.scala:3348) > at > scala.reflect.internal.Symbols$ModuleClassSymbol.sourceModule(Symbols.scala:3487) > at > scala.reflect.internal.Symbols.$anonfun$forEachRelevantSymbols$1$adapted(Symbols.scala:3802) > at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) > at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38) > at scala.reflect.internal.Symbols.markFlagsCompleted(Symbols.scala:3799) > at scala.reflect.internal.Symbols.markFlagsCompleted$(Symbols.scala:3805) > at scala.reflect.internal.SymbolTable.markFlagsCompleted(SymbolTable.scala:28) > at > scala.reflect.internal.pickling.UnPickler$Scan.finishSym$1(UnPickler.scala:324) > at > scala.reflect.internal.pickling.UnPickler$Scan.readSymbol(UnPickler.scala:342) > at > scala.reflect.internal.pickling.UnPickler$Scan.readSymbolRef(UnPickler.scala:645) > at > scala.reflect.internal.pickling.UnPickler$Scan.readType(UnPickler.scala:413) > at >
[jira] [Created] (SPARK-44076) SPIP: Python Data Source API
Allison Wang created SPARK-44076: Summary: SPIP: Python Data Source API Key: SPARK-44076 URL: https://issues.apache.org/jira/browse/SPARK-44076 Project: Spark Issue Type: New Feature Components: PySpark Affects Versions: 3.5.0 Reporter: Allison Wang This proposal aims to introduce a simple API in Python for Data Sources. The idea is to enable Python developers to create data sources without having to learn Scala or deal with the complexities of the current data source APIs. The goal is to make a Python-based API that is simple and easy to use, thus making Spark more accessible to the wider Python developer community. This proposed approach is based on the recently introduced Python user-defined table functions (SPARK-43797) with extensions to support data sources. {*}SPIP{*}: [https://docs.google.com/document/d/1oYrCKEKHzznljYfJO4kx5K_Npcgt1Slyfph3NEk7JRU/edit?usp=sharing] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44075) Make 'transformStatCorr' lazy
Ruifeng Zheng created SPARK-44075: - Summary: Make 'transformStatCorr' lazy Key: SPARK-44075 URL: https://issues.apache.org/jira/browse/SPARK-44075 Project: Spark Issue Type: Improvement Components: Connect Affects Versions: 3.5.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43474) Add support to create DataFrame Reference in Spark connect
[ https://issues.apache.org/jira/browse/SPARK-43474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17733305#comment-17733305 ] Snoot.io commented on SPARK-43474: -- User 'rangadi' has created a pull request for this issue: https://github.com/apache/spark/pull/41618 > Add support to create DataFrame Reference in Spark connect > -- > > Key: SPARK-43474 > URL: https://issues.apache.org/jira/browse/SPARK-43474 > Project: Spark > Issue Type: Task > Components: Connect, Structured Streaming >Affects Versions: 3.5.0 >Reporter: Peng Zhong >Priority: Major > > Add support in Spark Connect to cache a DataFrame on server side. From client > side, it can create a reference to that DataFrame given the cache key. > > This function will be used in streaming foreachBatch(). Server needs to call > user function for every batch which takes a DataFrame as argument. With the > new function, we can just cache the DataFrame on the server. Pass the id back > to client which can creates the DataFrame reference. The server will replace > the reference when transforming. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44025) CSV Table Read Error with CharType(length) column
[ https://issues.apache.org/jira/browse/SPARK-44025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17733304#comment-17733304 ] Snoot.io commented on SPARK-44025: -- User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/41564 > CSV Table Read Error with CharType(length) column > - > > Key: SPARK-44025 > URL: https://issues.apache.org/jira/browse/SPARK-44025 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 > Environment: {{apache/spark:v3.4.0 image}} >Reporter: Fengyu Cao >Priority: Major > > Problem: > # read a CSV format table > # table has a `CharType(length)` column > # read table failed with Exception: `org.apache.spark.SparkException: Job > aborted due to stage failure: Task 0 in stage 36.0 failed 4 times, most > recent failure: Lost task 0.3 in stage 36.0 (TID 72) (10.113.9.208 executor > 11): java.lang.IllegalArgumentException: requirement failed: requiredSchema > (struct) should be the subset of dataSchema > (struct).` > > reproduce with official image: > # {{docker run -it apache/spark:v3.4.0 /opt/spark/bin/spark-sql}} > # {{CREATE TABLE csv_bug (name STRING, age INT, job CHAR(4)) USING CSV > OPTIONS ('header' = 'true', 'sep' = ';') LOCATION > "/opt/spark/examples/src/main/resources/people.csv";}} > # SELECT * FROM csv_bug; > # ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) > java.lang.IllegalArgumentException: requirement failed: requiredSchema > (struct) should be the subset of dataSchema > (struct). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
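A possible workaround sketch until the bug is fixed, assuming the goal is simply to read the table: declare the column as STRING instead of CHAR(4) (Spark stores CHAR/VARCHAR columns as strings internally), so that requiredSchema and dataSchema agree; csv_ok is a hypothetical table name:

{code:java}
spark.sql("""
  CREATE TABLE csv_ok (name STRING, age INT, job STRING)
  USING CSV OPTIONS ('header' = 'true', 'sep' = ';')
  LOCATION '/opt/spark/examples/src/main/resources/people.csv'
""")
spark.sql("SELECT * FROM csv_ok").show()
{code}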
[jira] [Commented] (SPARK-44072) Update the incorrect sql example of insert table documentation
[ https://issues.apache.org/jira/browse/SPARK-44072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17733303#comment-17733303 ] Snoot.io commented on SPARK-44072: -- User 'Yohahaha' has created a pull request for this issue: https://github.com/apache/spark/pull/41619 > Update the incorrect sql example of insert table documentation > -- > > Key: SPARK-44072 > URL: https://issues.apache.org/jira/browse/SPARK-44072 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 3.3.3, 3.4.1, 3.5.0 >Reporter: Yang Zhang >Priority: Major > Fix For: 3.5.0 > > > Latest docs of insert table has an incorrect sql example about 'Insert Using > a Typed Date Literal for a Partition Column Value'. > It should be > {code:java} > INSERT OVERWRITE students PARTITION (birthday = date'2019-01-02') > VALUES('Jason Wang', '908 Bird St, Saratoga'); {code} > Doc link: > https://spark.apache.org/docs/latest/sql-ref-syntax-dml-insert-table.html#insert-using-a-typed-date-literal-for-a-partition-column-value-1 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44060) Code-gen for build side outer shuffled hash join
[ https://issues.apache.org/jira/browse/SPARK-44060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17733302#comment-17733302 ] Snoot.io commented on SPARK-44060: -- User 'szehon-ho' has created a pull request for this issue: https://github.com/apache/spark/pull/41614 > Code-gen for build side outer shuffled hash join > > > Key: SPARK-44060 > URL: https://issues.apache.org/jira/browse/SPARK-44060 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: Szehon Ho >Priority: Major > > Here, build side outer join means LEFT OUTER join with build left, or RIGHT > OUTER join with build right. > As a followup for https://github.com/apache/spark/pull/41398/ SPARK-36612 > (non-codegen build-side outer shuffled hash join), this task is to add > code-gen for it. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44060) Code-gen for build side outer shuffled hash join
[ https://issues.apache.org/jira/browse/SPARK-44060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17733301#comment-17733301 ] Snoot.io commented on SPARK-44060: -- User 'szehon-ho' has created a pull request for this issue: https://github.com/apache/spark/pull/41614 > Code-gen for build side outer shuffled hash join > > > Key: SPARK-44060 > URL: https://issues.apache.org/jira/browse/SPARK-44060 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: Szehon Ho >Priority: Major > > Here, build side outer join means LEFT OUTER join with build left, or RIGHT > OUTER join with build right. > As a followup for https://github.com/apache/spark/pull/41398/ SPARK-36612 > (non-codegen build-side outer shuffled hash join), this task is to add > code-gen for it. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
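For orientation, a sketch of the join shape this ticket targets, using the standard SHUFFLE_HASH hint in the spark-shell; t1 and t2 are throwaway views. Hinting the preserved (left) side of a LEFT OUTER join makes it the build side, which SPARK-36612 enabled without codegen and this ticket would compile:

{code:java}
spark.range(10).toDF("id").createOrReplaceTempView("t1")
spark.range(5).selectExpr("id", "id * 2 AS v").createOrReplaceTempView("t2")

val q = spark.sql("""
  SELECT /*+ SHUFFLE_HASH(t1) */ t1.id, t2.v
  FROM t1 LEFT OUTER JOIN t2 ON t1.id = t2.id
""")
q.explain() // expect ShuffledHashJoin ... BuildLeft, LeftOuter
{code}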
[jira] [Commented] (SPARK-44065) Optimize BroadcastHashJoin skew when localShuffleReader is disabled
[ https://issues.apache.org/jira/browse/SPARK-44065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17733300#comment-17733300 ] GridGain Integration commented on SPARK-44065: -- User 'wForget' has created a pull request for this issue: https://github.com/apache/spark/pull/41609 > Optimize BroadcastHashJoin skew when localShuffleReader is disabled > --- > > Key: SPARK-44065 > URL: https://issues.apache.org/jira/browse/SPARK-44065 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Zhen Wang >Priority: Major > > In RemoteShuffleService services such as uniffle and celeborn, it is > recommended to disable localShuffleReader by default for better performance. > But it may make BroadcastHashJoin skewed, so I want to optimize > BroadcastHashJoin skew in OptimizeSkewedJoin when localShuffleReader is > disabled. > > Refer to: > https://github.com/apache/incubator-celeborn#spark-configuration > https://github.com/apache/incubator-uniffle/blob/master/docs/client_guide.md#support-spark-aqe -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
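The setup the ticket describes maps to two real Spark SQL configs; a spark-shell sketch:

{code:java}
// AQE on (the default in recent releases), but the local shuffle reader off,
// as remote shuffle services such as Celeborn and Uniffle recommend.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.localShuffleReader.enabled", "false")
{code}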
[jira] [Created] (SPARK-44074) `Logging plan changes for execution` test failed
Yang Jie created SPARK-44074: Summary: `Logging plan changes for execution` test failed Key: SPARK-44074 URL: https://issues.apache.org/jira/browse/SPARK-44074 Project: Spark Issue Type: Bug Components: SQL, Tests Affects Versions: 3.5.0 Reporter: Yang Jie run {{build/sbt clean "sql/test" -Dtest.exclude.tags=org.apache.spark.tags.ExtendedSQLTest,org.apache.spark.tags.SlowSQLTest}} {code:java} 2023-06-15T19:58:34.4105460Z [info] QueryExecutionSuite: 2023-06-15T19:58:34.5395268Z [info] - dumping query execution info to a file (77 milliseconds) 2023-06-15T19:58:34.5856902Z [info] - dumping query execution info to an existing file (49 milliseconds) 2023-06-15T19:58:34.6099849Z [info] - dumping query execution info to non-existing folder (25 milliseconds) 2023-06-15T19:58:34.6136467Z [info] - dumping query execution info by invalid path (4 milliseconds) 2023-06-15T19:58:34.6425071Z [info] - dumping query execution info to a file - explainMode=formatted (28 milliseconds) 2023-06-15T19:58:34.7084916Z [info] - limit number of fields by sql config (66 milliseconds) 2023-06-15T19:58:34.7432299Z [info] - check maximum fields restriction (34 milliseconds) 2023-06-15T19:58:34.7554546Z [info] - toString() exception/error handling (11 milliseconds) 2023-06-15T19:58:34.7621424Z [info] - SPARK-28346: clone the query plan between different stages (6 milliseconds) 2023-06-15T19:58:34.8001412Z [info] - Logging plan changes for execution *** FAILED *** (12 milliseconds) 2023-06-15T19:58:34.8007977Z [info] testAppender.loggingEvents.exists(((x$10: org.apache.logging.log4j.core.LogEvent) => x$10.getMessage().getFormattedMessage().contains(expectedMsg))) was false (QueryExecutionSuite.scala:232) {code} but running {{build/sbt "sql/testOnly *QueryExecutionSuite"}} does not reproduce the issue; this needs investigation. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43929) Add date time functions to Scala and Python - part 1
[ https://issues.apache.org/jira/browse/SPARK-43929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng updated SPARK-43929: -- Description: Add following functions: * date_diff * date_from_unix_date * date_part * dateadd * datepart * day to: * Scala API * Python API * Spark Connect Scala Client * Spark Connect Python Client was: Add following functions: * date_diff * date_from_unix_date * date_part * dateadd * datepart * day * weekday * convert_timezone * extract * now * timestamp_micros * timestamp_millis to: * Scala API * Python API * Spark Connect Scala Client * Spark Connect Python Client > Add date time functions to Scala and Python - part 1 > > > Key: SPARK-43929 > URL: https://issues.apache.org/jira/browse/SPARK-43929 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark, SQL >Affects Versions: 3.5.0 >Reporter: Ruifeng Zheng >Priority: Major > > Add following functions: > * date_diff > * date_from_unix_date > * date_part > * dateadd > * datepart > * day > to: > * Scala API > * Python API > * Spark Connect Scala Client > * Spark Connect Python Client -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
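Several of the part-1 names already exist as SQL built-ins, which pins down the semantics the new wrappers should expose; a spark-shell sketch:

{code:java}
// day = 2, date_part = 2019 for the sample date.
spark.sql("SELECT day(date'2019-01-02'), date_part('YEAR', date'2019-01-02')").show()

// Days since the Unix epoch: date_from_unix_date(1) = 1970-01-02.
spark.sql("SELECT date_from_unix_date(1)").show()
{code}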
[jira] [Created] (SPARK-44073) Add date time functions to Scala and Python - part 2
Ruifeng Zheng created SPARK-44073: - Summary: Add date time functions to Scala and Python - part 2 Key: SPARK-44073 URL: https://issues.apache.org/jira/browse/SPARK-44073 Project: Spark Issue Type: Sub-task Components: Connect, PySpark, SQL Affects Versions: 3.5.0 Reporter: Ruifeng Zheng Add following functions: * weekday * convert_timezone * extract * now * timestamp_micros * timestamp_millis to: * Scala API * Python API * Spark Connect Scala Client * Spark Connect Python Client -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
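As with part 1, the part-2 names map to existing SQL built-ins; a spark-shell sketch (the timestamp_millis output is rendered in the session time zone):

{code:java}
// weekday counts from Monday = 0, so 2023-06-16 (a Friday) gives 4.
spark.sql("SELECT weekday(date'2023-06-16'), extract(YEAR FROM date'2023-06-16')").show()

// Milliseconds since the Unix epoch, plus the current timestamp.
spark.sql("SELECT timestamp_millis(0), now()").show()
{code}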
[jira] [Updated] (SPARK-43929) Add date time functions to Scala and Python - part 1
[ https://issues.apache.org/jira/browse/SPARK-43929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng updated SPARK-43929: -- Summary: Add date time functions to Scala and Python - part 1 (was: Add date time functions to Scala and Python) > Add date time functions to Scala and Python - part 1 > > > Key: SPARK-43929 > URL: https://issues.apache.org/jira/browse/SPARK-43929 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark, SQL >Affects Versions: 3.5.0 >Reporter: Ruifeng Zheng >Priority: Major > > Add following functions: > * date_diff > * date_from_unix_date > * date_part > * dateadd > * datepart > * day > * weekday > * convert_timezone > * extract > * now > * timestamp_micros > * timestamp_millis > to: > * Scala API > * Python API > * Spark Connect Scala Client > * Spark Connect Python Client -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44072) Update the incorrect sql example of insert table documentation
[ https://issues.apache.org/jira/browse/SPARK-44072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-44072. -- Fix Version/s: (was: 3.4.1) (was: 3.3.3) Resolution: Fixed Issue resolved by pull request 41619 [https://github.com/apache/spark/pull/41619] > Update the incorrect sql example of insert table documentation > -- > > Key: SPARK-44072 > URL: https://issues.apache.org/jira/browse/SPARK-44072 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 3.3.3, 3.4.1, 3.5.0 >Reporter: Yang Zhang >Priority: Major > Fix For: 3.5.0 > > > Latest docs of insert table has an incorrect sql example about 'Insert Using > a Typed Date Literal for a Partition Column Value'. > It should be > {code:java} > INSERT OVERWRITE students PARTITION (birthday = date'2019-01-02') > VALUES('Jason Wang', '908 Bird St, Saratoga'); {code} > Doc link: > https://spark.apache.org/docs/latest/sql-ref-syntax-dml-insert-table.html#insert-using-a-typed-date-literal-for-a-partition-column-value-1 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-41599) Memory leak in FileSystem.CACHE when submitting apps to secure cluster using InProcessLauncher
[ https://issues.apache.org/jira/browse/SPARK-41599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17733286#comment-17733286 ] Xieming Li edited comment on SPARK-41599 at 6/16/23 3:01 AM: - [~ste...@apache.org] [~maciejsmolenski] I am having this issue as well. Could you please guide me on how to "explicitly disable the cache for that filesystem schema"? I am trying to add the following configurations in my core-site.xml, but am not sure if this is the right way. {code:java} <property> <name>fs.hdfs.impl.disable.cache</name> <value>true</value> </property> <property> <name>fs.viewfs.impl.disable.cache</name> <value>true</value> </property> {code} was (Author: risyomei): [~ste...@apache.org] [~maciejsmolenski] I am having this issue as well. Could you please guide me on how to "explicitly disable the cache for that filesystem schema"? > Memory leak in FileSystem.CACHE when submitting apps to secure cluster using > InProcessLauncher > -- > > Key: SPARK-41599 > URL: https://issues.apache.org/jira/browse/SPARK-41599 > Project: Spark > Issue Type: Bug > Components: Deploy, YARN >Affects Versions: 3.1.2 >Reporter: Maciej Smolenski >Priority: Major > Attachments: InProcLaunchFsIssue.scala, > SPARK-41599-fixes-to-limit-FileSystem-CACHE-size-when-using-InProcessLauncher.diff > > > When submitting spark application in kerberos environment the credentials of > 'current user' (UserGroupInformation.getCurrentUser()) are being modified. > Filesystem.CACHE entries contain 'current user' (with user credentials) as a > key. > Submitting many spark applications using InProcessLauncher cause that > FileSystem.CACHE becomes bigger and bigger. > Finally process exits because of OutOfMemory error. > Code for reproduction attached. > > Output from running 'jmap -histo' on reproduction jvm shows that the number > of FileSystem$Cache$Key increases in time: > time: #instances class > 1671533274: 2 org.apache.hadoop.fs.FileSystem$Cache$Key > 167155: 11 org.apache.hadoop.fs.FileSystem$Cache$Key > 1671533395: 21 org.apache.hadoop.fs.FileSystem$Cache$Key > 1671533455: 30 org.apache.hadoop.fs.FileSystem$Cache$Key > 1671533515: 39 org.apache.hadoop.fs.FileSystem$Cache$Key > 1671533576: 48 org.apache.hadoop.fs.FileSystem$Cache$Key > 1671533636: 57 org.apache.hadoop.fs.FileSystem$Cache$Key > 1671533696: 66 org.apache.hadoop.fs.FileSystem$Cache$Key > 1671533757: 75 org.apache.hadoop.fs.FileSystem$Cache$Key > 1671533817: 84 org.apache.hadoop.fs.FileSystem$Cache$Key > 1671533877: 93 org.apache.hadoop.fs.FileSystem$Cache$Key > 1671533937: 102 org.apache.hadoop.fs.FileSystem$Cache$Key > 1671533998: 111 org.apache.hadoop.fs.FileSystem$Cache$Key > 1671534058: 120 org.apache.hadoop.fs.FileSystem$Cache$Key > 1671534118: 135 org.apache.hadoop.fs.FileSystem$Cache$Key > 1671534178: 140 org.apache.hadoop.fs.FileSystem$Cache$Key > 1671534239: 150 org.apache.hadoop.fs.FileSystem$Cache$Key > 1671534299: 159 org.apache.hadoop.fs.FileSystem$Cache$Key > 1671534359: 168 org.apache.hadoop.fs.FileSystem$Cache$Key > 1671534419: 177 org.apache.hadoop.fs.FileSystem$Cache$Key > 1671534480: 186 org.apache.hadoop.fs.FileSystem$Cache$Key > 1671534540: 195 org.apache.hadoop.fs.FileSystem$Cache$Key > 1671534600: 204 org.apache.hadoop.fs.FileSystem$Cache$Key > 1671534661: 213 org.apache.hadoop.fs.FileSystem$Cache$Key > 1671534721: 222 org.apache.hadoop.fs.FileSystem$Cache$Key > 1671534781: 231 org.apache.hadoop.fs.FileSystem$Cache$Key > 1671534841: 240 org.apache.hadoop.fs.FileSystem$Cache$Key > 1671534902: 249 org.apache.hadoop.fs.FileSystem$Cache$Key > 1671534962: 
257 org.apache.hadoop.fs.FileSystem$Cache$Key > 1671535022: 264 org.apache.hadoop.fs.FileSystem$Cache$Key > 1671535083: 273 org.apache.hadoop.fs.FileSystem$Cache$Key > 1671535143: 282 org.apache.hadoop.fs.FileSystem$Cache$Key > 1671535203: 291 org.apache.hadoop.fs.FileSystem$Cache$Key -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
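For completeness, the same switch the comment above asks about can also be set programmatically; fs.<scheme>.impl.disable.cache is the standard Hadoop FileSystem knob, applied here through the Spark context's Hadoop configuration in a spark-shell sketch:

{code:java}
// Disable the FileSystem cache for the hdfs:// and viewfs:// schemes.
spark.sparkContext.hadoopConfiguration.setBoolean("fs.hdfs.impl.disable.cache", true)
spark.sparkContext.hadoopConfiguration.setBoolean("fs.viewfs.impl.disable.cache", true)
{code}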
[jira] [Commented] (SPARK-41599) Memory leak in FileSystem.CACHE when submitting apps to secure cluster using InProcessLauncher
[ https://issues.apache.org/jira/browse/SPARK-41599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17733286#comment-17733286 ] Xieming Li commented on SPARK-41599: [~ste...@apache.org] [~maciejsmolenski] I am having this issue as well. I'm encountering the same problem. Could you please guide me on how to "explicitly disable the cache for that filesystem schema"? > Memory leak in FileSystem.CACHE when submitting apps to secure cluster using > InProcessLauncher > -- > > Key: SPARK-41599 > URL: https://issues.apache.org/jira/browse/SPARK-41599 > Project: Spark > Issue Type: Bug > Components: Deploy, YARN >Affects Versions: 3.1.2 >Reporter: Maciej Smolenski >Priority: Major > Attachments: InProcLaunchFsIssue.scala, > SPARK-41599-fixes-to-limit-FileSystem-CACHE-size-when-using-InProcessLauncher.diff > > > When submitting spark application in kerberos environment the credentials of > 'current user' (UserGroupInformation.getCurrentUser()) are being modified. > Filesystem.CACHE entries contain 'current user' (with user credentials) as a > key. > Submitting many spark applications using InProcessLauncher cause that > FileSystem.CACHE becomes bigger and bigger. > Finally process exits because of OutOfMemory error. > Code for reproduction attached. > > Output from running 'jmap -histo' on reproduction jvm shows that the number > of FileSystem$Cache$Key increases in time: > time: #instances class > 1671533274: 2 org.apache.hadoop.fs.FileSystem$Cache$Key > 167155: 11 org.apache.hadoop.fs.FileSystem$Cache$Key > 1671533395: 21 org.apache.hadoop.fs.FileSystem$Cache$Key > 1671533455: 30 org.apache.hadoop.fs.FileSystem$Cache$Key > 1671533515: 39 org.apache.hadoop.fs.FileSystem$Cache$Key > 1671533576: 48 org.apache.hadoop.fs.FileSystem$Cache$Key > 1671533636: 57 org.apache.hadoop.fs.FileSystem$Cache$Key > 1671533696: 66 org.apache.hadoop.fs.FileSystem$Cache$Key > 1671533757: 75 org.apache.hadoop.fs.FileSystem$Cache$Key > 1671533817: 84 org.apache.hadoop.fs.FileSystem$Cache$Key > 1671533877: 93 org.apache.hadoop.fs.FileSystem$Cache$Key > 1671533937: 102 org.apache.hadoop.fs.FileSystem$Cache$Key > 1671533998: 111 org.apache.hadoop.fs.FileSystem$Cache$Key > 1671534058: 120 org.apache.hadoop.fs.FileSystem$Cache$Key > 1671534118: 135 org.apache.hadoop.fs.FileSystem$Cache$Key > 1671534178: 140 org.apache.hadoop.fs.FileSystem$Cache$Key > 1671534239: 150 org.apache.hadoop.fs.FileSystem$Cache$Key > 1671534299: 159 org.apache.hadoop.fs.FileSystem$Cache$Key > 1671534359: 168 org.apache.hadoop.fs.FileSystem$Cache$Key > 1671534419: 177 org.apache.hadoop.fs.FileSystem$Cache$Key > 1671534480: 186 org.apache.hadoop.fs.FileSystem$Cache$Key > 1671534540: 195 org.apache.hadoop.fs.FileSystem$Cache$Key > 1671534600: 204 org.apache.hadoop.fs.FileSystem$Cache$Key > 1671534661: 213 org.apache.hadoop.fs.FileSystem$Cache$Key > 1671534721: 222 org.apache.hadoop.fs.FileSystem$Cache$Key > 1671534781: 231 org.apache.hadoop.fs.FileSystem$Cache$Key > 1671534841: 240 org.apache.hadoop.fs.FileSystem$Cache$Key > 1671534902: 249 org.apache.hadoop.fs.FileSystem$Cache$Key > 1671534962: 257 org.apache.hadoop.fs.FileSystem$Cache$Key > 1671535022: 264 org.apache.hadoop.fs.FileSystem$Cache$Key > 1671535083: 273 org.apache.hadoop.fs.FileSystem$Cache$Key > 1671535143: 282 org.apache.hadoop.fs.FileSystem$Cache$Key > 1671535203: 291 org.apache.hadoop.fs.FileSystem$Cache$Key -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: 
issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44072) Update the incorrect sql example of insert table documentation
[ https://issues.apache.org/jira/browse/SPARK-44072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Zhang updated SPARK-44072: --- Description: The latest insert table docs have an incorrect SQL example for 'Insert Using a Typed Date Literal for a Partition Column Value'. It should be {code:java} INSERT OVERWRITE students PARTITION (birthday = date'2019-01-02') VALUES('Jason Wang', '908 Bird St, Saratoga'); {code} Doc link: https://spark.apache.org/docs/latest/sql-ref-syntax-dml-insert-table.html#insert-using-a-typed-date-literal-for-a-partition-column-value-1 was: The latest insert table docs have an incorrect SQL example for 'Insert Using a Typed Date Literal for a Partition Column Value'. It should be {code:java} INSERT OVERWRITE students PARTITION (birthday = date'2019-01-02') VALUES('Jason Wang', '908 Bird St, Saratoga'); {code} > Update the incorrect sql example of insert table documentation > -- > > Key: SPARK-44072 > URL: https://issues.apache.org/jira/browse/SPARK-44072 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 3.3.3, 3.4.1, 3.5.0 >Reporter: Yang Zhang >Priority: Major > Fix For: 3.3.3, 3.4.1, 3.5.0 > > > The latest insert table docs have an incorrect SQL example for 'Insert Using > a Typed Date Literal for a Partition Column Value'. > It should be > {code:java} > INSERT OVERWRITE students PARTITION (birthday = date'2019-01-02') > VALUES('Jason Wang', '908 Bird St, Saratoga'); {code} > Doc link: > https://spark.apache.org/docs/latest/sql-ref-syntax-dml-insert-table.html#insert-using-a-typed-date-literal-for-a-partition-column-value-1 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44072) Update the incorrect sql example of insert table documentation
[ https://issues.apache.org/jira/browse/SPARK-44072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Zhang updated SPARK-44072: --- Description: The latest insert table docs have an incorrect SQL example for 'Insert Using a Typed Date Literal for a Partition Column Value'. It should be {code:java} INSERT OVERWRITE students PARTITION (birthday = date'2019-01-02') VALUES('Jason Wang', '908 Bird St, Saratoga'); {code} > Update the incorrect sql example of insert table documentation > -- > > Key: SPARK-44072 > URL: https://issues.apache.org/jira/browse/SPARK-44072 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 3.3.3, 3.4.1, 3.5.0 >Reporter: Yang Zhang >Priority: Major > Fix For: 3.3.3, 3.4.1, 3.5.0 > > > The latest insert table docs have an incorrect SQL example for 'Insert Using > a Typed Date Literal for a Partition Column Value'. > It should be > {code:java} > INSERT OVERWRITE students PARTITION (birthday = date'2019-01-02') > VALUES('Jason Wang', '908 Bird St, Saratoga'); {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43201) Inconsistency between from_avro and from_json function
[ https://issues.apache.org/jira/browse/SPARK-43201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17733281#comment-17733281 ] Jia Fan commented on SPARK-43201: - If avroSchema1 does not equal avroSchema2, the DataFrame's schema would differ from row to row, which is a problem. > Inconsistency between from_avro and from_json function > -- > > Key: SPARK-43201 > URL: https://issues.apache.org/jira/browse/SPARK-43201 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Philip Adetiloye >Priority: Major > > Spark from_avro function does not allow schema parameter to use dataframe > column but takes only a String schema: > {code:java} > def from_avro(col: Column, jsonFormatSchema: String): Column {code} > This makes it impossible to deserialize rows of Avro records with different > schema since only one schema string could be pass externally. > > Here is what I would expect like from_json function: > {code:java} > def from_avro(col: Column, jsonFormatSchema: Column): Column {code} > code example: > {code:java} > import org.apache.spark.sql.functions.from_avro > val avroSchema1 = > """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}""" > > val avroSchema2 = > """{"type":"record","name":"myrecord","fields":[{"name":"str1","type":"string"},{"name":"str2","type":"string"}]}""" > val df = Seq( > (Array[Byte](10, 97, 112, 112, 108, 101, 49, 0), avroSchema1), > (Array[Byte](10, 97, 112, 112, 108, 101, 50, 0), avroSchema2) > ).toDF("binaryData", "schema") > val parsed = df.select(from_avro($"binaryData", $"schema").as("parsedData")) > parsed.show() > // Output: > // ++ > // | parsedData| > // ++ > // |[apple1, 1.0]| > // |[apple2, 2.0]| > // ++ > {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
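For contrast with the request above: from_json already has a Column-typed schema overload today, but the schema expression must still be foldable (effectively one schema per query), which is the constraint Jia Fan's comment points at; a spark-shell sketch (spark.implicits._ is pre-imported there):

{code:java}
import org.apache.spark.sql.functions.{col, from_json, lit}

val df = Seq("""{"a": 1}""", """{"a": 2}""").toDF("json")
// The schema argument is a Column, but it must be foldable, e.g. a literal.
df.select(from_json(col("json"), lit("a INT"))).show()
{code}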
[jira] [Created] (SPARK-44072) Update the incorrect sql example of insert table documentation
Yang Zhang created SPARK-44072: -- Summary: Update the incorrect sql example of insert table documentation Key: SPARK-44072 URL: https://issues.apache.org/jira/browse/SPARK-44072 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 3.3.3, 3.4.1, 3.5.0 Reporter: Yang Zhang Fix For: 3.3.3, 3.4.1, 3.5.0 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44065) Optimize BroadcastHashJoin skew when localShuffleReader is disabled
[ https://issues.apache.org/jira/browse/SPARK-44065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17733271#comment-17733271 ] Zhen Wang commented on SPARK-44065: --- https://github.com/apache/spark/pull/41609 > Optimize BroadcastHashJoin skew when localShuffleReader is disabled > --- > > Key: SPARK-44065 > URL: https://issues.apache.org/jira/browse/SPARK-44065 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Zhen Wang >Priority: Major > > In RemoteShuffleService services such as uniffle and celeborn, it is > recommended to disable localShuffleReader by default for better performance. > But it may make BroadcastHashJoin skewed, so I want to optimize > BroadcastHashJoin skew in OptimizeSkewedJoin when localShuffleReader is > disabled. > > Refer to: > https://github.com/apache/incubator-celeborn#spark-configuration > https://github.com/apache/incubator-uniffle/blob/master/docs/client_guide.md#support-spark-aqe -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-43937) Add ifnull,isnotnull,equal_null,nullif,nvl,nvl2 to Scala and Python
[ https://issues.apache.org/jira/browse/SPARK-43937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-43937: - Assignee: BingKun Pan > Add ifnull,isnotnull,equal_null,nullif,nvl,nvl2 to Scala and Python > --- > > Key: SPARK-43937 > URL: https://issues.apache.org/jira/browse/SPARK-43937 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark, SQL >Affects Versions: 3.5.0 >Reporter: Ruifeng Zheng >Assignee: BingKun Pan >Priority: Major > > Add following functions: > * -not- > * -if- > * ifnull > * isnotnull > * equal_null > * nullif > * nvl > * nvl2 > to: > * Scala API > * Python API > * Spark Connect Scala Client > * Spark Connect Python Client -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43937) Add ifnull,isnotnull,equal_null,nullif,nvl,nvl2 to Scala and Python
[ https://issues.apache.org/jira/browse/SPARK-43937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-43937. --- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 41534 [https://github.com/apache/spark/pull/41534] > Add ifnull,isnotnull,equal_null,nullif,nvl,nvl2 to Scala and Python > --- > > Key: SPARK-43937 > URL: https://issues.apache.org/jira/browse/SPARK-43937 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark, SQL >Affects Versions: 3.5.0 >Reporter: Ruifeng Zheng >Assignee: BingKun Pan >Priority: Major > Fix For: 3.5.0 > > > Add following functions: > * -not- > * -if- > * ifnull > * isnotnull > * equal_null > * nullif > * nvl > * nvl2 > to: > * Scala API > * Python API > * Spark Connect Scala Client > * Spark Connect Python Client -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
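Most of these names are long-standing SQL built-ins, so the wrappers' semantics can be cross-checked from SQL (equal_null is newer and omitted here); a spark-shell sketch:

{code:java}
// ifnull -> 1, nullif -> NULL, nvl -> 3, nvl2 -> 5, isnotnull -> true.
spark.sql("SELECT ifnull(NULL, 1), nullif(2, 2), nvl(NULL, 3), nvl2(NULL, 4, 5), isnotnull(6)").show()
{code}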
[jira] [Updated] (SPARK-43925) Add some, bool_or,bool_and,every to Scala and Python
[ https://issues.apache.org/jira/browse/SPARK-43925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng updated SPARK-43925: -- Description: Add following functions: * -any- * some * bool_or * bool_and * every to: * Scala API * Python API * Spark Connect Scala Client * Spark Connect Python Client was: Add following functions: * any * some * bool_or * bool_and * every to: * Scala API * Python API * Spark Connect Scala Client * Spark Connect Python Client > Add some, bool_or,bool_and,every to Scala and Python > > > Key: SPARK-43925 > URL: https://issues.apache.org/jira/browse/SPARK-43925 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark, SQL >Affects Versions: 3.5.0 >Reporter: Ruifeng Zheng >Priority: Major > > Add following functions: > * -any- > * some > * bool_or > * bool_and > * every > to: > * Scala API > * Python API > * Spark Connect Scala Client > * Spark Connect Python Client -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
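These aggregates already exist as SQL built-ins, which fixes the semantics the Scala/Python wrappers should expose; a spark-shell sketch:

{code:java}
// Over (true), (false): bool_or/some -> true, bool_and/every -> false.
spark.sql("SELECT bool_or(c), bool_and(c), every(c), some(c) FROM VALUES (true), (false) AS t(c)").show()
{code}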
[jira] [Updated] (SPARK-43925) Add some, bool_or,bool_and,every to Scala and Python
[ https://issues.apache.org/jira/browse/SPARK-43925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng updated SPARK-43925: -- Summary: Add some, bool_or,bool_and,every to Scala and Python (was: Add any, some, bool_or,bool_and,every to Scala and Python) > Add some, bool_or,bool_and,every to Scala and Python > > > Key: SPARK-43925 > URL: https://issues.apache.org/jira/browse/SPARK-43925 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark, SQL >Affects Versions: 3.5.0 >Reporter: Ruifeng Zheng >Priority: Major > > Add following functions: > * any > * some > * bool_or > * bool_and > * every > to: > * Scala API > * Python API > * Spark Connect Scala Client > * Spark Connect Python Client -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44071) Define UnresolvedNode trait to reduce redundancy
Ryan Johnson created SPARK-44071: Summary: Define UnresolvedNode trait to reduce redundancy Key: SPARK-44071 URL: https://issues.apache.org/jira/browse/SPARK-44071 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 3.5.0 Reporter: Ryan Johnson Looking at [unresolved.scala|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/unresolved.scala], Spark would benefit from an {{UnresolvedNode}} trait that various {{UnresolvedFoo}} classes could inherit from: {code:java} trait UnresolvedNode extends LogicalPlan { override def output: Seq[Attribute] = Nil override lazy val resolved = false }{code} Today, the code is duplicated in ~20 locations. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
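A sketch of the payoff, assuming the proposed trait lands as written above; UnresolvedFoo is a hypothetical node, not an existing class:

{code:java}
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.catalyst.plans.logical.{LeafNode, LogicalPlan}

trait UnresolvedNode extends LogicalPlan {
  override def output: Seq[Attribute] = Nil
  override lazy val resolved = false
}

// The boilerplate `output`/`resolved` overrides disappear from each node.
case class UnresolvedFoo(name: String) extends LeafNode with UnresolvedNode
{code}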
[jira] [Commented] (SPARK-43511) Implemented State APIs for Spark Connect Scala
[ https://issues.apache.org/jira/browse/SPARK-43511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17733195#comment-17733195 ] GridGain Integration commented on SPARK-43511: -- User 'bogao007' has created a pull request for this issue: https://github.com/apache/spark/pull/41558 > Implemented State APIs for Spark Connect Scala > -- > > Key: SPARK-43511 > URL: https://issues.apache.org/jira/browse/SPARK-43511 > Project: Spark > Issue Type: Task > Components: Connect, Structured Streaming >Affects Versions: 3.5.0 >Reporter: Bo Gao >Priority: Major > > Implemented MapGroupsWithState and FlatMapGroupsWithState APIs for Spark > Connect Scala -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44070) Bump snappy-java 1.1.10.1
Cheng Pan created SPARK-44070: - Summary: Bump snappy-java 1.1.10.1 Key: SPARK-44070 URL: https://issues.apache.org/jira/browse/SPARK-44070 Project: Spark Issue Type: Dependency upgrade Components: Build Affects Versions: 3.5.0 Reporter: Cheng Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44055) Remove redundant `override` from `CheckpointRDD`
[ https://issues.apache.org/jira/browse/SPARK-44055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie reassigned SPARK-44055: Assignee: Yang Jie > Remove redundant `override` from `CheckpointRDD` > > > Key: SPARK-44055 > URL: https://issues.apache.org/jira/browse/SPARK-44055 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-44055) Remove redundant `override` from `CheckpointRDD`
[ https://issues.apache.org/jira/browse/SPARK-44055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie resolved SPARK-44055. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 41597 [https://github.com/apache/spark/pull/41597] > Remove redundant `override` from `CheckpointRDD` > > > Key: SPARK-44055 > URL: https://issues.apache.org/jira/browse/SPARK-44055 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44069) maven test ReplSuite failed
[ https://issues.apache.org/jira/browse/SPARK-44069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-44069: - Description: https://github.com/LuciferYang/spark/actions/runs/5274544416/jobs/9541917589 (was:
{code:java}
./build/mvn -DskipTests -Pyarn -Pmesos -Pkubernetes -Pvolcano -Phive -Phive-thriftserver -Phadoop-cloud -Pspark-ganglia-lgpl clean install
build/mvn test -pl repl{code}
{code:java}
ReplSuite:
Spark context available as 'sc' (master = local, app id = local-1686829049116).
Spark session available as 'spark'.
- SPARK-15236: use Hive catalog *** FAILED ***
  isContain was true Interpreter output contained 'Exception':
  Welcome to
        ____              __
       / __/__  ___ _____/ /__
      _\ \/ _ \/ _ `/ __/  '_/
     /___/ .__/\_,_/_/ /_/\_\   version 3.5.0-SNAPSHOT
        /_/

  Using Scala version 2.12.17 (OpenJDK 64-Bit Server VM, Java 1.8.0_372)
  Type in expressions to have them evaluated.
  Type :help for more information.

  scala>
  scala> java.lang.NoClassDefFoundError: org/sparkproject/guava/cache/CacheBuilder
    at org.apache.spark.sql.catalyst.catalog.SessionCatalog.<init>(SessionCatalog.scala:197)
    at org.apache.spark.sql.internal.BaseSessionStateBuilder.catalog$lzycompute(BaseSessionStateBuilder.scala:153)
    at org.apache.spark.sql.internal.BaseSessionStateBuilder.catalog(BaseSessionStateBuilder.scala:152)
    at org.apache.spark.sql.internal.BaseSessionStateBuilder.v2SessionCatalog$lzycompute(BaseSessionStateBuilder.scala:166)
    at org.apache.spark.sql.internal.BaseSessionStateBuilder.v2SessionCatalog(BaseSessionStateBuilder.scala:166)
    at org.apache.spark.sql.internal.BaseSessionStateBuilder.catalogManager$lzycompute(BaseSessionStateBuilder.scala:168)
    at org.apache.spark.sql.internal.BaseSessionStateBuilder.catalogManager(BaseSessionStateBuilder.scala:168)
    at org.apache.spark.sql.internal.BaseSessionStateBuilder$$anon$1.<init>(BaseSessionStateBuilder.scala:185)
    at org.apache.spark.sql.internal.BaseSessionStateBuilder.analyzer(BaseSessionStateBuilder.scala:185)
    at org.apache.spark.sql.internal.BaseSessionStateBuilder.$anonfun$build$2(BaseSessionStateBuilder.scala:373)
    at org.apache.spark.sql.internal.SessionState.analyzer$lzycompute(SessionState.scala:92)
    at org.apache.spark.sql.internal.SessionState.analyzer(SessionState.scala:92)
    at org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:76)
    at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
    at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$2(QueryExecution.scala:202)
    at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:529)
    at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:202)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827)
    at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:201)
    at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:76)
    at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:74)
    at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:66)
    at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:99)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827)
    at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97)
    at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:640)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827)
    at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:630)
    at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:671)
    ... 94 elided
  Caused by: java.lang.ClassNotFoundException: org.sparkproject.guava.cache.CacheBuilder
    at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
    ... 123 more

  scala> |
  scala> :quit (ReplSuite.scala:83)
Spark context available as 'sc' (master = local, app id = local-1686829054261).
Spark session available as 'spark'.
- SPARK-15236: use in-memory catalog
Spark context available as 'sc' (master = local, app id = local-1686829056083).
Spark session available as 'spark'.
- broadcast vars
Spark context available as 'sc' (master =
[jira] [Created] (SPARK-44069) maven test ReplSuite failed
Yang Jie created SPARK-44069: Summary: maven test ReplSuite failed Key: SPARK-44069 URL: https://issues.apache.org/jira/browse/SPARK-44069 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.5.0 Reporter: Yang Jie
{code:java}
./build/mvn -DskipTests -Pyarn -Pmesos -Pkubernetes -Pvolcano -Phive -Phive-thriftserver -Phadoop-cloud -Pspark-ganglia-lgpl clean install
build/mvn test -pl repl{code}
{code:java}
ReplSuite:
Spark context available as 'sc' (master = local, app id = local-1686829049116).
Spark session available as 'spark'.
- SPARK-15236: use Hive catalog *** FAILED ***
  isContain was true Interpreter output contained 'Exception':
  Welcome to
        ____              __
       / __/__  ___ _____/ /__
      _\ \/ _ \/ _ `/ __/  '_/
     /___/ .__/\_,_/_/ /_/\_\   version 3.5.0-SNAPSHOT
        /_/

  Using Scala version 2.12.17 (OpenJDK 64-Bit Server VM, Java 1.8.0_372)
  Type in expressions to have them evaluated.
  Type :help for more information.

  scala>
  scala> java.lang.NoClassDefFoundError: org/sparkproject/guava/cache/CacheBuilder
    at org.apache.spark.sql.catalyst.catalog.SessionCatalog.<init>(SessionCatalog.scala:197)
    at org.apache.spark.sql.internal.BaseSessionStateBuilder.catalog$lzycompute(BaseSessionStateBuilder.scala:153)
    at org.apache.spark.sql.internal.BaseSessionStateBuilder.catalog(BaseSessionStateBuilder.scala:152)
    at org.apache.spark.sql.internal.BaseSessionStateBuilder.v2SessionCatalog$lzycompute(BaseSessionStateBuilder.scala:166)
    at org.apache.spark.sql.internal.BaseSessionStateBuilder.v2SessionCatalog(BaseSessionStateBuilder.scala:166)
    at org.apache.spark.sql.internal.BaseSessionStateBuilder.catalogManager$lzycompute(BaseSessionStateBuilder.scala:168)
    at org.apache.spark.sql.internal.BaseSessionStateBuilder.catalogManager(BaseSessionStateBuilder.scala:168)
    at org.apache.spark.sql.internal.BaseSessionStateBuilder$$anon$1.<init>(BaseSessionStateBuilder.scala:185)
    at org.apache.spark.sql.internal.BaseSessionStateBuilder.analyzer(BaseSessionStateBuilder.scala:185)
    at org.apache.spark.sql.internal.BaseSessionStateBuilder.$anonfun$build$2(BaseSessionStateBuilder.scala:373)
    at org.apache.spark.sql.internal.SessionState.analyzer$lzycompute(SessionState.scala:92)
    at org.apache.spark.sql.internal.SessionState.analyzer(SessionState.scala:92)
    at org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:76)
    at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
    at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$2(QueryExecution.scala:202)
    at org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:529)
    at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:202)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827)
    at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:201)
    at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:76)
    at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:74)
    at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:66)
    at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:99)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827)
    at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97)
    at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:640)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:827)
    at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:630)
    at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:671)
    ... 94 elided
  Caused by: java.lang.ClassNotFoundException: org.sparkproject.guava.cache.CacheBuilder
    at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
    ... 123 more

  scala> |
  scala> :quit (ReplSuite.scala:83)
Spark context available as 'sc' (master = local, app id = local-1686829054261).
Spark session available as 'spark'.
- SPARK-15236: use in-memory catalog
Spark context available as 'sc' (master = local, app id = local-1686829056083).
Spark session available as
[jira] [Created] (SPARK-44068) Support positional parameters in Scala connect client
Max Gekk created SPARK-44068: Summary: Support positional parameters in Scala connect client Key: SPARK-44068 URL: https://issues.apache.org/jira/browse/SPARK-44068 Project: Spark Issue Type: New Feature Components: Connect, SQL Affects Versions: 3.5.0 Reporter: Max Gekk Implement positional parameters of parametrized queries in the Scala connect client. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
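A hedged sketch of the target usage; it assumes the two-argument sql(query, args) overload proposed in SPARK-44066 gets mirrored in the Connect Scala client, and the sc:// address is a placeholder:
{code:java}
import org.apache.spark.sql.SparkSession

object PositionalParamsConnectSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().remote("sc://localhost").getOrCreate()
    // Each ? is bound to the array element at the same position.
    val df = spark.sql("SELECT * FROM range(10) WHERE id > ? AND id < ?", Array(2, 8))
    df.show()
  }
}
{code}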
[jira] [Commented] (SPARK-43942) Add string functions to Scala and Python - part 1
[ https://issues.apache.org/jira/browse/SPARK-43942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17733028#comment-17733028 ] Hudson commented on SPARK-43942: User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/41561 > Add string functions to Scala and Python - part 1 > - > > Key: SPARK-43942 > URL: https://issues.apache.org/jira/browse/SPARK-43942 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark, SQL >Affects Versions: 3.5.0 >Reporter: Ruifeng Zheng >Priority: Major > > Add following functions: > * char > * btrim > * char_length > * character_length > * chr > * contains > * elt > * find_in_set > * like > * ilike > * lcase > * ucase > * len > * left > * right > to: > * Scala API > * Python API > * Spark Connect Scala Client > * Spark Connect Python Client -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
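A hedged sketch of how the new Scala wrappers would be used once they land; it assumes the functions are added to org.apache.spark.sql.functions under the names listed above:
{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{btrim, contains, lit, ucase}

object StringFunctionsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").getOrCreate()
    import spark.implicits._

    val df = Seq("  Spark  ").toDF("s")
    // btrim trims both ends, ucase upper-cases, contains tests for a substring.
    df.select(btrim($"s"), ucase($"s"), contains($"s", lit("par"))).show()
    spark.stop()
  }
}
{code}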
[jira] [Created] (SPARK-44067) Warning for the pandas-related behavior changes in next major release
Haejoon Lee created SPARK-44067: --- Summary: Warning for the pandas-related behavior changes in next major release Key: SPARK-44067 URL: https://issues.apache.org/jira/browse/SPARK-44067 Project: Spark Issue Type: Sub-task Components: Pandas API on Spark, PySpark Affects Versions: 3.5.0 Reporter: Haejoon Lee There will be many breaking changes in Spark 4.0.0, so we should warn users in advance. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44066) Support positional parameters in parameterized query
[ https://issues.apache.org/jira/browse/SPARK-44066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17732955#comment-17732955 ] ASF GitHub Bot commented on SPARK-44066: User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/41568 > Support positional parameters in parameterized query > > > Key: SPARK-44066 > URL: https://issues.apache.org/jira/browse/SPARK-44066 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.5.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > As a follow-up to the parameterized query we added recently, we’d like to > support positional parameters. This is part of the SQL standard and JDBC/ODBC > protocol. > Example: update COFFEES set TOTAL = TOTAL + ? where COF_NAME = ? > Note that positional and named parameter markers cannot be used in the same query. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43952) Cancel Spark jobs not only by a single "jobgroup", but allow multiple "job tags"
[ https://issues.apache.org/jira/browse/SPARK-43952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17732947#comment-17732947 ] ASF GitHub Bot commented on SPARK-43952: User 'juliuszsompolski' has created a pull request for this issue: https://github.com/apache/spark/pull/41440 > Cancel Spark jobs not only by a single "jobgroup", but allow multiple "job > tags" > > > Key: SPARK-43952 > URL: https://issues.apache.org/jira/browse/SPARK-43952 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 3.5.0 >Reporter: Juliusz Sompolski >Priority: Major > > Currently, the only way to cancel running Spark jobs is by using > SparkContext.cancelJobGroup, with a job group name that was previously set > using SparkContext.setJobGroup. This is problematic if multiple different > parts of the system want to do cancellation and set their own ids. > For example, > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/BroadcastExchangeExec.scala#L133] > sets its own job group, which may override the job group set by the user. This way, > if the user cancels the job group they set, it will not cancel the broadcast > jobs launched from within their jobs... > As a solution, consider adding SparkContext.addJobTag / > SparkContext.removeJobTag, which would allow multiple "tags" on the > jobs, and introduce SparkContext.cancelJobsByTag to allow more flexible > cancelling of jobs. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
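A sketch using the method names as proposed in the ticket (the merged API may end up with slightly different names):
{code:java}
import org.apache.spark.{SparkConf, SparkContext}

object JobTagSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("tags"))

    // Tag all jobs started from this thread; unlike a single job group,
    // several tags can be attached at once.
    sc.addJobTag("broadcast-exchange")
    sc.addJobTag("user-query-42")
    try {
      sc.parallelize(1 to 1000).map(_ * 2).count()
    } finally {
      sc.removeJobTag("broadcast-exchange")
      sc.removeJobTag("user-query-42")
    }

    // From any thread: cancel everything carrying one tag, without touching
    // jobs that only carry other tags (method name as proposed above).
    sc.cancelJobsByTag("user-query-42")
  }
}
{code}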
[jira] [Commented] (SPARK-38200) [SQL] Spark JDBC Savemode Supports Upsert
[ https://issues.apache.org/jira/browse/SPARK-38200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17732945#comment-17732945 ] Enrico Minack commented on SPARK-38200: --- Created pull request for this: https://github.com/apache/spark/pull/41611 > [SQL] Spark JDBC Savemode Supports Upsert > - > > Key: SPARK-38200 > URL: https://issues.apache.org/jira/browse/SPARK-38200 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: melin >Priority: Major > > Upsert SQL differs across databases; most databases support MERGE SQL: > sqlserver merge into sql : > [https://github.com/apache/incubator-seatunnel/blob/dev/seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/dialect/sqlserver/SqlServerDialect.java] > mysql: > [https://github.com/apache/incubator-seatunnel/blob/dev/seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/dialect/mysql/MysqlDialect.java] > oracle merge into sql : > [https://github.com/apache/incubator-seatunnel/blob/dev/seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/dialect/oracle/OracleDialect.java] > postgres: > [https://github.com/apache/incubator-seatunnel/blob/dev/seatunnel-connectors-v2/connector-jdbc/src/main/java/org/apache/seatunnel/connectors/seatunnel/jdbc/internal/dialect/psql/PostgresDialect.java] > postgres merge into sql : > [https://www.postgresql.org/docs/current/sql-merge.html] > db2 merge into sql : > [https://www.ibm.com/docs/en/db2-for-zos/12?topic=statements-merge] > derby merge into sql: > [https://db.apache.org/derby/docs/10.14/ref/rrefsqljmerge.html] > h2 merge into sql : > [https://www.tutorialspoint.com/h2_database/h2_database_merge.htm] > > [~yao] > > https://github.com/melin/datatunnel/tree/master/plugins/jdbc/src/main/scala/com/superior/datatunnel/plugin/jdbc/support/dialect > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
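To make the dialect differences concrete, here is a hypothetical example of the kind of statement an upsert save mode would have to emit for a MERGE-capable dialect. None of this is an existing Spark API; the table and column names follow the COFFEES example used elsewhere in this digest:
{code:java}
object UpsertSqlSketch {
  // Hypothetical MERGE text an upsert save mode could generate for, e.g.,
  // PostgreSQL 15+ or DB2; each dialect above needs its own variant.
  val upsertSql: String =
    """MERGE INTO coffees AS t
      |USING (VALUES (?, ?)) AS s (cof_name, total)
      |ON t.cof_name = s.cof_name
      |WHEN MATCHED THEN UPDATE SET total = s.total
      |WHEN NOT MATCHED THEN INSERT (cof_name, total) VALUES (s.cof_name, s.total)""".stripMargin
}
{code}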
[jira] [Commented] (SPARK-19335) Spark should support doing an efficient DataFrame Upsert via JDBC
[ https://issues.apache.org/jira/browse/SPARK-19335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17732943#comment-17732943 ] Enrico Minack commented on SPARK-19335: --- Created pull request for this: https://github.com/apache/spark/pull/41518 > Spark should support doing an efficient DataFrame Upsert via JDBC > - > > Key: SPARK-19335 > URL: https://issues.apache.org/jira/browse/SPARK-19335 > Project: Spark > Issue Type: Improvement >Reporter: Ilya Ganelin >Priority: Minor > > Doing a database update, as opposed to an insert, is useful, particularly when > working with streaming applications which may require revisions to previously > stored data. > Spark DataFrames/DataSets do not currently support an Update feature via the > JDBC Writer, allowing only Overwrite or Append. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-44052) Add util to get proper Column or DataFrame class for Spark Connect.
[ https://issues.apache.org/jira/browse/SPARK-44052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17732941#comment-17732941 ] Ignite TC Bot commented on SPARK-44052: --- User 'itholic' has created a pull request for this issue: https://github.com/apache/spark/pull/41570 > Add util to get proper Column or DataFrame class for Spark Connect. > --- > > Key: SPARK-44052 > URL: https://issues.apache.org/jira/browse/SPARK-44052 > Project: Spark > Issue Type: Sub-task > Components: Connect, Pandas API on Spark >Affects Versions: 3.5.0 >Reporter: Haejoon Lee >Priority: Major > > A lot of code is duplicated to get the proper PySparkColumn or > PySparkDataFrame, so it would be great to have a util function to > deduplicate it. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43291) Match behavior for DataFrame.cov on string DataFrame
[ https://issues.apache.org/jira/browse/SPARK-43291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17732932#comment-17732932 ] Haejoon Lee commented on SPARK-43291: - With the major release of pandas 2.0.0 on April 3, 2023, numerous breaking changes have been introduced. So, we have made the decision to postpone addressing these breaking changes until the next major release of Spark, version 4.0.0, to minimize disruptions for our users and provide a more seamless upgrade experience. The pandas 2.0.0 release includes a significant number of updates, such as API removals, changes in API behavior, parameter removals, parameter behavior changes, and bug fixes. We have planned the following approach for each item:
- {*}API Removals{*}: Removed APIs will remain deprecated in Spark 3.5.0, provide appropriate warnings, and will be removed in Spark 4.0.0.
- {*}API Behavior Changes{*}: APIs with changed behavior will retain the behavior in Spark 3.5.0, provide appropriate warnings, and will align the behavior with pandas in Spark 4.0.0.
- {*}Parameter Removals{*}: Removed parameters will remain deprecated in Spark 3.5.0, provide appropriate warnings, and will be removed in Spark 4.0.0.
- {*}Parameter Behavior Changes{*}: Parameters with changed behavior will retain the behavior in Spark 3.5.0, provide appropriate warnings, and will align the behavior with pandas in Spark 4.0.0.
- {*}Bug Fixes{*}: Bug fixes, mainly related to correctness issues, will be fixed in Spark 3.5.0.
*To recap, all breaking changes related to pandas 2.0.0 will be supported in Spark 4.0.0, and will remain deprecated with appropriate warnings in Spark 3.5.0.* Will submit a PR that deprecates all APIs and adds warnings very soon. > Match behavior for DataFrame.cov on string DataFrame > > > Key: SPARK-43291 > URL: https://issues.apache.org/jira/browse/SPARK-43291 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 3.5.0 >Reporter: Haejoon Lee >Priority: Major > > Should enable test below: > {code:java} > pdf = pd.DataFrame([("1", "2"), ("0", "3"), ("2", "0"), ("1", "1")], > columns=["a", "b"]) > psdf = ps.from_pandas(pdf) > self.assert_eq(pdf.cov(), psdf.cov()) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44066) Support positional parameters in parameterized query
Max Gekk created SPARK-44066: Summary: Support positional parameters in parameterized query Key: SPARK-44066 URL: https://issues.apache.org/jira/browse/SPARK-44066 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.5.0 Reporter: Max Gekk Assignee: Max Gekk As a follow-up to the parameterized query support we added recently, we’d like to support positional parameters. This is part of the SQL standard and the JDBC/ODBC protocol. Example: update COFFEES set TOTAL = TOTAL + ? where COF_NAME = ? Note that positional and named parameter markers cannot be used in the same query. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
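A hedged sketch of the intended usage, mirroring the ticket's own example; the sql(query, args: Array[_]) overload where the i-th ? binds to args(i) is an assumption of this sketch, not a shipped API at the time of writing:
{code:java}
import org.apache.spark.sql.SparkSession

object PositionalParamsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").getOrCreate()
    // The ticket's JDBC-style example, bound positionally. Note that UPDATE
    // also needs a table format that supports it; the binding is the point here.
    spark.sql("UPDATE COFFEES SET TOTAL = TOTAL + ? WHERE COF_NAME = ?",
      Array(10, "Espresso"))
    spark.stop()
  }
}
{code}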
[jira] [Created] (SPARK-44065) Optimize BroadcastHashJoin skew when localShuffleReader is disabled
Zhen Wang created SPARK-44065: - Summary: Optimize BroadcastHashJoin skew when localShuffleReader is disabled Key: SPARK-44065 URL: https://issues.apache.org/jira/browse/SPARK-44065 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.5.0 Reporter: Zhen Wang With remote shuffle services such as Uniffle and Celeborn, it is recommended to disable localShuffleReader by default for better performance. However, this can leave BroadcastHashJoin stages skewed, so I want OptimizeSkewedJoin to also handle BroadcastHashJoin skew when localShuffleReader is disabled. Refer to: https://github.com/apache/incubator-celeborn#spark-configuration https://github.com/apache/incubator-uniffle/blob/master/docs/client_guide.md#support-spark-aqe -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
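For reference, the configuration combination this ticket targets. These AQE conf keys exist in current Spark; the values shown are what the linked remote-shuffle-service guides recommend:
{code:java}
import org.apache.spark.sql.SparkSession

// AQE enabled and skew-join handling on, but the local shuffle reader turned
// off, as the Celeborn/Uniffle client guides recommend.
val spark = SparkSession.builder()
  .master("local[2]")
  .config("spark.sql.adaptive.enabled", "true")
  .config("spark.sql.adaptive.skewJoin.enabled", "true")
  .config("spark.sql.adaptive.localShuffleReader.enabled", "false")
  .getOrCreate()
{code}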
[jira] [Resolved] (SPARK-44031) Upgrade silencer to 1.7.13
[ https://issues.apache.org/jira/browse/SPARK-44031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-44031. - Fix Version/s: 3.5.0 Assignee: Dongjoon Hyun Resolution: Fixed > Upgrade silencer to 1.7.13 > -- > > Key: SPARK-44031 > URL: https://issues.apache.org/jira/browse/SPARK-44031 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43627) Enable pyspark.pandas.spark.functions.skew in Spark Connect.
[ https://issues.apache.org/jira/browse/SPARK-43627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-43627. --- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 41604 [https://github.com/apache/spark/pull/41604] > Enable pyspark.pandas.spark.functions.skew in Spark Connect. > > > Key: SPARK-43627 > URL: https://issues.apache.org/jira/browse/SPARK-43627 > Project: Spark > Issue Type: Sub-task > Components: Connect, Pandas API on Spark >Affects Versions: 3.5.0 >Reporter: Haejoon Lee >Assignee: Ruifeng Zheng >Priority: Major > Fix For: 3.5.0 > > > Enable pyspark.pandas.spark.functions.skew in Spark Connect. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-43626) Enable pyspark.pandas.spark.functions.kurt in Spark Connect.
[ https://issues.apache.org/jira/browse/SPARK-43626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-43626. --- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 41604 [https://github.com/apache/spark/pull/41604] > Enable pyspark.pandas.spark.functions.kurt in Spark Connect. > > > Key: SPARK-43626 > URL: https://issues.apache.org/jira/browse/SPARK-43626 > Project: Spark > Issue Type: Sub-task > Components: Connect, Pandas API on Spark >Affects Versions: 3.5.0 >Reporter: Haejoon Lee >Assignee: Ruifeng Zheng >Priority: Major > Fix For: 3.5.0 > > > Enable pyspark.pandas.spark.functions.kurt in Spark Connect. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org