[jira] [Assigned] (SPARK-42418) Updating PySpark documentation to support new users better
[ https://issues.apache.org/jira/browse/SPARK-42418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-42418:
------------------------------------

    Assignee: Apache Spark

> Updating PySpark documentation to support new users better
> ----------------------------------------------------------
>
>                 Key: SPARK-42418
>                 URL: https://issues.apache.org/jira/browse/SPARK-42418
>             Project: Spark
>          Issue Type: Documentation
>          Components: PySpark
>    Affects Versions: 3.4.0
>            Reporter: Allan Folting
>            Assignee: Apache Spark
>            Priority: Major
>
> This is the first of a series of updates to the PySpark documentation site to
> better guide new users on what to use and when, as well as to improve the
> discoverability of related pages/resources:
> * Add "Overview" to the top navigation bar to make it easy to get back to
>   the main page (clicking the logo is not very discoverable)
> * Break the architecture image into separate, clickable parts for easy
>   navigation to information about each part
> * Add links to related topics under each area description
> * Add the date and version to the page

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42418) Updating PySpark documentation to support new users better
[ https://issues.apache.org/jira/browse/SPARK-42418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-42418:
------------------------------------

    Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-42418) Updating PySpark documentation to support new users better
[ https://issues.apache.org/jira/browse/SPARK-42418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687775#comment-17687775 ]

Apache Spark commented on SPARK-42418:
--------------------------------------

User 'allanf-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/39992
[jira] [Commented] (SPARK-42193) dataframe API filter criteria throwing ParseException when reading a JDBC column name with special characters
[ https://issues.apache.org/jira/browse/SPARK-42193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687770#comment-17687770 ]

Dongjoon Hyun commented on SPARK-42193:
---------------------------------------

+1 for [~maxgekk]'s assessment.

> dataframe API filter criteria throwing ParseException when reading a JDBC
> column name with special characters
> --------------------------------------------------------------------------
>
>                 Key: SPARK-42193
>                 URL: https://issues.apache.org/jira/browse/SPARK-42193
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Shanmugavel Kuttiyandi Chandrakasu
>            Priority: Minor
>
> *On Spark 3.3.0,* when reading from a JDBC table (SQLite was used to repro)
> with the spark.read.jdbc command and sqlite-jdbc:3.34.0.jar, on a table and
> column name containing special characters, the DataFrame API filter criteria
> fails with a ParseException.
> *Script:*
> {code:java}
> from pyspark.sql import SparkSession
> spark = SparkSession \
>     .builder \
>     .appName("Databricks Support") \
>     .config("spark.jars.packages", "org.xerial:sqlite-jdbc:3.34.0") \
>     .getOrCreate()
> columns = ["id", "/abc/column", "value"]
> data = [(1, 'A', 100), (2, 'B', 200), (3, 'B', 300)]
> rdd = spark.sparkContext.parallelize(data)
> df = spark.createDataFrame(rdd).toDF(*columns)
> options = {"url": "jdbc:sqlite://spark-3.3.1-bin-hadoop3/jars/test.db",
>            "dbtable": '"/abc/table"', "driver": "org.sqlite.JDBC"}
> df.coalesce(1).write.format("jdbc").options(**options).mode("append").save()
> df_1 = spark.read.format("jdbc") \
>     .option("url", "jdbc:sqlite://spark-3.3.1-bin-hadoop3/jars/test.db") \
>     .option("dbtable", '"/abc/table"') \
>     .option("driver", "org.sqlite.JDBC") \
>     .load()
> df_2 = df_1.filter("`/abc/column` = 'B'")
> df_2.show() {code}
> Error:
> {code:java}
> Traceback (most recent call last):
>   File "", line 1, in
>   File "/opt/homebrew/Cellar/apache-spark/3.3.1/libexec/python/pyspark/sql/dataframe.py", line 606, in show
>     print(self._jdf.showString(n, 20, vertical))
>   File "/opt/homebrew/Cellar/apache-spark/3.3.1/libexec/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
>   File "/opt/homebrew/Cellar/apache-spark/3.3.1/libexec/python/pyspark/sql/utils.py", line 196, in deco
>     raise converted from None
> pyspark.sql.utils.ParseException:
> Syntax error at or near '/': extra input '/'(line 1, pos 0)
> == SQL ==
> /abc/column
> ^^^ {code}
> However, when using Spark 3.2.1, we are able to successfully apply the
> DataFrame filter:
> {code:java}
> >>> df_2.show()
> +---+-----------+-----+
> | id|/abc/column|value|
> +---+-----------+-----+
> |  2|          B|  200|
> |  3|          B|  300|
> +---+-----------+-----+ {code}
> *Repro steps:*
> # Download [Spark 3.2.1 locally|https://archive.apache.org/dist/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz]
> # Download and copy sqlite-jdbc:3.34.0.jar into the jars folder of the local Spark download
> # Run the above script, providing the jar path
> # This creates */abc/table* with column */abc/column* and returns a result when the filter criteria is applied
> # Download [Spark 3.3.0 locally|https://www.apache.org/dyn/closer.lua/spark/spark-3.3.1/spark-3.3.1-bin-hadoop3.tgz]
> # Repeat steps 2 and 3
> # This fails with a ParseException.
> Could you please let us know how we can filter on the special-character
> column, or escape it, on Spark 3.3.0?
[jira] [Updated] (SPARK-42193) dataframe API filter criteria throwing ParseException when reading a JDBC column name with special characters
[ https://issues.apache.org/jira/browse/SPARK-42193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-42193:
---------------------------------
    Component/s: SQL
                 (was: Spark Core)
[jira] [Resolved] (SPARK-42193) dataframe API filter criteria throwing ParseException when reading a JDBC column name with special characters
[ https://issues.apache.org/jira/browse/SPARK-42193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-42193.
----------------------------------
    Resolution: Cannot Reproduce
[jira] [Commented] (SPARK-42227) Use approx_percentile function running slower in spark3 than spark2
[ https://issues.apache.org/jira/browse/SPARK-42227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687765#comment-17687765 ]

Hyukjin Kwon commented on SPARK-42227:
--------------------------------------

How much faster is it?

> Use approx_percentile function running slower in spark3 than spark2
> -------------------------------------------------------------------
>
>                 Key: SPARK-42227
>                 URL: https://issues.apache.org/jira/browse/SPARK-42227
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.2.1
>            Reporter: xuanzhiang
>            Priority: Major
>
> approx_percentile(end_ts - start_ts, 0.9) AS cost_p90
> In Spark 3 this uses the objectHashAggregate method, but the shuffle is very
> slow. When I use percentile instead, it becomes fast. I don't know the
> reason; I would expect approx_percentile to be faster.
[jira] [Commented] (SPARK-42293) why executor memory used is shown greater than total available memory on spark ui
[ https://issues.apache.org/jira/browse/SPARK-42293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687763#comment-17687763 ]

Hyukjin Kwon commented on SPARK-42293:
--------------------------------------

[~handong] mind sharing a reproducer if you have one?

> why executor memory used is shown greater than total available memory on
> spark ui
> -------------------------------------------------------------------------
>
>                 Key: SPARK-42293
>                 URL: https://issues.apache.org/jira/browse/SPARK-42293
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.5
>            Reporter: handong
>            Priority: Major
>
> *I have a Spark Streaming job that has been running for around the last 3
> weeks. When I open the Executors tab on the Spark web UI, it shows:*
> # memory used - 36.1 GB
> # total available memory for storage - 3.2 GB
> *Please refer to the screenshot of the Spark UI below:*
> !https://i.stack.imgur.com/nmk39.jpg!
[jira] [Commented] (SPARK-42419) Migrate `TypeError` into error framework for Spark Connect column API.
[ https://issues.apache.org/jira/browse/SPARK-42419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687764#comment-17687764 ]

Apache Spark commented on SPARK-42419:
--------------------------------------

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/39991

> Migrate `TypeError` into error framework for Spark Connect column API.
> ----------------------------------------------------------------------
>
>                 Key: SPARK-42419
>                 URL: https://issues.apache.org/jira/browse/SPARK-42419
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect, PySpark
>    Affects Versions: 3.4.0
>            Reporter: Haejoon Lee
>            Priority: Major
>
> We should migrate all errors into the PySpark error framework.
[jira] [Assigned] (SPARK-42419) Migrate `TypeError` into error framework for Spark Connect column API.
[ https://issues.apache.org/jira/browse/SPARK-42419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-42419:
------------------------------------

    Assignee: Apache Spark
[jira] [Commented] (SPARK-42419) Migrate `TypeError` into error framework for Spark Connect column API.
[ https://issues.apache.org/jira/browse/SPARK-42419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687762#comment-17687762 ]

Apache Spark commented on SPARK-42419:
--------------------------------------

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/39991
[jira] [Assigned] (SPARK-42419) Migrate `TypeError` into error framework for Spark Connect column API.
[ https://issues.apache.org/jira/browse/SPARK-42419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-42419:
------------------------------------

    Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-42387) Avoid unnecessary parquet footer reads when no filters
[ https://issues.apache.org/jira/browse/SPARK-42387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687760#comment-17687760 ]

Hyukjin Kwon commented on SPARK-42387:
--------------------------------------

[~miracle] mind filling in the JIRA description, please?

> Avoid unnecessary parquet footer reads when no filters
> ------------------------------------------------------
>
>                 Key: SPARK-42387
>                 URL: https://issues.apache.org/jira/browse/SPARK-42387
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: Mars
>            Priority: Major
[jira] [Commented] (SPARK-42397) Inconsistent data produced by `FlatMapCoGroupsInPandas`
[ https://issues.apache.org/jira/browse/SPARK-42397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687759#comment-17687759 ]

Hyukjin Kwon commented on SPARK-42397:
--------------------------------------

It's probably related to ordering, which Spark doesn't guarantee. Are the
actual values different?

> Inconsistent data produced by `FlatMapCoGroupsInPandas`
> -------------------------------------------------------
>
>                 Key: SPARK-42397
>                 URL: https://issues.apache.org/jira/browse/SPARK-42397
>             Project: Spark
>          Issue Type: Bug
>          Components: Pandas API on Spark, SQL
>    Affects Versions: 3.3.0, 3.3.1
>            Reporter: Ted Chester Jenks
>            Priority: Minor
>
> We are seeing inconsistent data returned when using
> `FlatMapCoGroupsInPandas`. In the PySpark example from the comments, when we
> call `grouped_df.collect()` we get:
>
> {{[Row(left_colms="Index(['cluster', 'event', 'abc'], dtype='object')", right_colms="Index(['cluster', 'event', 'def'], dtype='object')")]}}
>
> When we call `grouped_df.show(5, truncate=False)` we get:
>
> {{[Row(left_colms="Index(['cluster', 'abc'], dtype='object')", right_colms="Index(['cluster', 'event', 'def'], dtype='object')", xyz='1234')]}}
>
> When we call `grouped_df_1.collect()` we get:
>
> {{[Row(left_colms="Index(['cluster', 'abc'], dtype='object')", right_colms="Index(['cluster', 'event', 'def'], dtype='object')", xyz='1234')]}}
[jira] [Updated] (SPARK-42399) CONV() silently overflows returning wrong results
[ https://issues.apache.org/jira/browse/SPARK-42399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-42399:
---------------------------------
    Component/s: SQL
                 (was: Spark Core)

> CONV() silently overflows returning wrong results
> -------------------------------------------------
>
>                 Key: SPARK-42399
>                 URL: https://issues.apache.org/jira/browse/SPARK-42399
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: Serge Rielau
>            Priority: Critical
>
> {code}
> spark-sql> SELECT CONV(SUBSTRING('0x', 3), 16, 10);
> 18446744073709551615
> Time taken: 2.114 seconds, Fetched 1 row(s)
> spark-sql> set spark.sql.ansi.enabled = true;
> spark.sql.ansi.enabled true
> Time taken: 0.068 seconds, Fetched 1 row(s)
> spark-sql> SELECT CONV(SUBSTRING('0x', 3), 16, 10);
> 18446744073709551615
> Time taken: 0.05 seconds, Fetched 1 row(s)
> {code}
> In ANSI mode we should raise an error for sure.
> In non-ANSI mode either an error or a NULL may be acceptable.
> Alternatively, of course, we could consider whether we can support arbitrary
> domains, since the result is a STRING again.
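The value in the report is exactly the unsigned 64-bit maximum, which is consistent with CONV doing its arithmetic in an unsigned 64-bit integer and appearing to saturate on overflow. A minimal plain-Python sketch of that arithmetic (an illustration only, not Spark's implementation; `saturate_u64` is a hypothetical helper):

```python
# CONV appears to work in an unsigned 64-bit domain, so any hex input wider
# than 16 digits no longer fits and the result pins at 2**64 - 1.
UINT64_MAX = 2**64 - 1

# The value CONV returned in the report above:
assert UINT64_MAX == 18446744073709551615

# A 17-digit hex string of F's exceeds the unsigned 64-bit range:
seventeen_fs = int("F" * 17, 16)
assert seventeen_fs > UINT64_MAX

# Saturating (rather than raising or returning NULL) silently maps every
# overflowing input to the same maximum value -- hence "wrong results".
def saturate_u64(n: int) -> int:
    return min(n, UINT64_MAX)

assert saturate_u64(seventeen_fs) == 18446744073709551615
```

This is why the ticket argues for an error in ANSI mode: saturation loses information with no indication to the caller.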
[jira] [Assigned] (SPARK-42401) Incorrect results or NPE when inserting null value into array using array_insert/array_append
[ https://issues.apache.org/jira/browse/SPARK-42401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-42401:
------------------------------------

    Assignee: Bruce Robbins

> Incorrect results or NPE when inserting null value into array using
> array_insert/array_append
> ---------------------------------------------------------------------
>
>                 Key: SPARK-42401
>                 URL: https://issues.apache.org/jira/browse/SPARK-42401
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.4.0, 3.5.0
>            Reporter: Bruce Robbins
>            Assignee: Bruce Robbins
>            Priority: Major
>              Labels: correctness
>
> Example:
> {noformat}
> create or replace temp view v1 as
> select * from values
> (array(1, 2, 3, 4), 5, 5),
> (array(1, 2, 3, 4), 5, null)
> as v1(col1,col2,col3);
> select array_insert(col1, col2, col3) from v1;
> {noformat}
> This produces an incorrect result:
> {noformat}
> [1,2,3,4,5]
> [1,2,3,4,0] <== should be [1,2,3,4,null]
> {noformat}
> A more succinct example:
> {noformat}
> select array_insert(array(1, 2, 3, 4), 5, cast(null as int));
> {noformat}
> This also produces an incorrect result:
> {noformat}
> [1,2,3,4,0] <== should be [1,2,3,4,null]
> {noformat}
> Another example:
> {noformat}
> create or replace temp view v1 as
> select * from values
> (array('1', '2', '3', '4'), 5, '5'),
> (array('1', '2', '3', '4'), 5, null)
> as v1(col1,col2,col3);
> select array_insert(col1, col2, col3) from v1;
> {noformat}
> The above query throws a {{NullPointerException}}:
> {noformat}
> 23/02/10 11:08:05 ERROR SparkSQLDriver: Failed in [select array_insert(col1, col2, col3) from v1]
> java.lang.NullPointerException
>     at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
>     at org.apache.spark.sql.execution.LocalTableScanExec.$anonfun$unsafeRows$1(LocalTableScanExec.scala:44)
> {noformat}
> {{array_append}} has the same issue:
> {noformat}
> spark-sql> select array_append(array(1, 2, 3, 4), cast(null as int));
> [1,2,3,4,0] <== should be [1,2,3,4,null]
> Time taken: 3.679 seconds, Fetched 1 row(s)
> spark-sql> select array_append(array('1', '2', '3', '4'), cast(null as string));
> 23/02/10 11:13:36 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.NullPointerException
>     at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.project_doConsume_0$(Unknown Source)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
> {noformat}
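The NULL-preserving behavior the ticket expects can be sketched in plain Python (an illustration of the intended semantics only, not Spark's code; `None` stands in for SQL NULL, and negative-position handling is omitted):

```python
# Sketch of what array_insert should do with a null value: keep the null,
# never coerce it to the element type's zero value. Positions are 1-based,
# as in Spark's array_insert.
def array_insert(arr, pos, value):
    out = list(arr)
    out.insert(pos - 1, value)  # None (NULL) is inserted as-is
    return out

assert array_insert([1, 2, 3, 4], 5, 5) == [1, 2, 3, 4, 5]
# The buggy build returned [1, 2, 3, 4, 0] for this case:
assert array_insert([1, 2, 3, 4], 5, None) == [1, 2, 3, 4, None]
```

The int case in the bug report produced a zero because the null flag was dropped before the element was written, so the writer emitted the type's default value instead of a null entry; the string case hit an NPE for the analogous reason.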
[jira] [Resolved] (SPARK-42401) Incorrect results or NPE when inserting null value into array using array_insert/array_append
[ https://issues.apache.org/jira/browse/SPARK-42401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-42401.
----------------------------------
    Fix Version/s: 3.4.0
       Resolution: Fixed

Issue resolved by pull request 39970
[https://github.com/apache/spark/pull/39970]
[jira] [Created] (SPARK-42419) Migrate `TypeError` into error framework for Spark Connect column API.
Haejoon Lee created SPARK-42419:
-----------------------------------

             Summary: Migrate `TypeError` into error framework for Spark Connect column API.
                 Key: SPARK-42419
                 URL: https://issues.apache.org/jira/browse/SPARK-42419
             Project: Spark
          Issue Type: Sub-task
          Components: Connect, PySpark
    Affects Versions: 3.4.0
            Reporter: Haejoon Lee

We should migrate all errors into the PySpark error framework.
[jira] [Commented] (SPARK-42258) pyspark.sql.functions should not expose typing.cast
[ https://issues.apache.org/jira/browse/SPARK-42258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687757#comment-17687757 ]

Hyukjin Kwon commented on SPARK-42258:
--------------------------------------

Good point. Are you interested in submitting a PR?

> pyspark.sql.functions should not expose typing.cast
> ---------------------------------------------------
>
>                 Key: SPARK-42258
>                 URL: https://issues.apache.org/jira/browse/SPARK-42258
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 3.3.1
>            Reporter: Furcy Pin
>            Priority: Minor
>
> In PySpark, the `pyspark.sql.functions` module imports and exposes the
> function `typing.cast`. This may lead to errors from users that can be hard
> to spot.
> *Example*
> It took me a few minutes to understand why the following code:
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql import functions as f
> spark = SparkSession.builder.getOrCreate()
> df = spark.sql("""SELECT 1 as a""")
> df.withColumn("a", f.cast("STRING", f.col("a"))).printSchema() {code}
> which executes without any problem, gives the following result:
> {code:java}
> root
>  |-- a: integer (nullable = false){code}
> This is because `f.cast` here calls `typing.cast`, and the correct syntax is:
> {code:java}
> df.withColumn("a", f.col("a").cast("STRING")).printSchema(){code}
> which indeed gives:
> {code:java}
> root
>  |-- a: string (nullable = false) {code}
> *Suggested solutions*
> Option 1: the names imported in the `pyspark.sql.functions` module could be
> obfuscated to prevent this. For instance:
> {code:java}
> from typing import cast as _cast{code}
> Option 2: only import `typing` and replace all occurrences of `cast` with
> `typing.cast`
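The silent failure mode above comes from `typing.cast` being a runtime no-op, which a short plain-Python sketch makes explicit:

```python
# typing.cast(typ, val) exists only for static type checkers: at runtime it
# simply returns val unchanged, with no check and no conversion. So the
# accidental f.cast("STRING", col) call "succeeds" while doing nothing.
from typing import cast

result = cast(str, 123)      # looks like a type conversion...
assert result == 123         # ...but the value is returned untouched
assert type(result) is int   # still an int, despite "cast to str"
```

This is why the wrong call in the report executed without error yet left the column as an integer: the re-exported `cast` never touches the Column at all.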
[jira] [Updated] (SPARK-42407) `with as` executed again
[ https://issues.apache.org/jira/browse/SPARK-42407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-42407: - Priority: Major (was: Critical) > `with as` executed again > > > Key: SPARK-42407 > URL: https://issues.apache.org/jira/browse/SPARK-42407 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.3 >Reporter: yiku123 >Priority: Major > > When 'with as' is used multiple times, it will be executed again each time > without saving the results of 'with as', resulting in low efficiency. > Will you consider improving the behavior of 'with as'? > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42193) dataframe API filter criteria throwing ParseException when reading a JDBC column name with special characters
[ https://issues.apache.org/jira/browse/SPARK-42193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687756#comment-17687756 ] Max Gekk commented on SPARK-42193: -- I haven't reproduced the issue on the recent master. Seems like it has been already fixed by [~huaxingao] in https://issues.apache.org/jira/browse/SPARK-41990 also cc [~dongjoon] > dataframe API filter criteria throwing ParseException when reading a JDBC > column name with special characters > - > > Key: SPARK-42193 > URL: https://issues.apache.org/jira/browse/SPARK-42193 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Shanmugavel Kuttiyandi Chandrakasu >Priority: Minor > > *On Spark 3.3.0,* when reading from a JDBC table(used SQLite to repro) using > spark.read.jdbc command with sqlite-jdbc:3.34.0.jar on a table and column > name containing special characters. Dataframe API filter criteria fails with > parse Exception > *[#Script:]* > {code:java} > from pyspark.sql import SparkSession > spark = SparkSession \ > .builder \ > .appName("Databricks Support") \ > .config("spark.jars.packages", "org.xerial:sqlite-jdbc:3.34.0") \ > .getOrCreate() > columns = ["id", "/abc/column", "value"] > data = [(1, 'A', 100), (2, 'B', 200), (3, 'B', 300)] > rdd = spark.sparkContext.parallelize(data) > df = spark.createDataFrame(rdd).toDF(*columns) > options = {"url": > "jdbc:sqlite://spark-3.3.1-bin-hadoop3/jars/test.db", "dbtable": > '"/abc/table"', "driver": "org.sqlite.JDBC"} > df.coalesce(1).write.format("jdbc").options(**options).mode("append").save() > df_1 = spark.read.format("jdbc") \ > .option("url", > "jdbc:sqlite://spark-3.3.1-bin-hadoop3/jars/test.db") \ > .option("dbtable", '"/abc/table"') \ > .option("driver", "org.sqlite.JDBC") \ > .load() > df_2 = df_1.filter("`/abc/column` = 'B'") > df_2.show() {code} > Error: > {code:java} > ``` Traceback (most recent call last): > File "", line 1, in > File > 
"/opt/homebrew/Cellar/apache-spark/3.3.1/libexec/python/pyspark/sql/dataframe.py", > line 606, in show > print(self._jdf.showString(n, 20, vertical)) > File > "/opt/homebrew/Cellar/apache-spark/3.3.1/libexec/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", > line 1321, in __call__ > File > "/opt/homebrew/Cellar/apache-spark/3.3.1/libexec/python/pyspark/sql/utils.py", > line 196, in deco > raise converted from None > pyspark.sql.utils.ParseException: > Syntax error at or near '/': extra input '/'(line 1, pos 0) > == SQL == > /abc/column > ^^^``` {code} > However, when using Spark 3.2.1, we are able to successfully apply > dataframe.filter option > {code:java} > >>> df_2.show() > +---+---+-+ > | id|/abc/column|value| > +---+---+-+ > | 2| B| 200| > | 3| B| 300| > +---+---+-+ {code} > *Repro steps:* > # Download [Spark 3.2.1 in local > |https://archive.apache.org/dist/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz] > # Download and Copy the sqlite-jdbc:3.34.0.jar into the jar folder present > in the local spark download folder > # Run the above [#script] by providing the jar path > # This will create a */abc/table* with column */abc/column* and returns > result when applying filter criteria > # Download spark ** [3.3.0 in > local|https://www.apache.org/dyn/closer.lua/spark/spark-3.3.1/spark-3.3.1-bin-hadoop3.tgz] > # Repeat #2, #3 > # Fails with parse exception. > could you please let us know how we can filter on the special characters > column or escape them on spark version 3.3.0? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
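SQLite itself has no trouble with identifiers containing slashes as long as they are double-quoted, which is what Spark's JDBC pushdown must emit for `/abc/column`. The snippet below uses Python's stdlib `sqlite3` rather than Spark purely to confirm the database-side behavior; the table and column names mirror the repro script above.

```python
import sqlite3

# SQLite accepts identifiers with special characters when they are
# double-quoted, so the failure in the report is on the Spark side
# (its SQL parser rejecting the unquoted `/abc/column`), not SQLite's.
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE "/abc/table" (id INTEGER, "/abc/column" TEXT, value INTEGER)'
)
conn.executemany(
    'INSERT INTO "/abc/table" VALUES (?, ?, ?)',
    [(1, "A", 100), (2, "B", 200), (3, "B", 300)],
)
rows = conn.execute(
    'SELECT id, value FROM "/abc/table" WHERE "/abc/column" = ? ORDER BY id',
    ("B",),
).fetchall()
# rows is [(2, 200), (3, 300)], matching the Spark 3.2.1 filter result
```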
[jira] [Assigned] (SPARK-42417) Upgrade `netty` to version 4.1.88.Final
[ https://issues.apache.org/jira/browse/SPARK-42417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42417: Assignee: (was: Apache Spark) > Upgrade `netty` to version 4.1.88.Final > > > Key: SPARK-42417 > URL: https://issues.apache.org/jira/browse/SPARK-42417 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42415) The built-in dialects support OFFSET and paging query.
[ https://issues.apache.org/jira/browse/SPARK-42415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42415: Assignee: Apache Spark > The built-in dialects support OFFSET and paging query. > -- > > Key: SPARK-42415 > URL: https://issues.apache.org/jira/browse/SPARK-42415 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: jiaan.geng >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42415) The built-in dialects support OFFSET and paging query.
[ https://issues.apache.org/jira/browse/SPARK-42415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687753#comment-17687753 ] Apache Spark commented on SPARK-42415: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/39990 > The built-in dialects support OFFSET and paging query. > -- > > Key: SPARK-42415 > URL: https://issues.apache.org/jira/browse/SPARK-42415 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: jiaan.geng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42415) The built-in dialects support OFFSET and paging query.
[ https://issues.apache.org/jira/browse/SPARK-42415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42415: Assignee: (was: Apache Spark) > The built-in dialects support OFFSET and paging query. > -- > > Key: SPARK-42415 > URL: https://issues.apache.org/jira/browse/SPARK-42415 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: jiaan.geng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42417) Upgrade `netty` to version 4.1.88.Final
[ https://issues.apache.org/jira/browse/SPARK-42417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687754#comment-17687754 ] Apache Spark commented on SPARK-42417: -- User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/39989 > Upgrade `netty` to version 4.1.88.Final > > > Key: SPARK-42417 > URL: https://issues.apache.org/jira/browse/SPARK-42417 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42417) Upgrade `netty` to version 4.1.88.Final
[ https://issues.apache.org/jira/browse/SPARK-42417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42417: Assignee: Apache Spark > Upgrade `netty` to version 4.1.88.Final > > > Key: SPARK-42417 > URL: https://issues.apache.org/jira/browse/SPARK-42417 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42418) Updating PySpark documentation to support new users better
Allan Folting created SPARK-42418: - Summary: Updating PySpark documentation to support new users better Key: SPARK-42418 URL: https://issues.apache.org/jira/browse/SPARK-42418 Project: Spark Issue Type: Documentation Components: PySpark Affects Versions: 3.4.0 Reporter: Allan Folting This is the first of a series of updates to the PySpark documentation site to better guide new users on what to use and when as well as help improve discoverability of related pages/resources. * Add "Overview" to the top navigation bar to make it easy to get back to the main page (clicking the logo is not super discoverable) * Break architecture image into separate, clickable parts for easy navigation to information for each part * Added links to related topics under each area description * Added date and version to the page -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42416) Dataset operations should not resolve the analyzed logical plan again
[ https://issues.apache.org/jira/browse/SPARK-42416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-42416: --- Summary: Dateset operations should not resolve the analyzed logical plan again (was: Dateset.show() should not resolve the analyzed logical plan again) > Dateset operations should not resolve the analyzed logical plan again > - > > Key: SPARK-42416 > URL: https://issues.apache.org/jira/browse/SPARK-42416 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > For the following query > > {code:java} > sql( > """ > |CREATE TABLE app_open ( > | uid STRING, > | st TIMESTAMP, > | ds INT > |) USING parquet PARTITIONED BY (ds); > |""".stripMargin) > sql( > """ > |create or replace temporary view group_by_error as WITH > new_app_open AS ( > | SELECT > |ao.* > | FROM > |app_open ao > |) > |SELECT > |uid, > |20230208 AS ds > | FROM > |new_app_open > | GROUP BY > |1, > |2 > |""".stripMargin) > sql( > """ > |select > | `uid` > |from > | group_by_error > |""".stripMargin).show(){code} > Spark will throw the following error > > > {code:java} > [GROUP_BY_POS_OUT_OF_RANGE] GROUP BY position 20230208 is not in select list > (valid range is [1, 2]).; line 9 pos 4 {code} > > > This is because the logical plan is not set as analyzed and it is analyzed > again. The analyzer rules about aggregation/sort ordinals are not idempotent. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
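The non-idempotence described above can be sketched without Spark: a GROUP BY ordinal rule substitutes select-list expressions for integer positions, so running it a second time over an already-analyzed plan misreads the substituted literal `20230208` as an ordinal and bounds-checks it. This is an illustrative miniature, not Spark's analyzer code.

```python
# Miniature of the non-idempotent GROUP BY ordinal rule described above.
def resolve_ordinals(select_list, group_by):
    """Replace 1-based integer ordinals with select-list expressions."""
    resolved = []
    for item in group_by:
        if isinstance(item, int):  # integers are treated as ordinals
            if not 1 <= item <= len(select_list):
                raise ValueError(
                    f"[GROUP_BY_POS_OUT_OF_RANGE] GROUP BY position {item} is "
                    f"not in select list (valid range is [1, {len(select_list)}])"
                )
            resolved.append(select_list[item - 1])
        else:
            resolved.append(item)
    return resolved

select_list = ["uid", 20230208]              # SELECT uid, 20230208 AS ds
once = resolve_ordinals(select_list, [1, 2])  # first pass: ["uid", 20230208]

# Re-running the rule on the already-resolved plan reproduces the bug:
# the literal 20230208 is mistaken for an ordinal and rejected.
second_pass_error = None
try:
    resolve_ordinals(select_list, once)
except ValueError as err:
    second_pass_error = str(err)
```

Marking the plan as analyzed, as the fix does, prevents the second pass from ever running.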
[jira] [Created] (SPARK-42417) Upgrade `netty` to version 4.1.88.Final
BingKun Pan created SPARK-42417: --- Summary: Upgrade `netty` to version 4.1.88.Final Key: SPARK-42417 URL: https://issues.apache.org/jira/browse/SPARK-42417 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.5.0 Reporter: BingKun Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42416) Dataset.show() should not resolve the analyzed logical plan again
[ https://issues.apache.org/jira/browse/SPARK-42416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42416: Assignee: Gengliang Wang (was: Apache Spark) > Dateset.show() should not resolve the analyzed logical plan again > - > > Key: SPARK-42416 > URL: https://issues.apache.org/jira/browse/SPARK-42416 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > For the following query > > {code:java} > sql( > """ > |CREATE TABLE app_open ( > | uid STRING, > | st TIMESTAMP, > | ds INT > |) USING parquet PARTITIONED BY (ds); > |""".stripMargin) > sql( > """ > |create or replace temporary view group_by_error as WITH > new_app_open AS ( > | SELECT > |ao.* > | FROM > |app_open ao > |) > |SELECT > |uid, > |20230208 AS ds > | FROM > |new_app_open > | GROUP BY > |1, > |2 > |""".stripMargin) > sql( > """ > |select > | `uid` > |from > | group_by_error > |""".stripMargin).show(){code} > Spark will throw the following error > > > {code:java} > [GROUP_BY_POS_OUT_OF_RANGE] GROUP BY position 20230208 is not in select list > (valid range is [1, 2]).; line 9 pos 4 {code} > > > This is because the logical plan is not set as analyzed and it is analyzed > again. The analyzer rules about aggregation/sort ordinals are not idempotent. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42416) Dataset.show() should not resolve the analyzed logical plan again
[ https://issues.apache.org/jira/browse/SPARK-42416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42416: Assignee: Apache Spark (was: Gengliang Wang) > Dateset.show() should not resolve the analyzed logical plan again > - > > Key: SPARK-42416 > URL: https://issues.apache.org/jira/browse/SPARK-42416 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Major > > For the following query > > {code:java} > sql( > """ > |CREATE TABLE app_open ( > | uid STRING, > | st TIMESTAMP, > | ds INT > |) USING parquet PARTITIONED BY (ds); > |""".stripMargin) > sql( > """ > |create or replace temporary view group_by_error as WITH > new_app_open AS ( > | SELECT > |ao.* > | FROM > |app_open ao > |) > |SELECT > |uid, > |20230208 AS ds > | FROM > |new_app_open > | GROUP BY > |1, > |2 > |""".stripMargin) > sql( > """ > |select > | `uid` > |from > | group_by_error > |""".stripMargin).show(){code} > Spark will throw the following error > > > {code:java} > [GROUP_BY_POS_OUT_OF_RANGE] GROUP BY position 20230208 is not in select list > (valid range is [1, 2]).; line 9 pos 4 {code} > > > This is because the logical plan is not set as analyzed and it is analyzed > again. The analyzer rules about aggregation/sort ordinals are not idempotent. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42416) Dataset.show() should not resolve the analyzed logical plan again
[ https://issues.apache.org/jira/browse/SPARK-42416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687747#comment-17687747 ] Apache Spark commented on SPARK-42416: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/39988 > Dateset.show() should not resolve the analyzed logical plan again > - > > Key: SPARK-42416 > URL: https://issues.apache.org/jira/browse/SPARK-42416 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > For the following query > > {code:java} > sql( > """ > |CREATE TABLE app_open ( > | uid STRING, > | st TIMESTAMP, > | ds INT > |) USING parquet PARTITIONED BY (ds); > |""".stripMargin) > sql( > """ > |create or replace temporary view group_by_error as WITH > new_app_open AS ( > | SELECT > |ao.* > | FROM > |app_open ao > |) > |SELECT > |uid, > |20230208 AS ds > | FROM > |new_app_open > | GROUP BY > |1, > |2 > |""".stripMargin) > sql( > """ > |select > | `uid` > |from > | group_by_error > |""".stripMargin).show(){code} > Spark will throw the following error > > > {code:java} > [GROUP_BY_POS_OUT_OF_RANGE] GROUP BY position 20230208 is not in select list > (valid range is [1, 2]).; line 9 pos 4 {code} > > > This is because the logical plan is not set as analyzed and it is analyzed > again. The analyzer rules about aggregation/sort ordinals are not idempotent. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14922) Alter Table Drop Partition Using Predicate-based Partition Spec
[ https://issues.apache.org/jira/browse/SPARK-14922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-14922: -- Target Version/s: (was: 3.5.0) > Alter Table Drop Partition Using Predicate-based Partition Spec > --- > > Key: SPARK-14922 > URL: https://issues.apache.org/jira/browse/SPARK-14922 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.1.2, 2.2.1 >Reporter: Xiao Li >Priority: Major > > Below is allowed in Hive, but not allowed in Spark. > {noformat} > alter table ptestfilter drop partition (c='US', d<'2') > {noformat} > This example is copied from drop_partitions_filter.q -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41053) Better Spark UI scalability and Driver stability for large applications
[ https://issues.apache.org/jira/browse/SPARK-41053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-41053: -- Fix Version/s: 3.4.0 > Better Spark UI scalability and Driver stability for large applications > --- > > Key: SPARK-41053 > URL: https://issues.apache.org/jira/browse/SPARK-41053 > Project: Spark > Issue Type: Umbrella > Components: Spark Core, Web UI >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Labels: releasenotes > Fix For: 3.4.0 > > Attachments: Better Spark UI scalability and Driver stability for > large applications.pdf > > > After SPARK-18085, the Spark history server(SHS) becomes more scalable for > processing large applications by supporting a persistent > KV-store(LevelDB/RocksDB) as the storage layer. > As for the live Spark UI, all the data is still stored in memory, which can > bring memory pressures to the Spark driver for large applications. > For better Spark UI scalability and Driver stability, I propose to > * {*}Support storing all the UI data in a persistent KV store{*}. > RocksDB/LevelDB provides low memory overhead. Their write/read performance is > fast enough to serve the write/read workload for live UI. SHS can leverage > the persistent KV store to fasten its startup. > * *Support a new Protobuf serializer for all the UI data.* The new > serializer is supposed to be faster, according to benchmarks. It will be the > default serializer for the persistent KV store of live UI. As for event logs, > it is optional. The current serializer for UI data is JSON. When writing > persistent KV-store, there is GZip compression. Since there is compression > support in RocksDB/LevelDB, the new serializer won’t compress the output > before writing to the persistent KV store. 
Here is a benchmark of > writing/reading 100,000 SQLExecutionUIData to/from RocksDB: > > |*Serializer*|*Avg Write time(μs)*|*Avg Read time(μs)*|*RocksDB File Total > Size(MB)*|*Result total size in memory(MB)*| > |*Spark’s KV Serializer(JSON+gzip)*|352.2|119.26|837|868| > |*Protobuf*|109.9|34.3|858|2105| > I am also proposing to support RocksDB instead of both LevelDB & RocksDB in > the live UI. > SPIP: > [https://docs.google.com/document/d/1cuKnFwlTodyVhUQPMuakq2YDaLH05jaY9FRu_aD1zMo/edit?usp=sharing] > SPIP vote: https://lists.apache.org/thread/lom4zcob6237q6nnj46jylkzwmmsxvgj -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
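The trade-off in the benchmark table above (JSON+gzip is compact on disk but slow to encode and decode; Protobuf is roughly 3x faster but larger in memory) can be felt with a stdlib-only sketch of the JSON+gzip path. The payload shape below is invented for illustration and is not the real `SQLExecutionUIData` schema.

```python
import gzip
import json

# Sketch of the current "JSON + gzip" serializer path described above:
# encode a UI record as JSON, then compress before writing to the KV store.
ui_row = {
    "executionId": 1,
    "description": "collect at <console>:1",
    "jobs": [1, 2, 3],
    "completionTime": 1676246400000,
}

blob = gzip.compress(json.dumps(ui_row).encode("utf-8"))        # write path
restored = json.loads(gzip.decompress(blob).decode("utf-8"))    # read path
assert restored == ui_row  # lossless round trip, at CPU cost on both ends
```

The proposal skips the compression step for the Protobuf serializer because RocksDB/LevelDB already compress blocks internally.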
[jira] [Commented] (SPARK-41053) Better Spark UI scalability and Driver stability for large applications
[ https://issues.apache.org/jira/browse/SPARK-41053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687745#comment-17687745 ] Dongjoon Hyun commented on SPARK-41053: --- Thank you for leading and completing this, [~Gengliang.Wang]. I assigned this issue to Gengliang to shine his leadership. Thank you all. > Better Spark UI scalability and Driver stability for large applications > --- > > Key: SPARK-41053 > URL: https://issues.apache.org/jira/browse/SPARK-41053 > Project: Spark > Issue Type: Umbrella > Components: Spark Core, Web UI >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Labels: releasenotes > Attachments: Better Spark UI scalability and Driver stability for > large applications.pdf > > > After SPARK-18085, the Spark history server(SHS) becomes more scalable for > processing large applications by supporting a persistent > KV-store(LevelDB/RocksDB) as the storage layer. > As for the live Spark UI, all the data is still stored in memory, which can > bring memory pressures to the Spark driver for large applications. > For better Spark UI scalability and Driver stability, I propose to > * {*}Support storing all the UI data in a persistent KV store{*}. > RocksDB/LevelDB provides low memory overhead. Their write/read performance is > fast enough to serve the write/read workload for live UI. SHS can leverage > the persistent KV store to fasten its startup. > * *Support a new Protobuf serializer for all the UI data.* The new > serializer is supposed to be faster, according to benchmarks. It will be the > default serializer for the persistent KV store of live UI. As for event logs, > it is optional. The current serializer for UI data is JSON. When writing > persistent KV-store, there is GZip compression. Since there is compression > support in RocksDB/LevelDB, the new serializer won’t compress the output > before writing to the persistent KV store. 
Here is a benchmark of > writing/reading 100,000 SQLExecutionUIData to/from RocksDB: > > |*Serializer*|*Avg Write time(μs)*|*Avg Read time(μs)*|*RocksDB File Total > Size(MB)*|*Result total size in memory(MB)*| > |*Spark’s KV Serializer(JSON+gzip)*|352.2|119.26|837|868| > |*Protobuf*|109.9|34.3|858|2105| > I am also proposing to support RocksDB instead of both LevelDB & RocksDB in > the live UI. > SPIP: > [https://docs.google.com/document/d/1cuKnFwlTodyVhUQPMuakq2YDaLH05jaY9FRu_aD1zMo/edit?usp=sharing] > SPIP vote: https://lists.apache.org/thread/lom4zcob6237q6nnj46jylkzwmmsxvgj -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41053) Better Spark UI scalability and Driver stability for large applications
[ https://issues.apache.org/jira/browse/SPARK-41053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-41053: - Assignee: Gengliang Wang (was: Apache Spark) > Better Spark UI scalability and Driver stability for large applications > --- > > Key: SPARK-41053 > URL: https://issues.apache.org/jira/browse/SPARK-41053 > Project: Spark > Issue Type: Umbrella > Components: Spark Core, Web UI >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Labels: releasenotes > Attachments: Better Spark UI scalability and Driver stability for > large applications.pdf > > > After SPARK-18085, the Spark history server(SHS) becomes more scalable for > processing large applications by supporting a persistent > KV-store(LevelDB/RocksDB) as the storage layer. > As for the live Spark UI, all the data is still stored in memory, which can > bring memory pressures to the Spark driver for large applications. > For better Spark UI scalability and Driver stability, I propose to > * {*}Support storing all the UI data in a persistent KV store{*}. > RocksDB/LevelDB provides low memory overhead. Their write/read performance is > fast enough to serve the write/read workload for live UI. SHS can leverage > the persistent KV store to fasten its startup. > * *Support a new Protobuf serializer for all the UI data.* The new > serializer is supposed to be faster, according to benchmarks. It will be the > default serializer for the persistent KV store of live UI. As for event logs, > it is optional. The current serializer for UI data is JSON. When writing > persistent KV-store, there is GZip compression. Since there is compression > support in RocksDB/LevelDB, the new serializer won’t compress the output > before writing to the persistent KV store. 
Here is a benchmark of > writing/reading 100,000 SQLExecutionUIData to/from RocksDB: > > |*Serializer*|*Avg Write time(μs)*|*Avg Read time(μs)*|*RocksDB File Total > Size(MB)*|*Result total size in memory(MB)*| > |*Spark’s KV Serializer(JSON+gzip)*|352.2|119.26|837|868| > |*Protobuf*|109.9|34.3|858|2105| > I am also proposing to support RocksDB instead of both LevelDB & RocksDB in > the live UI. > SPIP: > [https://docs.google.com/document/d/1cuKnFwlTodyVhUQPMuakq2YDaLH05jaY9FRu_aD1zMo/edit?usp=sharing] > SPIP vote: https://lists.apache.org/thread/lom4zcob6237q6nnj46jylkzwmmsxvgj -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34645) [K8S] Driver pod stuck in Running state after job completes
[ https://issues.apache.org/jira/browse/SPARK-34645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687744#comment-17687744 ] Dongjoon Hyun commented on SPARK-34645: --- Thank you for sharing your experience and several combinations you tried, [~hussein-awala]. - Is the JVM terminated? - If not, what kind of JVM threads do you see in the driver pod? > [K8S] Driver pod stuck in Running state after job completes > --- > > Key: SPARK-34645 > URL: https://issues.apache.org/jira/browse/SPARK-34645 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.0.2 > Environment: Kubernetes: > {code:java} > Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.2", > GitCommit:"f5743093fd1c663cb0cbc89748f730662345d44d", GitTreeState:"clean", > BuildDate:"2020-09-16T13:41:02Z", GoVersion:"go1.15", Compiler:"gc", > Platform:"linux/amd64"} > Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.5", > GitCommit:"2166946f41b36dea2c4626f90a77706f426cdea2", GitTreeState:"clean", > BuildDate:"2019-03-25T15:19:22Z", GoVersion:"go1.11.5", Compiler:"gc", > Platform:"linux/amd64"} > {code} >Reporter: Andy Grove >Priority: Major > > I am running automated benchmarks in k8s, using spark-submit in cluster mode, > so the driver runs in a pod. > When running with Spark 3.0.1 and 3.1.1 everything works as expected and I > see the Spark context being shut down after the job completes. > However, when running with Spark 3.0.2 I do not see the context get shut down > and the driver pod is stuck in the Running state indefinitely. > This is the output I see after job completion with 3.0.1 and 3.1.1 and this > output does not appear with 3.0.2. With 3.0.2 there is no output at all after > the job completes. 
> {code:java} > 2021-03-05 20:09:24,576 INFO spark.SparkContext: Invoking stop() from > shutdown hook > 2021-03-05 20:09:24,592 INFO server.AbstractConnector: Stopped > Spark@784499d0{HTTP/1.1, (http/1.1)}{0.0.0.0:4040} > 2021-03-05 20:09:24,594 INFO ui.SparkUI: Stopped Spark web UI at > http://benchmark-runner-3e8a38780400e0d1-driver-svc.default.svc:4040 > 2021-03-05 20:09:24,599 INFO k8s.KubernetesClusterSchedulerBackend: Shutting > down all executors > 2021-03-05 20:09:24,600 INFO > k8s.KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each > executor to shut down > 2021-03-05 20:09:24,609 WARN k8s.ExecutorPodsWatchSnapshotSource: Kubernetes > client has been closed (this is expected if the application is shutting down.) > 2021-03-05 20:09:24,719 INFO spark.MapOutputTrackerMasterEndpoint: > MapOutputTrackerMasterEndpoint stopped! > 2021-03-05 20:09:24,736 INFO memory.MemoryStore: MemoryStore cleared > 2021-03-05 20:09:24,738 INFO storage.BlockManager: BlockManager stopped > 2021-03-05 20:09:24,744 INFO storage.BlockManagerMaster: BlockManagerMaster > stopped > 2021-03-05 20:09:24,752 INFO > scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: > OutputCommitCoordinator stopped! > 2021-03-05 20:09:24,768 INFO spark.SparkContext: Successfully stopped > SparkContext > 2021-03-05 20:09:24,768 INFO util.ShutdownHookManager: Shutdown hook called > 2021-03-05 20:09:24,769 INFO util.ShutdownHookManager: Deleting directory > /var/data/spark-67fa44df-e86c-463a-a149-25d95817ff8e/spark-a5476c14-c103-4108-b733-961400485d8a > 2021-03-05 20:09:24,772 INFO util.ShutdownHookManager: Deleting directory > /tmp/spark-9d6261f5-4394-472b-9c9a-e22bde877814 > 2021-03-05 20:09:24,778 INFO impl.MetricsSystemImpl: Stopping s3a-file-system > metrics system... > 2021-03-05 20:09:24,779 INFO impl.MetricsSystemImpl: s3a-file-system metrics > system stopped. > 2021-03-05 20:09:24,779 INFO impl.MetricsSystemImpl: s3a-file-system metrics > system shutdown complete. 
> {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41775) Implement training functions as input
[ https://issues.apache.org/jira/browse/SPARK-41775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687743#comment-17687743 ] Apache Spark commented on SPARK-41775: -- User 'rithwik-db' has created a pull request for this issue: https://github.com/apache/spark/pull/39987 > Implement training functions as input > - > > Key: SPARK-41775 > URL: https://issues.apache.org/jira/browse/SPARK-41775 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.4.0 >Reporter: Rithwik Ediga Lakhamsani >Assignee: Rithwik Ediga Lakhamsani >Priority: Major > Fix For: 3.4.0 > > > Sidenote: make formatting updates described in > https://github.com/apache/spark/pull/39188 > > Currently, `Distributor().run(...)` takes only files as input. Now we will > add in additional functionality to take in functions as well. This will > require us to go through the following process on each task in the executor > nodes: > 1. take the input function and args and pickle them > 2. Create a temp train.py file that looks like > {code:java} > import cloudpickle > import os > if _name_ == "_main_": > train, args = cloudpickle.load(f"{tempdir}/train_input.pkl") > output = train(*args) > if output and os.environ.get("RANK", "") == "0": # this is for > partitionId == 0 > cloudpickle.dump(f"{tempdir}/train_output.pkl") {code} > 3. Run that train.py file with `torchrun` > 4. Check if `train_output.pkl` has been created on process on partitionId == > 0, if it has, then deserialize it and return that output through `.collect()` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
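The four steps in the description above can be sketched end-to-end with the stdlib `pickle` module standing in for cloudpickle (cloudpickle additionally serializes lambdas and closures by value, which plain `pickle` cannot). The `train` function, its arguments, and the file names follow the description; the `RANK` default is an adaptation so the sketch runs standalone, and everything runs in one process rather than under `torchrun`.

```python
import os
import pickle
import tempfile

def train(lr, epochs):
    # Stand-in training function; returns a fake "model" dict.
    return {"lr": lr, "epochs": epochs, "loss": 0.1}

tempdir = tempfile.mkdtemp()

# Step 1: pickle the function and its args (cloudpickle in the real design).
with open(os.path.join(tempdir, "train_input.pkl"), "wb") as f:
    pickle.dump((train, (0.01, 5)), f)

# Steps 2-3: what the generated train.py does under torchrun, inlined here.
with open(os.path.join(tempdir, "train_input.pkl"), "rb") as f:
    fn, args = pickle.load(f)
output = fn(*args)
if output is not None and os.environ.get("RANK", "0") == "0":  # partition 0 only
    with open(os.path.join(tempdir, "train_output.pkl"), "wb") as f:
        pickle.dump(output, f)

# Step 4: the driver deserializes the rank-0 output (via .collect() in Spark).
with open(os.path.join(tempdir, "train_output.pkl"), "rb") as f:
    result = pickle.load(f)
```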
[jira] [Assigned] (SPARK-42323) Assign name to _LEGACY_ERROR_TEMP_2332
[ https://issues.apache.org/jira/browse/SPARK-42323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42323: Assignee: (was: Apache Spark) > Assign name to _LEGACY_ERROR_TEMP_2332 > -- > > Key: SPARK-42323 > URL: https://issues.apache.org/jira/browse/SPARK-42323 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major >
[jira] [Assigned] (SPARK-42323) Assign name to _LEGACY_ERROR_TEMP_2332
[ https://issues.apache.org/jira/browse/SPARK-42323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42323: Assignee: Apache Spark > Assign name to _LEGACY_ERROR_TEMP_2332 > -- > > Key: SPARK-42323 > URL: https://issues.apache.org/jira/browse/SPARK-42323 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Apache Spark >Priority: Major >
[jira] [Commented] (SPARK-42323) Assign name to _LEGACY_ERROR_TEMP_2332
[ https://issues.apache.org/jira/browse/SPARK-42323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687742#comment-17687742 ] Apache Spark commented on SPARK-42323: -- User 'itholic' has created a pull request for this issue: https://github.com/apache/spark/pull/39977 > Assign name to _LEGACY_ERROR_TEMP_2332 > -- > > Key: SPARK-42323 > URL: https://issues.apache.org/jira/browse/SPARK-42323 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major >
[jira] [Assigned] (SPARK-42416) Dataset.show() should not resolve the analyzed logical plan again
[ https://issues.apache.org/jira/browse/SPARK-42416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang reassigned SPARK-42416: -- Assignee: Gengliang Wang > Dataset.show() should not resolve the analyzed logical plan again > - > > Key: SPARK-42416 > URL: https://issues.apache.org/jira/browse/SPARK-42416 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > For the following query > > {code:java} > sql( > """ > |CREATE TABLE app_open ( > | uid STRING, > | st TIMESTAMP, > | ds INT > |) USING parquet PARTITIONED BY (ds); > |""".stripMargin) > sql( > """ > |create or replace temporary view group_by_error as WITH > new_app_open AS ( > | SELECT > |ao.* > | FROM > |app_open ao > |) > |SELECT > |uid, > |20230208 AS ds > | FROM > |new_app_open > | GROUP BY > |1, > |2 > |""".stripMargin) > sql( > """ > |select > | `uid` > |from > | group_by_error > |""".stripMargin).show(){code} > Spark will throw the following error > > > {code:java} > [GROUP_BY_POS_OUT_OF_RANGE] GROUP BY position 20230208 is not in select list > (valid range is [1, 2]).; line 9 pos 4 {code} > > > This is because the logical plan is not set as analyzed and it is analyzed > again. The analyzer rules about aggregation/sort ordinals are not idempotent.
[jira] [Created] (SPARK-42416) Dataset.show() should not resolve the analyzed logical plan again
Gengliang Wang created SPARK-42416: -- Summary: Dataset.show() should not resolve the analyzed logical plan again Key: SPARK-42416 URL: https://issues.apache.org/jira/browse/SPARK-42416 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.4.0 Reporter: Gengliang Wang For the following query {code:java} sql( """ |CREATE TABLE app_open ( | uid STRING, | st TIMESTAMP, | ds INT |) USING parquet PARTITIONED BY (ds); |""".stripMargin) sql( """ |create or replace temporary view group_by_error as WITH new_app_open AS ( | SELECT |ao.* | FROM |app_open ao |) |SELECT |uid, |20230208 AS ds | FROM |new_app_open | GROUP BY |1, |2 |""".stripMargin) sql( """ |select | `uid` |from | group_by_error |""".stripMargin).show(){code} Spark will throw the following error {code:java} [GROUP_BY_POS_OUT_OF_RANGE] GROUP BY position 20230208 is not in select list (valid range is [1, 2]).; line 9 pos 4 {code} This is because the logical plan is not set as analyzed and it is analyzed again. The analyzer rules about aggregation/sort ordinals are not idempotent.
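The non-idempotence can be seen with a toy model of the ordinal-resolution rule (a sketch, not Spark's actual analyzer code): a first pass replaces GROUP BY 1, 2 with the select-list expressions, and a second pass over the already-resolved plan then misreads the resolved literal 20230208 as an out-of-range ordinal.

```python
def resolve_group_by_ordinals(select_list, group_by):
    # Toy version of the analyzer rule: an integer in GROUP BY is treated
    # as a 1-based position into the select list.
    resolved = []
    for item in group_by:
        if isinstance(item, int):
            if not 1 <= item <= len(select_list):
                raise ValueError(
                    f"[GROUP_BY_POS_OUT_OF_RANGE] GROUP BY position {item} "
                    f"is not in select list (valid range is [1, {len(select_list)}])"
                )
            resolved.append(select_list[item - 1])
        else:
            resolved.append(item)
    return resolved

select_list = ["uid", 20230208]  # SELECT uid, 20230208 AS ds
once = resolve_group_by_ordinals(select_list, [1, 2])
# First pass: [1, 2] -> ["uid", 20230208], as intended.
# Applying the rule a second time to the resolved result treats the
# literal 20230208 as an ordinal and fails -- hence the plan must be
# marked analyzed so it is never resolved again.
```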
[jira] [Created] (SPARK-42415) The built-in dialects support OFFSET and paging query.
jiaan.geng created SPARK-42415: -- Summary: The built-in dialects support OFFSET and paging query. Key: SPARK-42415 URL: https://issues.apache.org/jira/browse/SPARK-42415 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.5.0 Reporter: jiaan.geng
[jira] [Resolved] (SPARK-42269) Support complex return types in DDL strings
[ https://issues.apache.org/jira/browse/SPARK-42269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-42269. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39964 [https://github.com/apache/spark/pull/39964] > Support complex return types in DDL strings > --- > > Key: SPARK-42269 > URL: https://issues.apache.org/jira/browse/SPARK-42269 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.4.0 > > > {code} > # Spark Connect > >>> spark.range(2).select(udf(lambda x: (x, x), "struct<x:integer, y:integer>")("id")) > ... > AssertionError: returnType should be singular > # vanilla PySpark > >>> spark.range(2).select(udf(lambda x: (x, x), "struct<x:integer, y:integer>")("id")) > DataFrame[<lambda>(id): struct<x:int,y:int>] > {code}
[jira] [Assigned] (SPARK-42269) Support complex return types in DDL strings
[ https://issues.apache.org/jira/browse/SPARK-42269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-42269: Assignee: Xinrong Meng > Support complex return types in DDL strings > --- > > Key: SPARK-42269 > URL: https://issues.apache.org/jira/browse/SPARK-42269 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > > {code} > # Spark Connect > >>> spark.range(2).select(udf(lambda x: (x, x), "struct<x:integer, y:integer>")("id")) > ... > AssertionError: returnType should be singular > # vanilla PySpark > >>> spark.range(2).select(udf(lambda x: (x, x), "struct<x:integer, y:integer>")("id")) > DataFrame[<lambda>(id): struct<x:int,y:int>] > {code}
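One reason nested DDL return types like the one above need real parsing rather than naive string handling: splitting a struct's field list on commas must respect `<...>` nesting. A minimal sketch (illustrative only; the function name is hypothetical and this is not PySpark's actual type parser):

```python
def split_top_level_fields(field_list):
    # Split a DDL struct field list on commas that are NOT nested inside
    # angle brackets, so "b:struct<c:int, d:string>" stays in one piece.
    parts, cur, depth = [], [], 0
    for ch in field_list:
        if ch == "<":
            depth += 1
        elif ch == ">":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(cur).strip())
            cur = []
        else:
            cur.append(ch)
    parts.append("".join(cur).strip())
    return parts
```

A flat `split(",")` would break the nested struct into fragments; depth tracking is the minimum needed to treat `struct<x:integer, y:integer>` as a single (non-singular) return type.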
[jira] [Resolved] (SPARK-42034) QueryExecutionListener and Observation API, df.observe do not work with `foreach` action.
[ https://issues.apache.org/jira/browse/SPARK-42034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-42034. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 39976 [https://github.com/apache/spark/pull/39976] > QueryExecutionListener and Observation API, df.observe do not work with > `foreach` action. > - > > Key: SPARK-42034 > URL: https://issues.apache.org/jira/browse/SPARK-42034 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.3, 3.2.2, 3.3.1 > Environment: I test it locally and on YARN in cluster mode. > Spark 3.3.1 and 3.2.2 and 3.1.1. > Yarn 2.9.2 and 3.2.1. >Reporter: Nick Hryhoriev >Assignee: ming95 >Priority: Major > Labels: sql-api > Fix For: 3.5.0 > > > The Observation API, the {{observe}} dataframe transformation, and custom QueryExecutionListeners do not work with the {{foreach}} or {{foreachPartition}} actions. This is because QueryExecutionListener functions do not trigger on queries whose action is {{foreach}} or {{foreachPartition}}. The Spark GUI SQL tab, however, still treats such a query as a SQL query and shows its query plans. > Here is the code to reproduce it: > https://gist.github.com/GrigorievNick/e7cf9ec5584b417d9719e2812722e6d3
[jira] [Assigned] (SPARK-42034) QueryExecutionListener and Observation API, df.observe do not work with `foreach` action.
[ https://issues.apache.org/jira/browse/SPARK-42034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-42034: Assignee: ming95 > QueryExecutionListener and Observation API, df.observe do not work with > `foreach` action. > - > > Key: SPARK-42034 > URL: https://issues.apache.org/jira/browse/SPARK-42034 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.3, 3.2.2, 3.3.1 > Environment: I test it locally and on YARN in cluster mode. > Spark 3.3.1 and 3.2.2 and 3.1.1. > Yarn 2.9.2 and 3.2.1. >Reporter: Nick Hryhoriev >Assignee: ming95 >Priority: Major > Labels: sql-api > > The Observation API, the {{observe}} dataframe transformation, and custom QueryExecutionListeners do not work with the {{foreach}} or {{foreachPartition}} actions. This is because QueryExecutionListener functions do not trigger on queries whose action is {{foreach}} or {{foreachPartition}}. The Spark GUI SQL tab, however, still treats such a query as a SQL query and shows its query plans. > Here is the code to reproduce it: > https://gist.github.com/GrigorievNick/e7cf9ec5584b417d9719e2812722e6d3
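The reported gap can be modeled with a toy observer pattern (illustrative only; these class names are hypothetical and this is not Spark's implementation): collect-style actions notify registered listeners, while foreach-style actions execute but never fire the callback.

```python
class RecordingListener:
    # Stand-in for a QueryExecutionListener that records successful actions.
    def __init__(self):
        self.seen = []

    def on_success(self, action_name):
        self.seen.append(action_name)

class MiniDataFrame:
    """Toy DataFrame reproducing the behavior described in the issue."""
    def __init__(self, rows, listeners):
        self.rows = rows
        self.listeners = listeners

    def collect(self):
        result = list(self.rows)
        for listener in self.listeners:  # collect notifies listeners
            listener.on_success("collect")
        return result

    def foreach(self, fn):
        for row in self.rows:  # runs the query, but never notifies
            fn(row)            # listeners -- the bug this issue fixes
```

After one `collect()` and one `foreach()`, the listener has seen only `"collect"`, which is exactly why `df.observe` metrics silently go missing under `foreach`.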
[jira] [Resolved] (SPARK-42331) Fix metadata col can not been resolved
[ https://issues.apache.org/jira/browse/SPARK-42331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-42331. - Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39870 [https://github.com/apache/spark/pull/39870] > Fix metadata col can not been resolved > -- > > Key: SPARK-42331 > URL: https://issues.apache.org/jira/browse/SPARK-42331 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: XiDuo You >Assignee: Apache Spark >Priority: Major > Fix For: 3.4.0 > >
[jira] [Assigned] (SPARK-42331) Fix metadata col can not been resolved
[ https://issues.apache.org/jira/browse/SPARK-42331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-42331: --- Assignee: XiDuo You (was: Apache Spark) > Fix metadata col can not been resolved > -- > > Key: SPARK-42331 > URL: https://issues.apache.org/jira/browse/SPARK-42331 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Major > Fix For: 3.4.0 > >
[jira] [Assigned] (SPARK-42331) Fix metadata col can not been resolved
[ https://issues.apache.org/jira/browse/SPARK-42331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-42331: --- Assignee: Apache Spark > Fix metadata col can not been resolved > -- > > Key: SPARK-42331 > URL: https://issues.apache.org/jira/browse/SPARK-42331 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: XiDuo You >Assignee: Apache Spark >Priority: Major >
[jira] [Created] (SPARK-42414) Make suggestion in error message smarter
Haejoon Lee created SPARK-42414: --- Summary: Make suggestion in error message smarter Key: SPARK-42414 URL: https://issues.apache.org/jira/browse/SPARK-42414 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: Haejoon Lee The current suggestion message, e.g. `UNRESOLVED_COLUMN.WITH_SUGGESTION`, just dumps out all columns, so it is not very helpful for users. It would be better to show only the most closely related columns in the user-facing error message.
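One way to realize "most closely related columns only", sketched with stdlib difflib (the function name and cutoff value are illustrative assumptions, not Spark's actual implementation):

```python
import difflib

def suggest_columns(unresolved, columns, n=3):
    # Rank candidate columns by string similarity and surface only the
    # closest matches, instead of dumping every column into the error.
    return difflib.get_close_matches(unresolved, columns, n=n, cutoff=0.4)
```

For a typo like `user_idd` against a wide schema, this yields `user_id` first rather than an undifferentiated list of every column in the relation.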
[jira] [Assigned] (SPARK-42391) Close Live AppStore in the finally block for test cases
[ https://issues.apache.org/jira/browse/SPARK-42391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-42391: - Assignee: Yang Jie > Close Live AppStore in the finally block for test cases > --- > > Key: SPARK-42391 > URL: https://issues.apache.org/jira/browse/SPARK-42391 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > > AppStatusStore#createLiveStore returns a RocksDB-backed AppStatusStore when `LIVE_UI_LOCAL_STORE_DIR` is configured; it should be closed in a finally block to release resources in test cases
[jira] [Resolved] (SPARK-42391) Close Live AppStore in the finally block for test cases
[ https://issues.apache.org/jira/browse/SPARK-42391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-42391. --- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 39961 [https://github.com/apache/spark/pull/39961] > Close Live AppStore in the finally block for test cases > --- > > Key: SPARK-42391 > URL: https://issues.apache.org/jira/browse/SPARK-42391 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.5.0 > > > AppStatusStore#createLiveStore returns a RocksDB-backed AppStatusStore when `LIVE_UI_LOCAL_STORE_DIR` is configured; it should be closed in a finally block to release resources in test cases
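The cleanup pattern being applied, sketched in Python (the helper and store names are hypothetical; the actual change is in Spark's Scala test suites): the store must be closed even when the test body throws, so the RocksDB-backed store's file handles are always released.

```python
def with_store(create_store, test_body):
    # Open the store, run the test, and close the store in a finally
    # block so resources are released even if test_body raises.
    store = create_store()
    try:
        return test_body(store)
    finally:
        store.close()
```

Without the finally block, a failing assertion would leak the store and, with a RocksDB backend, leave its on-disk state locked for subsequent tests.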
[jira] [Commented] (SPARK-42410) Support Scala 2.12/2.13 tests in connect module
[ https://issues.apache.org/jira/browse/SPARK-42410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687679#comment-17687679 ] Apache Spark commented on SPARK-42410: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/39986 > Support Scala 2.12/2.13 tests in connect module > --- > > Key: SPARK-42410 > URL: https://issues.apache.org/jira/browse/SPARK-42410 > Project: Spark > Issue Type: Bug > Components: Connect, Tests >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.4.0 > > > {code} > $ build/sbt -Dscala.version=2.13.8 -Pscala-2.13 -Phadoop-3 assembly/package > "connect/test" > {code}
[jira] [Assigned] (SPARK-42263) Implement `spark.catalog.registerFunction`
[ https://issues.apache.org/jira/browse/SPARK-42263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42263: Assignee: (was: Apache Spark) > Implement `spark.catalog.registerFunction` > -- > > Key: SPARK-42263 > URL: https://issues.apache.org/jira/browse/SPARK-42263 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major >
[jira] [Assigned] (SPARK-42412) Initial prototype implementation for PySparkML
[ https://issues.apache.org/jira/browse/SPARK-42412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42412: Assignee: Apache Spark (was: Weichen Xu) > Initial prototype implementation for PySparkML > -- > > Key: SPARK-42412 > URL: https://issues.apache.org/jira/browse/SPARK-42412 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.5.0 >Reporter: Weichen Xu >Assignee: Apache Spark >Priority: Major >
[jira] [Assigned] (SPARK-42412) Initial prototype implementation for PySparkML
[ https://issues.apache.org/jira/browse/SPARK-42412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42412: Assignee: Weichen Xu (was: Apache Spark) > Initial prototype implementation for PySparkML > -- > > Key: SPARK-42412 > URL: https://issues.apache.org/jira/browse/SPARK-42412 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.5.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Major >
[jira] [Commented] (SPARK-42263) Implement `spark.catalog.registerFunction`
[ https://issues.apache.org/jira/browse/SPARK-42263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687678#comment-17687678 ] Apache Spark commented on SPARK-42263: -- User 'xinrong-meng' has created a pull request for this issue: https://github.com/apache/spark/pull/39984 > Implement `spark.catalog.registerFunction` > -- > > Key: SPARK-42263 > URL: https://issues.apache.org/jira/browse/SPARK-42263 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major >
[jira] [Assigned] (SPARK-42263) Implement `spark.catalog.registerFunction`
[ https://issues.apache.org/jira/browse/SPARK-42263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42263: Assignee: Apache Spark > Implement `spark.catalog.registerFunction` > -- > > Key: SPARK-42263 > URL: https://issues.apache.org/jira/browse/SPARK-42263 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Apache Spark >Priority: Major >
[jira] [Commented] (SPARK-42412) Initial prototype implementation for PySparkML
[ https://issues.apache.org/jira/browse/SPARK-42412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687677#comment-17687677 ] Apache Spark commented on SPARK-42412: -- User 'WeichenXu123' has created a pull request for this issue: https://github.com/apache/spark/pull/39985 > Initial prototype implementation for PySparkML > -- > > Key: SPARK-42412 > URL: https://issues.apache.org/jira/browse/SPARK-42412 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.5.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Major >
[jira] [Resolved] (SPARK-42413) Upgrade zstd-jni from 1.5.2-5 to 1.5.4-1
[ https://issues.apache.org/jira/browse/SPARK-42413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BingKun Pan resolved SPARK-42413. - Resolution: Duplicate > Upgrade zstd-jni from 1.5.2-5 to 1.5.4-1 > > > Key: SPARK-42413 > URL: https://issues.apache.org/jira/browse/SPARK-42413 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Priority: Minor >
[jira] [Closed] (SPARK-42413) Upgrade zstd-jni from 1.5.2-5 to 1.5.4-1
[ https://issues.apache.org/jira/browse/SPARK-42413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BingKun Pan closed SPARK-42413. --- Duplicate > Upgrade zstd-jni from 1.5.2-5 to 1.5.4-1 > > > Key: SPARK-42413 > URL: https://issues.apache.org/jira/browse/SPARK-42413 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Priority: Minor >
[jira] [Created] (SPARK-42413) Upgrade zstd-jni from 1.5.2-5 to 1.5.4-1
BingKun Pan created SPARK-42413: --- Summary: Upgrade zstd-jni from 1.5.2-5 to 1.5.4-1 Key: SPARK-42413 URL: https://issues.apache.org/jira/browse/SPARK-42413 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.5.0 Reporter: BingKun Pan
[jira] [Commented] (SPARK-42411) Better support for Istio service mesh while running Spark on Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-42411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687664#comment-17687664 ] Holden Karau commented on SPARK-42411: -- The other option is something around `spark.network.crypto.enabled` > Better support for Istio service mesh while running Spark on Kubernetes > --- > > Key: SPARK-42411 > URL: https://issues.apache.org/jira/browse/SPARK-42411 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 3.2.3 >Reporter: Puneet >Priority: Major > > h3. Support for Strict MTLS > In strict MTLS Peer Authentication, Istio requires each pod to be associated with a service identity (as this allows listeners to use the correct cert and chain). Without the service identity, communication goes through the passthrough cluster, which is not permitted in strict mode. The community is still investigating communication through IPs with strict MTLS [https://github.com/istio/istio/issues/37431#issuecomment-1412831780]. Today the Spark backend creates a service record for the driver; however, executor pods register with the driver using their Pod IPs. In this model, therefore, the TLS handshake would fail between driver and executor and also between executors. As part of this Jira we want to similarly add service records for the executor pods as well. This can be achieved by adding an ExecutorServiceFeatureStep similar to the existing DriverServiceFeatureStep > h3. Allowing binding to all IPs > Before Istio 1.10 the istio-proxy sidecar was forwarding outside traffic to localhost of the pod. Thus if the application container is binding only to the Pod IP, the traffic would not be forwarded to it. This was addressed in 1.10 [https://istio.io/latest/blog/2021/upcoming-networking-changes]. However, the old behavior is still accessible through disabling the feature flag PILOT_ENABLE_INBOUND_PASSTHROUGH. The request to remove it has had some push back [https://github.com/istio/istio/issues/37642]. In the current implementation the Spark K8s backend does not allow passing a bind address for the driver [https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/DriverServiceFeatureStep.scala#L35]; however, as part of this Jira we want to allow passing a bind address even in Kubernetes mode so long as the bind address is 0.0.0.0. This lets the user choose the behavior depending on the state of PILOT_ENABLE_INBOUND_PASSTHROUGH in her Istio cluster. > h3. Better support for istio-proxy sidecar lifecycle management > In an Istio-enabled cluster, istio-proxy sidecars would be auto-injected into driver/executor pods. If the application is ephemeral, then driver and executor containers would exit; however, the istio-proxy container would continue to run. This causes driver/executor pods to enter the NotReady state. As part of this Jira we want the ability to run a post-stop cleanup after the driver/executor container is completed. Similarly, we also want to add support for a pre-startup script, which can ensure, for example, that the istio-sidecar is up before the executor/driver container gets started.
[jira] [Created] (SPARK-42412) Initial prototype implementation for PySparkML
Weichen Xu created SPARK-42412: -- Summary: Initial prototype implementation for PySparkML Key: SPARK-42412 URL: https://issues.apache.org/jira/browse/SPARK-42412 Project: Spark Issue Type: Sub-task Components: Connect, PySpark Affects Versions: 3.5.0 Reporter: Weichen Xu
[jira] [Updated] (SPARK-42263) Implement `spark.catalog.registerFunction`
[ https://issues.apache.org/jira/browse/SPARK-42263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-42263: - Parent: SPARK-41661 Issue Type: Sub-task (was: Improvement) > Implement `spark.catalog.registerFunction` > -- > > Key: SPARK-42263 > URL: https://issues.apache.org/jira/browse/SPARK-42263 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major >
[jira] [Assigned] (SPARK-42412) Initial prototype implementation for PySparkML
[ https://issues.apache.org/jira/browse/SPARK-42412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu reassigned SPARK-42412: -- Assignee: Weichen Xu > Initial prototype implementation for PySparkML > -- > > Key: SPARK-42412 > URL: https://issues.apache.org/jira/browse/SPARK-42412 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.5.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Major >
[jira] [Updated] (SPARK-42411) Better support for Istio service mesh while running Spark on Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-42411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Puneet updated SPARK-42411: --- Description: h3. Support for Strict MTLS In strict MTLS Peer Authentication Istio requires each pod to be associated with a service identity (as this allows listeners to use the correct cert and chain). Without the service identity communication goes through passthrough cluster which is not permitted in strict mode. Community is still investigating communication through IPs with strict MTLS [https://github.com/istio/istio/issues/37431#issuecomment-1412831780]. Today Spark backend creates a service record for driver however executor pods register with driver using their Pod IPs. In this model therefore, TLS handshake would fail between driver and executor and also between executors. As part of this Jira we want to similarly add service records for the executor pods as well. This can be achieved by adding a ExecutorServiceFeatureStep similar to existing DriverServiceFeatureStep h3. Allowing binding to all IPs Before Istio 1.10 the istio-proxy sidecar was forwarding outside traffic to localhost of the pod. Thus if the application container is binding only to Pod IP the traffic would not be forwarded to it. This was addressed in 1.10 [https://istio.io/latest/blog/2021/upcoming-networking-changes]. However the old behavior is still accessible through disabling the feature flag PILOT_ENABLE_INBOUND_PASSTHROUGH. Request to remove it has had some push back [https://github.com/istio/istio/issues/37642]. In current implementation Spark K8s backend does not allow to pass bind address for driver [https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/DriverServiceFeatureStep.scala#L35] however as part of this Jira we want to allow passing of bind address even in Kubernetes mode so long as the bind address is 0.0.0.0. 
This lets user choose the behavior depending on the state of PILOT_ENABLE_INBOUND_PASSTHROUGH in her Istio cluster. h3. Better support for istio-proxy sidecar lifecycle management In istio-enabled cluster istio-proxy sidecars would be auto-injected to driver/executor pods. If the application is ephemeral then driver and executor containers would exit, however istio-proxy container would continue to run. This causes driver/executor pods to enter NotReady state. As part of this jira we want ability to run a post stop cleanup after driver/executor container is completed. Similarly we also want to add support for a pre start up script, which can ensure for example that istio-sidecar is up before executor/driver container gets started. was: h3. Support for Strict MTLS In strict MTLS Peer Authentication Istio requires each pod to be associated with a service identity (as this allows listeners to use the correct cert and chain). Without the service identity communication goes through passthrough cluster which is not permitted in strict mode. Community is still investigating communication through IPs with strict MTLS https://github.com/istio/istio/issues/37431#issuecomment-1412831780. Today Spark backend creates a service record for driver however executor pods register with pod ip with driver. In this model therefore, TLS handshake would fail between driver and executor and also between executors. As part of this jira we want to similarly add service records for the executor pods as well. This can be achieved by adding a ExecutorServiceFeatureStep similar to existing DriverServiceFeatureStep h3. Allowing binding to all IPs Before Istio 1.10 the istio-proxy sidecar was forwarding outside traffic to localhost of the pod. Thus is the application container is binding only to Pod IP the traffic would not be forwarded to it. This was addressed in 1.10 https://istio.io/latest/blog/2021/upcoming-networking-changes. 
However the old behavior is still accessible through disabling the feature flag PILOT_ENABLE_INBOUND_PASSTHROUGH. Request to remove it has had some push back https://github.com/istio/istio/issues/37642. In current implementation Spark K8s backend does not allow to pass bind address for driver https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/DriverServiceFeatureStep.scala#L35 however as part of this jira we want to allow passing of bind address even in Kubernetes mode so long as the bind address is 0.0.0.0. This lets the user choose the behavior depending on the state of PILOT_ENABLE_INBOUND_PASSTHROUGH in her Istio cluster. h3. Better support for istio-proxy sidecar lifecycle management In istio-enabled cluster istio-proxy sidecars would be auto-injected to driver/executor pods. If the application is ephemeral then driver and executor containers would exit, however istio-proxy container would continue to run.
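The per-executor service idea in the description above can be sketched in isolation. The helper below derives the in-cluster DNS name a driver could use for an executor once a headless Service exists for it; the `-svc` naming convention and the function itself are illustrative assumptions, not the actual ExecutorServiceFeatureStep proposed in this Jira.

```python
# Sketch only: models the "service record per executor" idea from the
# description. The "-svc" suffix is an assumed naming convention; the real
# ExecutorServiceFeatureStep would live in Spark's Kubernetes backend.

def executor_service_dns(pod_name: str, namespace: str) -> str:
    """Return the in-cluster DNS name of a per-executor headless Service.

    With such a Service in place, Istio can tie the executor to a service
    identity instead of a bare Pod IP, which strict MTLS requires.
    """
    service_name = f"{pod_name}-svc"  # assumed convention
    return f"{service_name}.{namespace}.svc.cluster.local"


print(executor_service_dns("spark-exec-1", "spark-apps"))
# spark-exec-1-svc.spark-apps.svc.cluster.local
```

Registering executors under names like this, instead of their Pod IPs, is what would let the TLS handshake succeed between driver and executors in strict mode.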
[jira] [Resolved] (SPARK-42410) Support Scala 2.12/2.13 tests in connect module
[ https://issues.apache.org/jira/browse/SPARK-42410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-42410. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39982 [https://github.com/apache/spark/pull/39982] > Support Scala 2.12/2.13 tests in connect module > --- > > Key: SPARK-42410 > URL: https://issues.apache.org/jira/browse/SPARK-42410 > Project: Spark > Issue Type: Bug > Components: Connect, Tests >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.4.0 > > > {code} > $ build/sbt -Dscala.version=2.13.8 -Pscala-2.13 -Phadoop-3 assembly/package > "connect/test" > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42410) Support Scala 2.12/2.13 tests in connect module
[ https://issues.apache.org/jira/browse/SPARK-42410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-42410: - Assignee: Dongjoon Hyun > Support Scala 2.12/2.13 tests in connect module > --- > > Key: SPARK-42410 > URL: https://issues.apache.org/jira/browse/SPARK-42410 > Project: Spark > Issue Type: Bug > Components: Connect, Tests >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > > {code} > $ build/sbt -Dscala.version=2.13.8 -Pscala-2.13 -Phadoop-3 assembly/package > "connect/test" > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42411) Better support for Spark on Kubernetes while using Istio service mesh
[ https://issues.apache.org/jira/browse/SPARK-42411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Puneet updated SPARK-42411: --- Summary: Better support for Spark on Kubernetes while using Istio service mesh (was: Add support for istio in strict MTLS PeerAuthentication) > Better support for Spark on Kubernetes while using Istio service mesh > - > > Key: SPARK-42411 > URL: https://issues.apache.org/jira/browse/SPARK-42411 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 3.2.3 >Reporter: Puneet >Priority: Major > > h3. Support for Strict MTLS > In strict MTLS Peer Authentication Istio requires each pod to be associated > with a service identity (as this allows listeners to use the correct cert and > chain). Without the service identity communication goes through passthrough > cluster which is not permitted in strict mode. Community is still > investigating communication through IPs with strict MTLS > https://github.com/istio/istio/issues/37431#issuecomment-1412831780. Today > Spark backend creates a service record for driver however executor pods > register with pod ip with driver. In this model therefore, TLS handshake > would fail between driver and executor and also between executors. As part of > this jira we want to similarly add service records for the executor pods as > well. This can be achieved by adding an ExecutorServiceFeatureStep similar to > existing DriverServiceFeatureStep > h3. Allowing binding to all IPs > Before Istio 1.10 the istio-proxy sidecar was forwarding outside traffic to > localhost of the pod. Thus if the application container is binding only to > Pod IP the traffic would not be forwarded to it. This was addressed in 1.10 > https://istio.io/latest/blog/2021/upcoming-networking-changes. However the > old behavior is still accessible through disabling the feature flag > PILOT_ENABLE_INBOUND_PASSTHROUGH. 
Request to remove it has had some push back > https://github.com/istio/istio/issues/37642. In current implementation Spark > K8s backend does not allow to pass bind address for driver > https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/DriverServiceFeatureStep.scala#L35 > however as part of this jira we want to allow passing of bind address even > in Kubernetes mode so long as the bind address is 0.0.0.0. This lets user > choose the behavior depending on the state of PILOT_ENABLE_INBOUND_PASSTHROUGH > in her Istio cluster. > h3. Better support for istio-proxy sidecar lifecycle management > In istio-enabled cluster istio-proxy sidecars would be auto-injected to > driver/executor pods. If the application is ephemeral then driver and > executor containers would exit, however istio-proxy container would continue > to run. This causes driver/executor pods to enter NotReady state. As part of > this jira we want ability to run a post stop cleanup after driver/executor > container is completed. Similarly we also want to add support for a pre start > up script, which can ensure for example that istio-sidecar is up before > executor/driver container gets started. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42411) Better support for Istio service mesh while running Spark on Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-42411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Puneet updated SPARK-42411: --- Summary: Better support for Istio service mesh while running Spark on Kubernetes (was: Better support for Spark on Kubernetes while using Istio service mesh) > Better support for Istio service mesh while running Spark on Kubernetes > --- > > Key: SPARK-42411 > URL: https://issues.apache.org/jira/browse/SPARK-42411 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 3.2.3 >Reporter: Puneet >Priority: Major > > h3. Support for Strict MTLS > In strict MTLS Peer Authentication Istio requires each pod to be associated > with a service identity (as this allows listeners to use the correct cert and > chain). Without the service identity communication goes through passthrough > cluster which is not permitted in strict mode. Community is still > investigating communication through IPs with strict MTLS > https://github.com/istio/istio/issues/37431#issuecomment-1412831780. Today > Spark backend creates a service record for driver however executor pods > register with pod ip with driver. In this model therefore, TLS handshake > would fail between driver and executor and also between executors. As part of > this jira we want to similarly add service records for the executor pods as > well. This can be achieved by adding an ExecutorServiceFeatureStep similar to > existing DriverServiceFeatureStep > h3. Allowing binding to all IPs > Before Istio 1.10 the istio-proxy sidecar was forwarding outside traffic to > localhost of the pod. Thus if the application container is binding only to > Pod IP the traffic would not be forwarded to it. This was addressed in 1.10 > https://istio.io/latest/blog/2021/upcoming-networking-changes. However the > old behavior is still accessible through disabling the feature flag > PILOT_ENABLE_INBOUND_PASSTHROUGH. 
Request to remove it has had some push back > https://github.com/istio/istio/issues/37642. In current implementation Spark > K8s backend does not allow to pass bind address for driver > https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/DriverServiceFeatureStep.scala#L35 > however as part of this jira we want to allow passing of bind address even > in Kubernetes mode so long as the bind address is 0.0.0.0. This lets user > choose the behavior depending on the state of PILOT_ENABLE_INBOUND_PASSTHROUGH > in her Istio cluster. > h3. Better support for istio-proxy sidecar lifecycle management > In istio-enabled cluster istio-proxy sidecars would be auto-injected to > driver/executor pods. If the application is ephemeral then driver and > executor containers would exit, however istio-proxy container would continue > to run. This causes driver/executor pods to enter NotReady state. As part of > this jira we want ability to run a post stop cleanup after driver/executor > container is completed. Similarly we also want to add support for a pre start > up script, which can ensure for example that istio-sidecar is up before > executor/driver container gets started. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42411) Add support for istio in strict MTLS PeerAuthentication
Puneet created SPARK-42411: -- Summary: Add support for istio in strict MTLS PeerAuthentication Key: SPARK-42411 URL: https://issues.apache.org/jira/browse/SPARK-42411 Project: Spark Issue Type: New Feature Components: Kubernetes Affects Versions: 3.2.3 Reporter: Puneet h3. Support for Strict MTLS In strict MTLS Peer Authentication Istio requires each pod to be associated with a service identity (as this allows listeners to use the correct cert and chain). Without the service identity communication goes through passthrough cluster which is not permitted in strict mode. Community is still investigating communication through IPs with strict MTLS https://github.com/istio/istio/issues/37431#issuecomment-1412831780. Today Spark backend creates a service record for driver however executor pods register with pod ip with driver. In this model therefore, TLS handshake would fail between driver and executor and also between executors. As part of this jira we want to similarly add service records for the executor pods as well. This can be achieved by adding an ExecutorServiceFeatureStep similar to existing DriverServiceFeatureStep h3. Allowing binding to all IPs Before Istio 1.10 the istio-proxy sidecar was forwarding outside traffic to localhost of the pod. Thus if the application container is binding only to Pod IP the traffic would not be forwarded to it. This was addressed in 1.10 https://istio.io/latest/blog/2021/upcoming-networking-changes. However the old behavior is still accessible through disabling the feature flag PILOT_ENABLE_INBOUND_PASSTHROUGH. Request to remove it has had some push back https://github.com/istio/istio/issues/37642. 
In current implementation Spark K8s backend does not allow to pass bind address for driver https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/DriverServiceFeatureStep.scala#L35 however as part of this jira we want to allow passing of bind address even in Kubernetes mode so long as the bind address is 0.0.0.0. This lets the user choose the behavior depending on the state of PILOT_ENABLE_INBOUND_PASSTHROUGH in her Istio cluster. h3. Better support for istio-proxy sidecar lifecycle management In istio-enabled cluster istio-proxy sidecars would be auto-injected to driver/executor pods. If the application is ephemeral then driver and executor containers would exit, however istio-proxy container would continue to run. This causes driver/executor pods to enter NotReady state. As part of this jira we want ability to run a post stop cleanup after driver/executor container is completed. Similarly we also want to add support for a pre start up script, which can ensure for example that istio-sidecar is up before executor/driver container gets started. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
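The bind-address relaxation proposed in this issue — Kubernetes mode accepts a driver bind address only when it is the wildcard — amounts to a one-line validity check. The sketch below is a hypothetical illustration of that rule, not the actual DriverServiceFeatureStep code:

```python
from typing import Optional

# Hypothetical check modeling the proposed rule: in Kubernetes mode a
# driver bind address may be supplied only if it is 0.0.0.0 (all IPs).

def validate_driver_bind_address(bind_address: Optional[str]) -> Optional[str]:
    if bind_address is None:
        return None  # nothing requested; keep today's behavior
    if bind_address == "0.0.0.0":
        # Binding to all IPs keeps inbound traffic reachable even when
        # PILOT_ENABLE_INBOUND_PASSTHROUGH is disabled in the mesh.
        return bind_address
    raise ValueError(
        f"Kubernetes mode would only accept the wildcard bind address, got {bind_address!r}"
    )


print(validate_driver_bind_address("0.0.0.0"))
# 0.0.0.0
```

Rejecting anything other than the wildcard preserves the existing service-based addressing while still letting users opt in to the old Istio forwarding behavior.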
[jira] [Assigned] (SPARK-41963) Different exception message in DataFrame.unpivot
[ https://issues.apache.org/jira/browse/SPARK-41963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-41963: Assignee: Takuya Ueshin (was: Hyukjin Kwon) > Different exception message in DataFrame.unpivot > > > Key: SPARK-41963 > URL: https://issues.apache.org/jira/browse/SPARK-41963 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Takuya Ueshin >Priority: Major > Fix For: 3.4.0 > > > Running {{test_parity_dataframe DataFrameParityTests.test_unpivot_negative}} > fails as below: > {code} > with self.subTest(desc="with no value columns"): > for values in [[], ()]: > with self.subTest(values=values): > with self.assertRaisesRegex( > Exception, # (AnalysisException, > SparkConnectException) > r".*\[UNPIVOT_REQUIRES_VALUE_COLUMNS] At least one > value column " > r"needs to be specified for UNPIVOT, all columns > specified as ids.*", > ): > > df.unpivot("id", values, "var", "val").collect() > E AssertionError: ".*\[UNPIVOT_REQUIRES_VALUE_COLUMNS] > At least one value column needs to be specified for UNPIVOT, all columns > specified as ids.*" does not match "[UNPIVOT_VALUE_DATA_TYPE_MISMATCH] > Unpivot value columns must share a least common type, some types do not: > ["BIGINT" (`int`), "DOUBLE" (`double`), "STRING" (`str`)] > E Plan: 'Unpivot ArraySeq(id#2947L), > List(List(int#2948L), List(double#2949), List(str#2950)), var, [val] > E +- Project [id#2939L AS id#2947L, int#2940L AS > int#2948L, double#2941 AS double#2949, str#2942 AS str#2950] > E +- LocalRelation [id#2939L, int#2940L, > double#2941, str#2942] > E " > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
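The failure quoted in this issue is an error-class mismatch: the parity test expects UNPIVOT_REQUIRES_VALUE_COLUMNS while Spark Connect raises UNPIVOT_VALUE_DATA_TYPE_MISMATCH. The mismatch can be reproduced with plain `re` (both strings abbreviated from the test output above):

```python
import re

# Pattern the test passes to assertRaisesRegex (abbreviated).
expected = r"\[UNPIVOT_REQUIRES_VALUE_COLUMNS\] At least one value column"

# Message actually raised under Spark Connect (abbreviated).
actual = (
    "[UNPIVOT_VALUE_DATA_TYPE_MISMATCH] Unpivot value columns must share "
    "a least common type, some types do not"
)

# assertRaisesRegex matches with re.search; no match here, so the test fails.
print(re.search(expected, actual))
# None
```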
[jira] [Commented] (SPARK-41715) Catch specific exceptions for both Spark Connect and PySpark
[ https://issues.apache.org/jira/browse/SPARK-41715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687655#comment-17687655 ] Apache Spark commented on SPARK-41715: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/39983 > Catch specific exceptions for both Spark Connect and PySpark > > > Key: SPARK-41715 > URL: https://issues.apache.org/jira/browse/SPARK-41715 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 3.4.0 > > > In python/pyspark/sql/tests/test_catalog.py, we should catch more specific > exceptions such as AnalysisException. The test is shared in both Spark > Connect and PySpark so we should figure out a way to share it. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41963) Different exception message in DataFrame.unpivot
[ https://issues.apache.org/jira/browse/SPARK-41963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-41963: Assignee: Hyukjin Kwon > Different exception message in DataFrame.unpivot > > > Key: SPARK-41963 > URL: https://issues.apache.org/jira/browse/SPARK-41963 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > Running {{test_parity_dataframe DataFrameParityTests.test_unpivot_negative}} > fails as below: > {code} > with self.subTest(desc="with no value columns"): > for values in [[], ()]: > with self.subTest(values=values): > with self.assertRaisesRegex( > Exception, # (AnalysisException, > SparkConnectException) > r".*\[UNPIVOT_REQUIRES_VALUE_COLUMNS] At least one > value column " > r"needs to be specified for UNPIVOT, all columns > specified as ids.*", > ): > > df.unpivot("id", values, "var", "val").collect() > E AssertionError: ".*\[UNPIVOT_REQUIRES_VALUE_COLUMNS] > At least one value column needs to be specified for UNPIVOT, all columns > specified as ids.*" does not match "[UNPIVOT_VALUE_DATA_TYPE_MISMATCH] > Unpivot value columns must share a least common type, some types do not: > ["BIGINT" (`int`), "DOUBLE" (`double`), "STRING" (`str`)] > E Plan: 'Unpivot ArraySeq(id#2947L), > List(List(int#2948L), List(double#2949), List(str#2950)), var, [val] > E +- Project [id#2939L AS id#2947L, int#2940L AS > int#2948L, double#2941 AS double#2949, str#2942 AS str#2950] > E +- LocalRelation [id#2939L, int#2940L, > double#2941, str#2942] > E " > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41963) Different exception message in DataFrame.unpivot
[ https://issues.apache.org/jira/browse/SPARK-41963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-41963. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39960 [https://github.com/apache/spark/pull/39960] > Different exception message in DataFrame.unpivot > > > Key: SPARK-41963 > URL: https://issues.apache.org/jira/browse/SPARK-41963 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.4.0 > > > Running {{test_parity_dataframe DataFrameParityTests.test_unpivot_negative}} > fails as below: > {code} > with self.subTest(desc="with no value columns"): > for values in [[], ()]: > with self.subTest(values=values): > with self.assertRaisesRegex( > Exception, # (AnalysisException, > SparkConnectException) > r".*\[UNPIVOT_REQUIRES_VALUE_COLUMNS] At least one > value column " > r"needs to be specified for UNPIVOT, all columns > specified as ids.*", > ): > > df.unpivot("id", values, "var", "val").collect() > E AssertionError: ".*\[UNPIVOT_REQUIRES_VALUE_COLUMNS] > At least one value column needs to be specified for UNPIVOT, all columns > specified as ids.*" does not match "[UNPIVOT_VALUE_DATA_TYPE_MISMATCH] > Unpivot value columns must share a least common type, some types do not: > ["BIGINT" (`int`), "DOUBLE" (`double`), "STRING" (`str`)] > E Plan: 'Unpivot ArraySeq(id#2947L), > List(List(int#2948L), List(double#2949), List(str#2950)), var, [val] > E +- Project [id#2939L AS id#2947L, int#2940L AS > int#2948L, double#2941 AS double#2949, str#2942 AS str#2950] > E +- LocalRelation [id#2939L, int#2940L, > double#2941, str#2942] > E " > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41715) Catch specific exceptions for both Spark Connect and PySpark
[ https://issues.apache.org/jira/browse/SPARK-41715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687653#comment-17687653 ] Apache Spark commented on SPARK-41715: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/39983 > Catch specific exceptions for both Spark Connect and PySpark > > > Key: SPARK-41715 > URL: https://issues.apache.org/jira/browse/SPARK-41715 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 3.4.0 > > > In python/pyspark/sql/tests/test_catalog.py, we should catch more specific > exceptions such as AnalysisException. The test is shared in both Spark > Connect and PySpark so we should figure out a way to share it. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40453) Improve error handling for GRPC server
[ https://issues.apache.org/jira/browse/SPARK-40453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687651#comment-17687651 ] Apache Spark commented on SPARK-40453: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/39983 > Improve error handling for GRPC server > -- > > Key: SPARK-40453 > URL: https://issues.apache.org/jira/browse/SPARK-40453 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.2.2 >Reporter: Martin Grund >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.4.0 > > > Right now the errors are handled very rudimentarily and do not produce proper > GRPC errors. This issue addresses the work needed to return proper errors. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41715) Catch specific exceptions for both Spark Connect and PySpark
[ https://issues.apache.org/jira/browse/SPARK-41715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687652#comment-17687652 ] Apache Spark commented on SPARK-41715: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/39983 > Catch specific exceptions for both Spark Connect and PySpark > > > Key: SPARK-41715 > URL: https://issues.apache.org/jira/browse/SPARK-41715 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 3.4.0 > > > In python/pyspark/sql/tests/test_catalog.py, we should catch more specific > exceptions such as AnalysisException. The test is shared in both Spark > Connect and PySpark so we should figure out a way to share it. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40453) Improve error handling for GRPC server
[ https://issues.apache.org/jira/browse/SPARK-40453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687650#comment-17687650 ] Apache Spark commented on SPARK-40453: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/39983 > Improve error handling for GRPC server > -- > > Key: SPARK-40453 > URL: https://issues.apache.org/jira/browse/SPARK-40453 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.2.2 >Reporter: Martin Grund >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.4.0 > > > Right now the errors are handled very rudimentarily and do not produce proper > GRPC errors. This issue addresses the work needed to return proper errors. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42410) Support Scala 2.12/2.13 tests in connect module
[ https://issues.apache.org/jira/browse/SPARK-42410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687643#comment-17687643 ] Apache Spark commented on SPARK-42410: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/39982 > Support Scala 2.12/2.13 tests in connect module > --- > > Key: SPARK-42410 > URL: https://issues.apache.org/jira/browse/SPARK-42410 > Project: Spark > Issue Type: Bug > Components: Connect, Tests >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Major > > {code} > $ build/sbt -Dscala.version=2.13.8 -Pscala-2.13 -Phadoop-3 assembly/package > "connect/test" > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42410) Support Scala 2.12/2.13 tests in connect module
[ https://issues.apache.org/jira/browse/SPARK-42410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42410: Assignee: Apache Spark > Support Scala 2.12/2.13 tests in connect module > --- > > Key: SPARK-42410 > URL: https://issues.apache.org/jira/browse/SPARK-42410 > Project: Spark > Issue Type: Bug > Components: Connect, Tests >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major > > {code} > $ build/sbt -Dscala.version=2.13.8 -Pscala-2.13 -Phadoop-3 assembly/package > "connect/test" > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42410) Support Scala 2.12/2.13 tests in connect module
[ https://issues.apache.org/jira/browse/SPARK-42410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42410: Assignee: (was: Apache Spark) > Support Scala 2.12/2.13 tests in connect module > --- > > Key: SPARK-42410 > URL: https://issues.apache.org/jira/browse/SPARK-42410 > Project: Spark > Issue Type: Bug > Components: Connect, Tests >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Major > > {code} > $ build/sbt -Dscala.version=2.13.8 -Pscala-2.13 -Phadoop-3 assembly/package > "connect/test" > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42410) Support Scala 2.12/2.13 tests in connect module
[ https://issues.apache.org/jira/browse/SPARK-42410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687642#comment-17687642 ] Apache Spark commented on SPARK-42410: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/39982 > Support Scala 2.12/2.13 tests in connect module > --- > > Key: SPARK-42410 > URL: https://issues.apache.org/jira/browse/SPARK-42410 > Project: Spark > Issue Type: Bug > Components: Connect, Tests >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Major > > {code} > $ build/sbt -Dscala.version=2.13.8 -Pscala-2.13 -Phadoop-3 assembly/package > "connect/test" > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42410) Support Scala 2.12/2.13 tests in connect module
Dongjoon Hyun created SPARK-42410: - Summary: Support Scala 2.12/2.13 tests in connect module Key: SPARK-42410 URL: https://issues.apache.org/jira/browse/SPARK-42410 Project: Spark Issue Type: Bug Components: Connect, Tests Affects Versions: 3.4.0 Reporter: Dongjoon Hyun {code} $ build/sbt -Dscala.version=2.13.8 -Pscala-2.13 -Phadoop-3 assembly/package "connect/test" {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42405) Better documentation of array_insert function
[ https://issues.apache.org/jira/browse/SPARK-42405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42405: Assignee: (was: Apache Spark) > Better documentation of array_insert function > - > > Key: SPARK-42405 > URL: https://issues.apache.org/jira/browse/SPARK-42405 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 3.4.0 >Reporter: Daniel Davies >Priority: Trivial > > See the following thread for discussion: > https://github.com/apache/spark/pull/38867#discussion_r1097054656 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42405) Better documentation of array_insert function
[ https://issues.apache.org/jira/browse/SPARK-42405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687637#comment-17687637 ] Apache Spark commented on SPARK-42405: -- User 'Daniel-Davies' has created a pull request for this issue: https://github.com/apache/spark/pull/39975 > Better documentation of array_insert function > - > > Key: SPARK-42405 > URL: https://issues.apache.org/jira/browse/SPARK-42405 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 3.4.0 >Reporter: Daniel Davies >Priority: Trivial > > See the following thread for discussion: > https://github.com/apache/spark/pull/38867#discussion_r1097054656 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42405) Better documentation of array_insert function
[ https://issues.apache.org/jira/browse/SPARK-42405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42405: Assignee: Apache Spark > Better documentation of array_insert function > - > > Key: SPARK-42405 > URL: https://issues.apache.org/jira/browse/SPARK-42405 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 3.4.0 >Reporter: Daniel Davies >Assignee: Apache Spark >Priority: Trivial > > See the following thread for discussion: > https://github.com/apache/spark/pull/38867#discussion_r1097054656 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42409) Upgrade ZSTD-JNI to 1.5.4-1
[ https://issues.apache.org/jira/browse/SPARK-42409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-42409: - Assignee: Yang Jie > Upgrade ZSTD-JNI to 1.5.4-1 > --- > > Key: SPARK-42409 > URL: https://issues.apache.org/jira/browse/SPARK-42409 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42409) Upgrade ZSTD-JNI to 1.5.4-1
[ https://issues.apache.org/jira/browse/SPARK-42409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-42409. --- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 39981 [https://github.com/apache/spark/pull/39981] > Upgrade ZSTD-JNI to 1.5.4-1 > --- > > Key: SPARK-42409 > URL: https://issues.apache.org/jira/browse/SPARK-42409 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42401) Incorrect results or NPE when inserting null value into array using array_insert/array_append
[ https://issues.apache.org/jira/browse/SPARK-42401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-42401: -- Summary: Incorrect results or NPE when inserting null value into array using array_insert/array_append (was: Incorrect results or NPE when inserting null value using array_insert/array_append) > Incorrect results or NPE when inserting null value into array using > array_insert/array_append > - > > Key: SPARK-42401 > URL: https://issues.apache.org/jira/browse/SPARK-42401 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0, 3.5.0 >Reporter: Bruce Robbins >Priority: Major > Labels: correctness > > Example: > {noformat} > create or replace temp view v1 as > select * from values > (array(1, 2, 3, 4), 5, 5), > (array(1, 2, 3, 4), 5, null) > as v1(col1,col2,col3); > select array_insert(col1, col2, col3) from v1; > {noformat} > This produces an incorrect result: > {noformat} > [1,2,3,4,5] > [1,2,3,4,0] <== should be [1,2,3,4,null] > {noformat} > A more succinct example: > {noformat} > select array_insert(array(1, 2, 3, 4), 5, cast(null as int)); > {noformat} > This also produces an incorrect result: > {noformat} > [1,2,3,4,0] <== should be [1,2,3,4,null] > {noformat} > Another example: > {noformat} > create or replace temp view v1 as > select * from values > (array('1', '2', '3', '4'), 5, '5'), > (array('1', '2', '3', '4'), 5, null) > as v1(col1,col2,col3); > select array_insert(col1, col2, col3) from v1; > {noformat} > The above query throws a {{NullPointerException}}: > {noformat} > 23/02/10 11:08:05 ERROR SparkSQLDriver: Failed in [select array_insert(col1, > col2, col3) from v1] > java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > at > 
org.apache.spark.sql.execution.LocalTableScanExec.$anonfun$unsafeRows$1(LocalTableScanExec.scala:44) > {noformat} > {{array_append}} has the same issue: > {noformat} > spark-sql> select array_append(array(1, 2, 3, 4), cast(null as int)); > [1,2,3,4,0] <== should be [1,2,3,4,null] > Time taken: 3.679 seconds, Fetched 1 row(s) > spark-sql> select array_append(array('1', '2', '3', '4'), cast(null as > string)); > 23/02/10 11:13:36 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1) > java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.project_doConsume_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > {noformat}
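The behaviour the report calls for — an inserted or appended null must survive as a null element rather than be coerced to the type's zero value — can be modelled in plain Python. This is a hypothetical sketch of the intended semantics (positive, 1-based positions only), not Spark's implementation:

```python
def array_insert(arr, pos, value):
    """Minimal model of SQL array_insert with 1-based positive positions.

    A None value must be preserved as None, never replaced by 0."""
    if pos <= 0:
        raise ValueError("this sketch models positive 1-based positions only")
    out = list(arr)
    idx = pos - 1
    while len(out) < idx:  # pad with nulls when inserting past the end
        out.append(None)
    out.insert(idx, value)
    return out


def array_append(arr, value):
    """Minimal model of SQL array_append: an appended null stays null."""
    return list(arr) + [value]


print(array_insert([1, 2, 3, 4], 5, None))  # [1, 2, 3, 4, None], not [1, 2, 3, 4, 0]
print(array_append(["1", "2", "3", "4"], None))
```

The examples in the report correspond to the first call above: the expected result carries the null through instead of substituting the element type's default.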
[jira] [Resolved] (SPARK-42400) Code clean up in org.apache.spark.storage
[ https://issues.apache.org/jira/browse/SPARK-42400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-42400. -- Fix Version/s: 3.5.0 Assignee: Khalid Mammadov Resolution: Fixed Resolved by https://github.com/apache/spark/pull/39932 > Code clean up in org.apache.spark.storage > - > > Key: SPARK-42400 > URL: https://issues.apache.org/jira/browse/SPARK-42400 > Project: Spark > Issue Type: Improvement > Components: Block Manager >Affects Versions: 3.4.0 >Reporter: Khalid Mammadov >Assignee: Khalid Mammadov >Priority: Trivial > Fix For: 3.5.0 > >
[jira] [Updated] (SPARK-42400) Code clean up in org.apache.spark.storage
[ https://issues.apache.org/jira/browse/SPARK-42400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-42400: - Priority: Trivial (was: Major) > Code clean up in org.apache.spark.storage > - > > Key: SPARK-42400 > URL: https://issues.apache.org/jira/browse/SPARK-42400 > Project: Spark > Issue Type: Improvement > Components: Block Manager >Affects Versions: 3.4.0 >Reporter: Khalid Mammadov >Priority: Trivial >
[jira] [Resolved] (SPARK-42312) Assign name to _LEGACY_ERROR_TEMP_0042
[ https://issues.apache.org/jira/browse/SPARK-42312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-42312. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39951 [https://github.com/apache/spark/pull/39951] > Assign name to _LEGACY_ERROR_TEMP_0042 > -- > > Key: SPARK-42312 > URL: https://issues.apache.org/jira/browse/SPARK-42312 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > Fix For: 3.4.0 > >
[jira] [Assigned] (SPARK-42312) Assign name to _LEGACY_ERROR_TEMP_0042
[ https://issues.apache.org/jira/browse/SPARK-42312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-42312: Assignee: Haejoon Lee > Assign name to _LEGACY_ERROR_TEMP_0042 > -- > > Key: SPARK-42312 > URL: https://issues.apache.org/jira/browse/SPARK-42312 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major >
[jira] [Commented] (SPARK-40678) JSON conversion of ArrayType is not properly supported in Spark 3.2/2.13
[ https://issues.apache.org/jira/browse/SPARK-40678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687573#comment-17687573 ] Wei Guo commented on SPARK-40678: - Fixed by PR 38154 https://github.com/apache/spark/pull/38154 > JSON conversion of ArrayType is not properly supported in Spark 3.2/2.13 > > > Key: SPARK-40678 > URL: https://issues.apache.org/jira/browse/SPARK-40678 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 3.2.0 >Reporter: Cédric Chantepie >Priority: Major > > In Spark 3.2 (Scala 2.13), values with {{ArrayType}} are no longer properly > supported with JSON; e.g. > {noformat} > import org.apache.spark.sql.SparkSession > case class KeyValue(key: String, value: Array[Byte]) > val spark = > SparkSession.builder().master("local[1]").appName("test").getOrCreate() > import spark.implicits._ > val df = Seq(Array(KeyValue("foo", "bar".getBytes))).toDF() > df.foreach(r => println(r.json)) > {noformat} > Expected: > {noformat} > [{foo, bar}] > {noformat} > Encountered: > {noformat} > java.lang.IllegalArgumentException: Failed to convert value > ArraySeq([foo,[B@dcdb68f]) (class of class > scala.collection.mutable.ArraySeq$ofRef}) with the type of > ArrayType(Seq(StructField(key,StringType,false), > StructField(value,BinaryType,false)),true) to JSON. > at org.apache.spark.sql.Row.toJson$1(Row.scala:604) > at org.apache.spark.sql.Row.jsonValue(Row.scala:613) > at org.apache.spark.sql.Row.jsonValue$(Row.scala:552) > at > org.apache.spark.sql.catalyst.expressions.GenericRow.jsonValue(rows.scala:166) > at org.apache.spark.sql.Row.json(Row.scala:535) > at org.apache.spark.sql.Row.json$(Row.scala:535) > at > org.apache.spark.sql.catalyst.expressions.GenericRow.json(rows.scala:166) > {noformat}
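The failure above is likely a type-dispatch problem: in Scala 2.13 the default `Seq` alias is `scala.collection.immutable.Seq`, which `mutable.ArraySeq` does not extend, so array values fell through to the error case; matching the broader `scala.collection.Seq` covers both. The general idea can be sketched in Python (a hypothetical analogue, not Spark's code): dispatch on the abstract `Sequence` interface rather than one concrete type:

```python
import json
from collections.abc import Sequence

def to_json_value(value):
    """Recursively convert a row value into a JSON-serialisable structure.

    Checking isinstance(value, Sequence) -- the abstract interface -- instead
    of isinstance(value, list) mirrors matching the broader collection type:
    tuples, arrays, and other sequence flavours all take the same branch."""
    if isinstance(value, bytes):
        return value.decode("utf-8")  # render binary as text for this example
    if isinstance(value, dict):       # struct-like values
        return {k: to_json_value(v) for k, v in value.items()}
    if isinstance(value, Sequence) and not isinstance(value, str):
        return [to_json_value(v) for v in value]
    return value

# A tuple stands in for the ArraySeq that a concrete-type check would miss.
row_value = ({"key": "foo", "value": b"bar"},)
print(json.dumps(to_json_value(row_value)))  # [{"key": "foo", "value": "bar"}]
```

Had the dispatch tested only `isinstance(value, list)`, the tuple would fall through to the fallback case, analogous to the `IllegalArgumentException` in the report.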
[jira] [Assigned] (SPARK-42409) Upgrade ZSTD-JNI to 1.5.4-1
[ https://issues.apache.org/jira/browse/SPARK-42409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42409: Assignee: (was: Apache Spark) > Upgrade ZSTD-JNI to 1.5.4-1 > --- > > Key: SPARK-42409 > URL: https://issues.apache.org/jira/browse/SPARK-42409 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Yang Jie >Priority: Major >