[jira] [Assigned] (SPARK-42418) Updating PySpark documentation to support new users better
[ https://issues.apache.org/jira/browse/SPARK-42418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-42418:
------------------------------------

    Assignee: Apache Spark

> Updating PySpark documentation to support new users better
> ----------------------------------------------------------
>
>                 Key: SPARK-42418
>                 URL: https://issues.apache.org/jira/browse/SPARK-42418
>             Project: Spark
>          Issue Type: Documentation
>          Components: PySpark
>    Affects Versions: 3.4.0
>            Reporter: Allan Folting
>            Assignee: Apache Spark
>            Priority: Major
>
> This is the first of a series of updates to the PySpark documentation site to
> better guide new users on what to use and when, as well as to improve the
> discoverability of related pages/resources:
> * Add "Overview" to the top navigation bar to make it easy to get back to
>   the main page (clicking the logo is not very discoverable)
> * Break the architecture image into separate, clickable parts for easy
>   navigation to information about each part
> * Add links to related topics under each area description
> * Add the date and version to the page

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42418) Updating PySpark documentation to support new users better
[ https://issues.apache.org/jira/browse/SPARK-42418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-42418:
------------------------------------

    Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-42418) Updating PySpark documentation to support new users better
[ https://issues.apache.org/jira/browse/SPARK-42418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687775#comment-17687775 ]

Apache Spark commented on SPARK-42418:
--------------------------------------

User 'allanf-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/39992
[jira] [Commented] (SPARK-42193) dataframe API filter criteria throwing ParseException when reading a JDBC column name with special characters
[ https://issues.apache.org/jira/browse/SPARK-42193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687770#comment-17687770 ]

Dongjoon Hyun commented on SPARK-42193:
---------------------------------------

+1 for [~maxgekk]'s assessment.

> dataframe API filter criteria throwing ParseException when reading a JDBC
> column name with special characters
> --------------------------------------------------------------------------
>
>                 Key: SPARK-42193
>                 URL: https://issues.apache.org/jira/browse/SPARK-42193
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Shanmugavel Kuttiyandi Chandrakasu
>            Priority: Minor
>
> *On Spark 3.3.0,* when reading from a JDBC table (SQLite was used to repro)
> with the spark.read.jdbc command and sqlite-jdbc:3.34.0.jar, on a table and
> column name containing special characters, the DataFrame API filter criteria
> fails with a ParseException.
> *Script:*
> {code:java}
> from pyspark.sql import SparkSession
> spark = SparkSession \
>     .builder \
>     .appName("Databricks Support") \
>     .config("spark.jars.packages", "org.xerial:sqlite-jdbc:3.34.0") \
>     .getOrCreate()
> columns = ["id", "/abc/column", "value"]
> data = [(1, 'A', 100), (2, 'B', 200), (3, 'B', 300)]
> rdd = spark.sparkContext.parallelize(data)
> df = spark.createDataFrame(rdd).toDF(*columns)
> options = {"url": "jdbc:sqlite://spark-3.3.1-bin-hadoop3/jars/test.db",
>            "dbtable": '"/abc/table"', "driver": "org.sqlite.JDBC"}
> df.coalesce(1).write.format("jdbc").options(**options).mode("append").save()
> df_1 = spark.read.format("jdbc") \
>     .option("url", "jdbc:sqlite://spark-3.3.1-bin-hadoop3/jars/test.db") \
>     .option("dbtable", '"/abc/table"') \
>     .option("driver", "org.sqlite.JDBC") \
>     .load()
> df_2 = df_1.filter("`/abc/column` = 'B'")
> df_2.show() {code}
> Error:
> {code:java}
> Traceback (most recent call last):
>   File "", line 1, in
>   File "/opt/homebrew/Cellar/apache-spark/3.3.1/libexec/python/pyspark/sql/dataframe.py", line 606, in show
>     print(self._jdf.showString(n, 20, vertical))
>   File "/opt/homebrew/Cellar/apache-spark/3.3.1/libexec/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
>   File "/opt/homebrew/Cellar/apache-spark/3.3.1/libexec/python/pyspark/sql/utils.py", line 196, in deco
>     raise converted from None
> pyspark.sql.utils.ParseException:
> Syntax error at or near '/': extra input '/'(line 1, pos 0)
> == SQL ==
> /abc/column
> ^^^ {code}
> However, when using Spark 3.2.1, we are able to successfully apply the
> DataFrame filter:
> {code:java}
> >>> df_2.show()
> +---+-----------+-----+
> | id|/abc/column|value|
> +---+-----------+-----+
> |  2|          B|  200|
> |  3|          B|  300|
> +---+-----------+-----+ {code}
> *Repro steps:*
> # Download [Spark 3.2.1 locally|https://archive.apache.org/dist/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz]
> # Download and copy sqlite-jdbc:3.34.0.jar into the jars folder of the local Spark download
> # Run the above script, providing the jar path
> # This creates */abc/table* with column */abc/column* and returns a result when the filter criteria is applied
> # Download [Spark 3.3.0 locally|https://www.apache.org/dyn/closer.lua/spark/spark-3.3.1/spark-3.3.1-bin-hadoop3.tgz]
> # Repeat steps 2 and 3
> # This fails with a ParseException.
> Could you please let us know how we can filter on the special-character
> column, or escape it, on Spark 3.3.0?
[jira] [Updated] (SPARK-42193) dataframe API filter criteria throwing ParseException when reading a JDBC column name with special characters
[ https://issues.apache.org/jira/browse/SPARK-42193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-42193:
---------------------------------
    Component/s: SQL
                 (was: Spark Core)
[jira] [Resolved] (SPARK-42193) dataframe API filter criteria throwing ParseException when reading a JDBC column name with special characters
[ https://issues.apache.org/jira/browse/SPARK-42193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-42193.
----------------------------------
    Resolution: Cannot Reproduce
[jira] [Commented] (SPARK-42227) Use approx_percentile function running slower in spark3 than spark2
[ https://issues.apache.org/jira/browse/SPARK-42227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687765#comment-17687765 ]

Hyukjin Kwon commented on SPARK-42227:
--------------------------------------

How much faster is it?

> Use approx_percentile function running slower in spark3 than spark2
> -------------------------------------------------------------------
>
>                 Key: SPARK-42227
>                 URL: https://issues.apache.org/jira/browse/SPARK-42227
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.2.1
>            Reporter: xuanzhiang
>            Priority: Major
>
> approx_percentile(end_ts - start_ts, 0.9) AS cost_p90
> In Spark 3 this uses the objectHashAggregate method, but the shuffle is very
> slow. When I use percentile instead, it becomes fast. I don't know the
> reason; I would expect approx_percentile to be faster.
[jira] [Commented] (SPARK-42293) why executor memory used is shown greater than total available memory on spark ui
[ https://issues.apache.org/jira/browse/SPARK-42293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687763#comment-17687763 ]

Hyukjin Kwon commented on SPARK-42293:
--------------------------------------

[~handong] mind sharing a reproducer if you have one?

> why executor memory used is shown greater than total available memory on
> spark ui
> -------------------------------------------------------------------------
>
>                 Key: SPARK-42293
>                 URL: https://issues.apache.org/jira/browse/SPARK-42293
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.5
>            Reporter: handong
>            Priority: Major
>
> *I have a Spark Streaming job that has been running for around the last 3
> weeks. When I open the Executors tab on the Spark web UI, it shows:*
> # memory used - 36.1 GB
> # total available memory for storage - 3.2 GB
> *Please refer to the screenshot of the Spark UI below:*
> !https://i.stack.imgur.com/nmk39.jpg!
[jira] [Commented] (SPARK-42419) Migrate `TypeError` into error framework for Spark Connect column API.
[ https://issues.apache.org/jira/browse/SPARK-42419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687764#comment-17687764 ]

Apache Spark commented on SPARK-42419:
--------------------------------------

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/39991

> Migrate `TypeError` into error framework for Spark Connect column API.
> ----------------------------------------------------------------------
>
>                 Key: SPARK-42419
>                 URL: https://issues.apache.org/jira/browse/SPARK-42419
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect, PySpark
>    Affects Versions: 3.4.0
>            Reporter: Haejoon Lee
>            Priority: Major
>
> We should migrate all errors into the PySpark error framework.
[jira] [Assigned] (SPARK-42419) Migrate `TypeError` into error framework for Spark Connect column API.
[ https://issues.apache.org/jira/browse/SPARK-42419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-42419:
------------------------------------

    Assignee: Apache Spark
[jira] [Commented] (SPARK-42419) Migrate `TypeError` into error framework for Spark Connect column API.
[ https://issues.apache.org/jira/browse/SPARK-42419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687762#comment-17687762 ]

Apache Spark commented on SPARK-42419:
--------------------------------------

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/39991
[jira] [Assigned] (SPARK-42419) Migrate `TypeError` into error framework for Spark Connect column API.
[ https://issues.apache.org/jira/browse/SPARK-42419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-42419:
------------------------------------

    Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-42387) Avoid unnecessary parquet footer reads when no filters
[ https://issues.apache.org/jira/browse/SPARK-42387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687760#comment-17687760 ]

Hyukjin Kwon commented on SPARK-42387:
--------------------------------------

[~miracle] mind filling in the JIRA description, please?

> Avoid unnecessary parquet footer reads when no filters
> ------------------------------------------------------
>
>                 Key: SPARK-42387
>                 URL: https://issues.apache.org/jira/browse/SPARK-42387
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: Mars
>            Priority: Major
[jira] [Commented] (SPARK-42397) Inconsistent data produced by `FlatMapCoGroupsInPandas`
[ https://issues.apache.org/jira/browse/SPARK-42397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687759#comment-17687759 ]

Hyukjin Kwon commented on SPARK-42397:
--------------------------------------

It's probably related to ordering, which Spark doesn't guarantee. Are the
actual values different?

> Inconsistent data produced by `FlatMapCoGroupsInPandas`
> -------------------------------------------------------
>
>                 Key: SPARK-42397
>                 URL: https://issues.apache.org/jira/browse/SPARK-42397
>             Project: Spark
>          Issue Type: Bug
>          Components: Pandas API on Spark, SQL
>    Affects Versions: 3.3.0, 3.3.1
>            Reporter: Ted Chester Jenks
>            Priority: Minor
>
> We are seeing inconsistent data returned when using
> `FlatMapCoGroupsInPandas`. In the PySpark example from the comments, when we
> call `grouped_df.collect()` we get:
>
> {{[Row(left_colms="Index(['cluster', 'event', 'abc'], dtype='object')", right_colms="Index(['cluster', 'event', 'def'], dtype='object')")]}}
>
> When we call `grouped_df.show(5, truncate=False)` we get:
>
> {{[Row(left_colms="Index(['cluster', 'abc'], dtype='object')", right_colms="Index(['cluster', 'event', 'def'], dtype='object')", xyz='1234')]}}
>
> When we call `grouped_df_1.collect()` we get:
>
> {{[Row(left_colms="Index(['cluster', 'abc'], dtype='object')", right_colms="Index(['cluster', 'event', 'def'], dtype='object')", xyz='1234')]}}
[jira] [Updated] (SPARK-42399) CONV() silently overflows returning wrong results
[ https://issues.apache.org/jira/browse/SPARK-42399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-42399:
---------------------------------
    Component/s: SQL
                 (was: Spark Core)

> CONV() silently overflows returning wrong results
> -------------------------------------------------
>
>                 Key: SPARK-42399
>                 URL: https://issues.apache.org/jira/browse/SPARK-42399
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: Serge Rielau
>            Priority: Critical
>
> {code}
> spark-sql> SELECT CONV(SUBSTRING('0x', 3), 16, 10);
> 18446744073709551615
> Time taken: 2.114 seconds, Fetched 1 row(s)
> spark-sql> set spark.sql.ansi.enabled = true;
> spark.sql.ansi.enabled true
> Time taken: 0.068 seconds, Fetched 1 row(s)
> spark-sql> SELECT CONV(SUBSTRING('0x', 3), 16, 10);
> 18446744073709551615
> Time taken: 0.05 seconds, Fetched 1 row(s)
> {code}
> In ANSI mode we should raise an error for sure.
> In non-ANSI mode either an error or a NULL may be acceptable.
> Alternatively, of course, we could consider whether we can support arbitrary
> domains, since the result is a STRING again.
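The value in the report is exactly the unsigned 64-bit maximum, which is consistent with CONV doing its arithmetic in an unsigned 64-bit integer and appearing to saturate on overflow. A minimal plain-Python sketch of that arithmetic (an illustration only, not Spark's implementation; `saturate_u64` is a hypothetical helper):

```python
# CONV appears to work in an unsigned 64-bit domain, so any hex input wider
# than 16 digits no longer fits and the result pins at 2**64 - 1.
UINT64_MAX = 2**64 - 1

# The value CONV returned in the report above:
assert UINT64_MAX == 18446744073709551615

# A 17-digit hex string of F's exceeds the unsigned 64-bit range:
seventeen_fs = int("F" * 17, 16)
assert seventeen_fs > UINT64_MAX

# Saturating (rather than raising or returning NULL) silently maps every
# overflowing input to the same maximum value -- hence "wrong results".
def saturate_u64(n: int) -> int:
    return min(n, UINT64_MAX)

assert saturate_u64(seventeen_fs) == 18446744073709551615
```

This is why the ticket argues for an error in ANSI mode: saturation loses information with no indication to the caller.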
[jira] [Assigned] (SPARK-42401) Incorrect results or NPE when inserting null value into array using array_insert/array_append
[ https://issues.apache.org/jira/browse/SPARK-42401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-42401:
------------------------------------

    Assignee: Bruce Robbins

> Incorrect results or NPE when inserting null value into array using
> array_insert/array_append
> ---------------------------------------------------------------------
>
>                 Key: SPARK-42401
>                 URL: https://issues.apache.org/jira/browse/SPARK-42401
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.4.0, 3.5.0
>            Reporter: Bruce Robbins
>            Assignee: Bruce Robbins
>            Priority: Major
>              Labels: correctness
>
> Example:
> {noformat}
> create or replace temp view v1 as
> select * from values
> (array(1, 2, 3, 4), 5, 5),
> (array(1, 2, 3, 4), 5, null)
> as v1(col1,col2,col3);
> select array_insert(col1, col2, col3) from v1;
> {noformat}
> This produces an incorrect result:
> {noformat}
> [1,2,3,4,5]
> [1,2,3,4,0] <== should be [1,2,3,4,null]
> {noformat}
> A more succinct example:
> {noformat}
> select array_insert(array(1, 2, 3, 4), 5, cast(null as int));
> {noformat}
> This also produces an incorrect result:
> {noformat}
> [1,2,3,4,0] <== should be [1,2,3,4,null]
> {noformat}
> Another example:
> {noformat}
> create or replace temp view v1 as
> select * from values
> (array('1', '2', '3', '4'), 5, '5'),
> (array('1', '2', '3', '4'), 5, null)
> as v1(col1,col2,col3);
> select array_insert(col1, col2, col3) from v1;
> {noformat}
> The above query throws a {{NullPointerException}}:
> {noformat}
> 23/02/10 11:08:05 ERROR SparkSQLDriver: Failed in [select array_insert(col1, col2, col3) from v1]
> java.lang.NullPointerException
>     at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
>     at org.apache.spark.sql.execution.LocalTableScanExec.$anonfun$unsafeRows$1(LocalTableScanExec.scala:44)
> {noformat}
> {{array_append}} has the same issue:
> {noformat}
> spark-sql> select array_append(array(1, 2, 3, 4), cast(null as int));
> [1,2,3,4,0] <== should be [1,2,3,4,null]
> Time taken: 3.679 seconds, Fetched 1 row(s)
> spark-sql> select array_append(array('1', '2', '3', '4'), cast(null as string));
> 23/02/10 11:13:36 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.NullPointerException
>     at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.project_doConsume_0$(Unknown Source)
>     at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
> {noformat}
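The NULL-preserving behavior the ticket expects can be sketched in plain Python (an illustration of the intended semantics only, not Spark's code; `None` stands in for SQL NULL, and negative-position handling is omitted):

```python
# Sketch of what array_insert should do with a null value: keep the null,
# never coerce it to the element type's zero value. Positions are 1-based,
# as in Spark's array_insert.
def array_insert(arr, pos, value):
    out = list(arr)
    out.insert(pos - 1, value)  # None (NULL) is inserted as-is
    return out

assert array_insert([1, 2, 3, 4], 5, 5) == [1, 2, 3, 4, 5]
# The buggy build returned [1, 2, 3, 4, 0] for this case:
assert array_insert([1, 2, 3, 4], 5, None) == [1, 2, 3, 4, None]
```

The int case in the bug report produced a zero because the null flag was dropped before the element was written, so the writer emitted the type's default value instead of a null entry; the string case hit an NPE for the analogous reason.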
[jira] [Resolved] (SPARK-42401) Incorrect results or NPE when inserting null value into array using array_insert/array_append
[ https://issues.apache.org/jira/browse/SPARK-42401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-42401.
----------------------------------
    Fix Version/s: 3.4.0
       Resolution: Fixed

Issue resolved by pull request 39970
[https://github.com/apache/spark/pull/39970]
[jira] [Created] (SPARK-42419) Migrate `TypeError` into error framework for Spark Connect column API.
Haejoon Lee created SPARK-42419:
-----------------------------------

             Summary: Migrate `TypeError` into error framework for Spark Connect column API.
                 Key: SPARK-42419
                 URL: https://issues.apache.org/jira/browse/SPARK-42419
             Project: Spark
          Issue Type: Sub-task
          Components: Connect, PySpark
    Affects Versions: 3.4.0
            Reporter: Haejoon Lee

We should migrate all errors into the PySpark error framework.
[jira] [Commented] (SPARK-42258) pyspark.sql.functions should not expose typing.cast
[ https://issues.apache.org/jira/browse/SPARK-42258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687757#comment-17687757 ]

Hyukjin Kwon commented on SPARK-42258:
--------------------------------------

Good point. Are you interested in submitting a PR?

> pyspark.sql.functions should not expose typing.cast
> ---------------------------------------------------
>
>                 Key: SPARK-42258
>                 URL: https://issues.apache.org/jira/browse/SPARK-42258
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 3.3.1
>            Reporter: Furcy Pin
>            Priority: Minor
>
> In PySpark, the `pyspark.sql.functions` module imports and exposes the
> function `typing.cast`. This may lead to errors from users that can be hard
> to spot.
> *Example*
> It took me a few minutes to understand why the following code:
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql import functions as f
> spark = SparkSession.builder.getOrCreate()
> df = spark.sql("""SELECT 1 as a""")
> df.withColumn("a", f.cast("STRING", f.col("a"))).printSchema() {code}
> which executes without any problem, gives the following result:
> {code:java}
> root
>  |-- a: integer (nullable = false){code}
> This is because `f.cast` here calls `typing.cast`, and the correct syntax is:
> {code:java}
> df.withColumn("a", f.col("a").cast("STRING")).printSchema(){code}
> which indeed gives:
> {code:java}
> root
>  |-- a: string (nullable = false) {code}
> *Suggested solutions*
> Option 1: the names imported in the `pyspark.sql.functions` module could be
> obfuscated to prevent this. For instance:
> {code:java}
> from typing import cast as _cast{code}
> Option 2: only import `typing` and replace all occurrences of `cast` with
> `typing.cast`
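The silent failure mode above comes from `typing.cast` being a runtime no-op, which a short plain-Python sketch makes explicit:

```python
# typing.cast(typ, val) exists only for static type checkers: at runtime it
# simply returns val unchanged, with no check and no conversion. So the
# accidental f.cast("STRING", col) call "succeeds" while doing nothing.
from typing import cast

result = cast(str, 123)      # looks like a type conversion...
assert result == 123         # ...but the value is returned untouched
assert type(result) is int   # still an int, despite "cast to str"
```

This is why the wrong call in the report executed without error yet left the column as an integer: the re-exported `cast` never touches the Column at all.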
[jira] [Updated] (SPARK-42407) `with as` executed again
[ https://issues.apache.org/jira/browse/SPARK-42407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-42407: - Priority: Major (was: Critical) > `with as` executed again > > > Key: SPARK-42407 > URL: https://issues.apache.org/jira/browse/SPARK-42407 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.3 >Reporter: yiku123 >Priority: Major > > When 'with as' is used multiple times, it will be executed again each time > without saving the results of 'with as', resulting in low efficiency. > Will you consider improving the behavior of 'with as'? > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42193) dataframe API filter criteria throwing ParseException when reading a JDBC column name with special characters
[ https://issues.apache.org/jira/browse/SPARK-42193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687756#comment-17687756 ] Max Gekk commented on SPARK-42193: -- I haven't reproduced the issue on the recent master. Seems like it has been already fixed by [~huaxingao] in https://issues.apache.org/jira/browse/SPARK-41990 also cc [~dongjoon] > dataframe API filter criteria throwing ParseException when reading a JDBC > column name with special characters > - > > Key: SPARK-42193 > URL: https://issues.apache.org/jira/browse/SPARK-42193 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Shanmugavel Kuttiyandi Chandrakasu >Priority: Minor > > *On Spark 3.3.0,* when reading from a JDBC table(used SQLite to repro) using > spark.read.jdbc command with sqlite-jdbc:3.34.0.jar on a table and column > name containing special characters. Dataframe API filter criteria fails with > parse Exception > *[#Script:]* > {code:java} > from pyspark.sql import SparkSession > spark = SparkSession \ > .builder \ > .appName("Databricks Support") \ > .config("spark.jars.packages", "org.xerial:sqlite-jdbc:3.34.0") \ > .getOrCreate() > columns = ["id", "/abc/column", "value"] > data = [(1, 'A', 100), (2, 'B', 200), (3, 'B', 300)] > rdd = spark.sparkContext.parallelize(data) > df = spark.createDataFrame(rdd).toDF(*columns) > options = {"url": > "jdbc:sqlite://spark-3.3.1-bin-hadoop3/jars/test.db", "dbtable": > '"/abc/table"', "driver": "org.sqlite.JDBC"} > df.coalesce(1).write.format("jdbc").options(**options).mode("append").save() > df_1 = spark.read.format("jdbc") \ > .option("url", > "jdbc:sqlite://spark-3.3.1-bin-hadoop3/jars/test.db") \ > .option("dbtable", '"/abc/table"') \ > .option("driver", "org.sqlite.JDBC") \ > .load() > df_2 = df_1.filter("`/abc/column` = 'B'") > df_2.show() {code} > Error: > {code:java} > ``` Traceback (most recent call last): > File "", line 1, in > File > 
"/opt/homebrew/Cellar/apache-spark/3.3.1/libexec/python/pyspark/sql/dataframe.py", > line 606, in show > print(self._jdf.showString(n, 20, vertical)) > File > "/opt/homebrew/Cellar/apache-spark/3.3.1/libexec/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", > line 1321, in __call__ > File > "/opt/homebrew/Cellar/apache-spark/3.3.1/libexec/python/pyspark/sql/utils.py", > line 196, in deco > raise converted from None > pyspark.sql.utils.ParseException: > Syntax error at or near '/': extra input '/'(line 1, pos 0) > == SQL == > /abc/column > ^^^``` {code} > However, when using Spark 3.2.1, we are able to successfully apply > dataframe.filter option > {code:java} > >>> df_2.show() > +---+---+-+ > | id|/abc/column|value| > +---+---+-+ > | 2| B| 200| > | 3| B| 300| > +---+---+-+ {code} > *Repro steps:* > # Download [Spark 3.2.1 in local > |https://archive.apache.org/dist/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz] > # Download and Copy the sqlite-jdbc:3.34.0.jar into the jar folder present > in the local spark download folder > # Run the above [#script] by providing the jar path > # This will create a */abc/table* with column */abc/column* and returns > result when applying filter criteria > # Download spark ** [3.3.0 in > local|https://www.apache.org/dyn/closer.lua/spark/spark-3.3.1/spark-3.3.1-bin-hadoop3.tgz] > # Repeat #2, #3 > # Fails with parse exception. > could you please let us know how we can filter on the special characters > column or escape them on spark version 3.3.0? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
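SQLite itself has no trouble with identifiers containing slashes as long as they are double-quoted, which is what Spark's JDBC pushdown must emit for `/abc/column`. The snippet below uses Python's stdlib `sqlite3` rather than Spark purely to confirm the database-side behavior; the table and column names mirror the repro script above.

```python
import sqlite3

# SQLite accepts identifiers with special characters when they are
# double-quoted, so the failure in the report is on the Spark side
# (its SQL parser rejecting the unquoted `/abc/column`), not SQLite's.
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE "/abc/table" (id INTEGER, "/abc/column" TEXT, value INTEGER)'
)
conn.executemany(
    'INSERT INTO "/abc/table" VALUES (?, ?, ?)',
    [(1, "A", 100), (2, "B", 200), (3, "B", 300)],
)
rows = conn.execute(
    'SELECT id, value FROM "/abc/table" WHERE "/abc/column" = ? ORDER BY id',
    ("B",),
).fetchall()
# rows is [(2, 200), (3, 300)], matching the Spark 3.2.1 filter result
```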
[jira] [Assigned] (SPARK-42417) Upgrade `netty` to version 4.1.88.Final
[ https://issues.apache.org/jira/browse/SPARK-42417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42417: Assignee: (was: Apache Spark) > Upgrade `netty` to version 4.1.88.Final > > > Key: SPARK-42417 > URL: https://issues.apache.org/jira/browse/SPARK-42417 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42415) The built-in dialects support OFFSET and paging query.
[ https://issues.apache.org/jira/browse/SPARK-42415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42415: Assignee: Apache Spark > The built-in dialects support OFFSET and paging query. > -- > > Key: SPARK-42415 > URL: https://issues.apache.org/jira/browse/SPARK-42415 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: jiaan.geng >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42415) The built-in dialects support OFFSET and paging query.
[ https://issues.apache.org/jira/browse/SPARK-42415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687753#comment-17687753 ] Apache Spark commented on SPARK-42415: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/39990 > The built-in dialects support OFFSET and paging query. > -- > > Key: SPARK-42415 > URL: https://issues.apache.org/jira/browse/SPARK-42415 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: jiaan.geng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42415) The built-in dialects support OFFSET and paging query.
[ https://issues.apache.org/jira/browse/SPARK-42415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42415: Assignee: (was: Apache Spark) > The built-in dialects support OFFSET and paging query. > -- > > Key: SPARK-42415 > URL: https://issues.apache.org/jira/browse/SPARK-42415 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.0 >Reporter: jiaan.geng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42417) Upgrade `netty` to version 4.1.88.Final
[ https://issues.apache.org/jira/browse/SPARK-42417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687754#comment-17687754 ] Apache Spark commented on SPARK-42417: -- User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/39989 > Upgrade `netty` to version 4.1.88.Final > > > Key: SPARK-42417 > URL: https://issues.apache.org/jira/browse/SPARK-42417 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42417) Upgrade `netty` to version 4.1.88.Final
[ https://issues.apache.org/jira/browse/SPARK-42417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42417: Assignee: Apache Spark > Upgrade `netty` to version 4.1.88.Final > > > Key: SPARK-42417 > URL: https://issues.apache.org/jira/browse/SPARK-42417 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42418) Updating PySpark documentation to support new users better
Allan Folting created SPARK-42418: - Summary: Updating PySpark documentation to support new users better Key: SPARK-42418 URL: https://issues.apache.org/jira/browse/SPARK-42418 Project: Spark Issue Type: Documentation Components: PySpark Affects Versions: 3.4.0 Reporter: Allan Folting This is the first of a series of updates to the PySpark documentation site to better guide new users on what to use and when as well as help improve discoverability of related pages/resources. * Add "Overview" to the top navigation bar to make it easy to get back to the main page (clicking the logo is not super discoverable) * Break architecture image into separate, clickable parts for easy navigation to information for each part * Added links to related topics under each area description * Added date and version to the page -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42416) Dataset operations should not resolve the analyzed logical plan again
[ https://issues.apache.org/jira/browse/SPARK-42416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-42416: --- Summary: Dateset operations should not resolve the analyzed logical plan again (was: Dateset.show() should not resolve the analyzed logical plan again) > Dateset operations should not resolve the analyzed logical plan again > - > > Key: SPARK-42416 > URL: https://issues.apache.org/jira/browse/SPARK-42416 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > For the following query > > {code:java} > sql( > """ > |CREATE TABLE app_open ( > | uid STRING, > | st TIMESTAMP, > | ds INT > |) USING parquet PARTITIONED BY (ds); > |""".stripMargin) > sql( > """ > |create or replace temporary view group_by_error as WITH > new_app_open AS ( > | SELECT > |ao.* > | FROM > |app_open ao > |) > |SELECT > |uid, > |20230208 AS ds > | FROM > |new_app_open > | GROUP BY > |1, > |2 > |""".stripMargin) > sql( > """ > |select > | `uid` > |from > | group_by_error > |""".stripMargin).show(){code} > Spark will throw the following error > > > {code:java} > [GROUP_BY_POS_OUT_OF_RANGE] GROUP BY position 20230208 is not in select list > (valid range is [1, 2]).; line 9 pos 4 {code} > > > This is because the logical plan is not set as analyzed and it is analyzed > again. The analyzer rules about aggregation/sort ordinals are not idempotent. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
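The non-idempotence described above can be sketched without Spark: a GROUP BY ordinal rule substitutes select-list expressions for integer positions, so running it a second time over an already-analyzed plan misreads the substituted literal `20230208` as an ordinal and bounds-checks it. This is an illustrative miniature, not Spark's analyzer code.

```python
# Miniature of the non-idempotent GROUP BY ordinal rule described above.
def resolve_ordinals(select_list, group_by):
    """Replace 1-based integer ordinals with select-list expressions."""
    resolved = []
    for item in group_by:
        if isinstance(item, int):  # integers are treated as ordinals
            if not 1 <= item <= len(select_list):
                raise ValueError(
                    f"[GROUP_BY_POS_OUT_OF_RANGE] GROUP BY position {item} is "
                    f"not in select list (valid range is [1, {len(select_list)}])"
                )
            resolved.append(select_list[item - 1])
        else:
            resolved.append(item)
    return resolved

select_list = ["uid", 20230208]              # SELECT uid, 20230208 AS ds
once = resolve_ordinals(select_list, [1, 2])  # first pass: ["uid", 20230208]

# Re-running the rule on the already-resolved plan reproduces the bug:
# the literal 20230208 is mistaken for an ordinal and rejected.
second_pass_error = None
try:
    resolve_ordinals(select_list, once)
except ValueError as err:
    second_pass_error = str(err)
```

Marking the plan as analyzed, as the fix does, prevents the second pass from ever running.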
[jira] [Created] (SPARK-42417) Upgrade `netty` to version 4.1.88.Final
BingKun Pan created SPARK-42417: --- Summary: Upgrade `netty` to version 4.1.88.Final Key: SPARK-42417 URL: https://issues.apache.org/jira/browse/SPARK-42417 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.5.0 Reporter: BingKun Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42416) Dataset.show() should not resolve the analyzed logical plan again
[ https://issues.apache.org/jira/browse/SPARK-42416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42416: Assignee: Gengliang Wang (was: Apache Spark) > Dateset.show() should not resolve the analyzed logical plan again > - > > Key: SPARK-42416 > URL: https://issues.apache.org/jira/browse/SPARK-42416 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > For the following query > > {code:java} > sql( > """ > |CREATE TABLE app_open ( > | uid STRING, > | st TIMESTAMP, > | ds INT > |) USING parquet PARTITIONED BY (ds); > |""".stripMargin) > sql( > """ > |create or replace temporary view group_by_error as WITH > new_app_open AS ( > | SELECT > |ao.* > | FROM > |app_open ao > |) > |SELECT > |uid, > |20230208 AS ds > | FROM > |new_app_open > | GROUP BY > |1, > |2 > |""".stripMargin) > sql( > """ > |select > | `uid` > |from > | group_by_error > |""".stripMargin).show(){code} > Spark will throw the following error > > > {code:java} > [GROUP_BY_POS_OUT_OF_RANGE] GROUP BY position 20230208 is not in select list > (valid range is [1, 2]).; line 9 pos 4 {code} > > > This is because the logical plan is not set as analyzed and it is analyzed > again. The analyzer rules about aggregation/sort ordinals are not idempotent. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42416) Dataset.show() should not resolve the analyzed logical plan again
[ https://issues.apache.org/jira/browse/SPARK-42416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42416: Assignee: Apache Spark (was: Gengliang Wang) > Dateset.show() should not resolve the analyzed logical plan again > - > > Key: SPARK-42416 > URL: https://issues.apache.org/jira/browse/SPARK-42416 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Major > > For the following query > > {code:java} > sql( > """ > |CREATE TABLE app_open ( > | uid STRING, > | st TIMESTAMP, > | ds INT > |) USING parquet PARTITIONED BY (ds); > |""".stripMargin) > sql( > """ > |create or replace temporary view group_by_error as WITH > new_app_open AS ( > | SELECT > |ao.* > | FROM > |app_open ao > |) > |SELECT > |uid, > |20230208 AS ds > | FROM > |new_app_open > | GROUP BY > |1, > |2 > |""".stripMargin) > sql( > """ > |select > | `uid` > |from > | group_by_error > |""".stripMargin).show(){code} > Spark will throw the following error > > > {code:java} > [GROUP_BY_POS_OUT_OF_RANGE] GROUP BY position 20230208 is not in select list > (valid range is [1, 2]).; line 9 pos 4 {code} > > > This is because the logical plan is not set as analyzed and it is analyzed > again. The analyzer rules about aggregation/sort ordinals are not idempotent. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42416) Dataset.show() should not resolve the analyzed logical plan again
[ https://issues.apache.org/jira/browse/SPARK-42416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687747#comment-17687747 ] Apache Spark commented on SPARK-42416: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/39988 > Dateset.show() should not resolve the analyzed logical plan again > - > > Key: SPARK-42416 > URL: https://issues.apache.org/jira/browse/SPARK-42416 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > For the following query > > {code:java} > sql( > """ > |CREATE TABLE app_open ( > | uid STRING, > | st TIMESTAMP, > | ds INT > |) USING parquet PARTITIONED BY (ds); > |""".stripMargin) > sql( > """ > |create or replace temporary view group_by_error as WITH > new_app_open AS ( > | SELECT > |ao.* > | FROM > |app_open ao > |) > |SELECT > |uid, > |20230208 AS ds > | FROM > |new_app_open > | GROUP BY > |1, > |2 > |""".stripMargin) > sql( > """ > |select > | `uid` > |from > | group_by_error > |""".stripMargin).show(){code} > Spark will throw the following error > > > {code:java} > [GROUP_BY_POS_OUT_OF_RANGE] GROUP BY position 20230208 is not in select list > (valid range is [1, 2]).; line 9 pos 4 {code} > > > This is because the logical plan is not set as analyzed and it is analyzed > again. The analyzer rules about aggregation/sort ordinals are not idempotent. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14922) Alter Table Drop Partition Using Predicate-based Partition Spec
[ https://issues.apache.org/jira/browse/SPARK-14922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-14922: -- Target Version/s: (was: 3.5.0) > Alter Table Drop Partition Using Predicate-based Partition Spec > --- > > Key: SPARK-14922 > URL: https://issues.apache.org/jira/browse/SPARK-14922 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.1.2, 2.2.1 >Reporter: Xiao Li >Priority: Major > > Below is allowed in Hive, but not allowed in Spark. > {noformat} > alter table ptestfilter drop partition (c='US', d<'2') > {noformat} > This example is copied from drop_partitions_filter.q -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41053) Better Spark UI scalability and Driver stability for large applications
[ https://issues.apache.org/jira/browse/SPARK-41053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-41053: -- Fix Version/s: 3.4.0 > Better Spark UI scalability and Driver stability for large applications > --- > > Key: SPARK-41053 > URL: https://issues.apache.org/jira/browse/SPARK-41053 > Project: Spark > Issue Type: Umbrella > Components: Spark Core, Web UI >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Labels: releasenotes > Fix For: 3.4.0 > > Attachments: Better Spark UI scalability and Driver stability for > large applications.pdf > > > After SPARK-18085, the Spark history server(SHS) becomes more scalable for > processing large applications by supporting a persistent > KV-store(LevelDB/RocksDB) as the storage layer. > As for the live Spark UI, all the data is still stored in memory, which can > bring memory pressures to the Spark driver for large applications. > For better Spark UI scalability and Driver stability, I propose to > * {*}Support storing all the UI data in a persistent KV store{*}. > RocksDB/LevelDB provides low memory overhead. Their write/read performance is > fast enough to serve the write/read workload for live UI. SHS can leverage > the persistent KV store to fasten its startup. > * *Support a new Protobuf serializer for all the UI data.* The new > serializer is supposed to be faster, according to benchmarks. It will be the > default serializer for the persistent KV store of live UI. As for event logs, > it is optional. The current serializer for UI data is JSON. When writing > persistent KV-store, there is GZip compression. Since there is compression > support in RocksDB/LevelDB, the new serializer won’t compress the output > before writing to the persistent KV store. 
Here is a benchmark of > writing/reading 100,000 SQLExecutionUIData to/from RocksDB: > > |*Serializer*|*Avg Write time(μs)*|*Avg Read time(μs)*|*RocksDB File Total > Size(MB)*|*Result total size in memory(MB)*| > |*Spark’s KV Serializer(JSON+gzip)*|352.2|119.26|837|868| > |*Protobuf*|109.9|34.3|858|2105| > I am also proposing to support RocksDB instead of both LevelDB & RocksDB in > the live UI. > SPIP: > [https://docs.google.com/document/d/1cuKnFwlTodyVhUQPMuakq2YDaLH05jaY9FRu_aD1zMo/edit?usp=sharing] > SPIP vote: https://lists.apache.org/thread/lom4zcob6237q6nnj46jylkzwmmsxvgj -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
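The trade-off in the benchmark table above (JSON+gzip is compact on disk but slow to encode and decode; Protobuf is roughly 3x faster but larger in memory) can be felt with a stdlib-only sketch of the JSON+gzip path. The payload shape below is invented for illustration and is not the real `SQLExecutionUIData` schema.

```python
import gzip
import json

# Sketch of the current "JSON + gzip" serializer path described above:
# encode a UI record as JSON, then compress before writing to the KV store.
ui_row = {
    "executionId": 1,
    "description": "collect at <console>:1",
    "jobs": [1, 2, 3],
    "completionTime": 1676246400000,
}

blob = gzip.compress(json.dumps(ui_row).encode("utf-8"))        # write path
restored = json.loads(gzip.decompress(blob).decode("utf-8"))    # read path
assert restored == ui_row  # lossless round trip, at CPU cost on both ends
```

The proposal skips the compression step for the Protobuf serializer because RocksDB/LevelDB already compress blocks internally.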
[jira] [Commented] (SPARK-41053) Better Spark UI scalability and Driver stability for large applications
[ https://issues.apache.org/jira/browse/SPARK-41053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687745#comment-17687745 ] Dongjoon Hyun commented on SPARK-41053: --- Thank you for leading and completing this, [~Gengliang.Wang]. I assigned this issue to Gengliang to shine his leadership. Thank you all. > Better Spark UI scalability and Driver stability for large applications > --- > > Key: SPARK-41053 > URL: https://issues.apache.org/jira/browse/SPARK-41053 > Project: Spark > Issue Type: Umbrella > Components: Spark Core, Web UI >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Labels: releasenotes > Attachments: Better Spark UI scalability and Driver stability for > large applications.pdf > > > After SPARK-18085, the Spark history server(SHS) becomes more scalable for > processing large applications by supporting a persistent > KV-store(LevelDB/RocksDB) as the storage layer. > As for the live Spark UI, all the data is still stored in memory, which can > bring memory pressures to the Spark driver for large applications. > For better Spark UI scalability and Driver stability, I propose to > * {*}Support storing all the UI data in a persistent KV store{*}. > RocksDB/LevelDB provides low memory overhead. Their write/read performance is > fast enough to serve the write/read workload for live UI. SHS can leverage > the persistent KV store to fasten its startup. > * *Support a new Protobuf serializer for all the UI data.* The new > serializer is supposed to be faster, according to benchmarks. It will be the > default serializer for the persistent KV store of live UI. As for event logs, > it is optional. The current serializer for UI data is JSON. When writing > persistent KV-store, there is GZip compression. Since there is compression > support in RocksDB/LevelDB, the new serializer won’t compress the output > before writing to the persistent KV store. 
Here is a benchmark of > writing/reading 100,000 SQLExecutionUIData to/from RocksDB: > > |*Serializer*|*Avg Write time(μs)*|*Avg Read time(μs)*|*RocksDB File Total > Size(MB)*|*Result total size in memory(MB)*| > |*Spark’s KV Serializer(JSON+gzip)*|352.2|119.26|837|868| > |*Protobuf*|109.9|34.3|858|2105| > I am also proposing to support RocksDB instead of both LevelDB & RocksDB in > the live UI. > SPIP: > [https://docs.google.com/document/d/1cuKnFwlTodyVhUQPMuakq2YDaLH05jaY9FRu_aD1zMo/edit?usp=sharing] > SPIP vote: https://lists.apache.org/thread/lom4zcob6237q6nnj46jylkzwmmsxvgj -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41053) Better Spark UI scalability and Driver stability for large applications
[ https://issues.apache.org/jira/browse/SPARK-41053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-41053: - Assignee: Gengliang Wang (was: Apache Spark) > Better Spark UI scalability and Driver stability for large applications > --- > > Key: SPARK-41053 > URL: https://issues.apache.org/jira/browse/SPARK-41053 > Project: Spark > Issue Type: Umbrella > Components: Spark Core, Web UI >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Labels: releasenotes > Attachments: Better Spark UI scalability and Driver stability for > large applications.pdf > > > After SPARK-18085, the Spark history server(SHS) becomes more scalable for > processing large applications by supporting a persistent > KV-store(LevelDB/RocksDB) as the storage layer. > As for the live Spark UI, all the data is still stored in memory, which can > bring memory pressures to the Spark driver for large applications. > For better Spark UI scalability and Driver stability, I propose to > * {*}Support storing all the UI data in a persistent KV store{*}. > RocksDB/LevelDB provides low memory overhead. Their write/read performance is > fast enough to serve the write/read workload for live UI. SHS can leverage > the persistent KV store to fasten its startup. > * *Support a new Protobuf serializer for all the UI data.* The new > serializer is supposed to be faster, according to benchmarks. It will be the > default serializer for the persistent KV store of live UI. As for event logs, > it is optional. The current serializer for UI data is JSON. When writing > persistent KV-store, there is GZip compression. Since there is compression > support in RocksDB/LevelDB, the new serializer won’t compress the output > before writing to the persistent KV store. 
Here is a benchmark of > writing/reading 100,000 SQLExecutionUIData to/from RocksDB: > > |*Serializer*|*Avg Write time(μs)*|*Avg Read time(μs)*|*RocksDB File Total > Size(MB)*|*Result total size in memory(MB)*| > |*Spark’s KV Serializer(JSON+gzip)*|352.2|119.26|837|868| > |*Protobuf*|109.9|34.3|858|2105| > I am also proposing to support RocksDB instead of both LevelDB & RocksDB in > the live UI. > SPIP: > [https://docs.google.com/document/d/1cuKnFwlTodyVhUQPMuakq2YDaLH05jaY9FRu_aD1zMo/edit?usp=sharing] > SPIP vote: https://lists.apache.org/thread/lom4zcob6237q6nnj46jylkzwmmsxvgj -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34645) [K8S] Driver pod stuck in Running state after job completes
[ https://issues.apache.org/jira/browse/SPARK-34645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687744#comment-17687744 ] Dongjoon Hyun commented on SPARK-34645: --- Thank you for sharing your experience and several combinations you tried, [~hussein-awala]. - Is the JVM terminated? - If not, what kind of JVM threads do you see in the driver pod? > [K8S] Driver pod stuck in Running state after job completes > --- > > Key: SPARK-34645 > URL: https://issues.apache.org/jira/browse/SPARK-34645 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.0.2 > Environment: Kubernetes: > {code:java} > Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.2", > GitCommit:"f5743093fd1c663cb0cbc89748f730662345d44d", GitTreeState:"clean", > BuildDate:"2020-09-16T13:41:02Z", GoVersion:"go1.15", Compiler:"gc", > Platform:"linux/amd64"} > Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.5", > GitCommit:"2166946f41b36dea2c4626f90a77706f426cdea2", GitTreeState:"clean", > BuildDate:"2019-03-25T15:19:22Z", GoVersion:"go1.11.5", Compiler:"gc", > Platform:"linux/amd64"} > {code} >Reporter: Andy Grove >Priority: Major > > I am running automated benchmarks in k8s, using spark-submit in cluster mode, > so the driver runs in a pod. > When running with Spark 3.0.1 and 3.1.1 everything works as expected and I > see the Spark context being shut down after the job completes. > However, when running with Spark 3.0.2 I do not see the context get shut down > and the driver pod is stuck in the Running state indefinitely. > This is the output I see after job completion with 3.0.1 and 3.1.1 and this > output does not appear with 3.0.2. With 3.0.2 there is no output at all after > the job completes. 
> {code:java} > 2021-03-05 20:09:24,576 INFO spark.SparkContext: Invoking stop() from > shutdown hook > 2021-03-05 20:09:24,592 INFO server.AbstractConnector: Stopped > Spark@784499d0{HTTP/1.1, (http/1.1)}{0.0.0.0:4040} > 2021-03-05 20:09:24,594 INFO ui.SparkUI: Stopped Spark web UI at > http://benchmark-runner-3e8a38780400e0d1-driver-svc.default.svc:4040 > 2021-03-05 20:09:24,599 INFO k8s.KubernetesClusterSchedulerBackend: Shutting > down all executors > 2021-03-05 20:09:24,600 INFO > k8s.KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each > executor to shut down > 2021-03-05 20:09:24,609 WARN k8s.ExecutorPodsWatchSnapshotSource: Kubernetes > client has been closed (this is expected if the application is shutting down.) > 2021-03-05 20:09:24,719 INFO spark.MapOutputTrackerMasterEndpoint: > MapOutputTrackerMasterEndpoint stopped! > 2021-03-05 20:09:24,736 INFO memory.MemoryStore: MemoryStore cleared > 2021-03-05 20:09:24,738 INFO storage.BlockManager: BlockManager stopped > 2021-03-05 20:09:24,744 INFO storage.BlockManagerMaster: BlockManagerMaster > stopped > 2021-03-05 20:09:24,752 INFO > scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: > OutputCommitCoordinator stopped! > 2021-03-05 20:09:24,768 INFO spark.SparkContext: Successfully stopped > SparkContext > 2021-03-05 20:09:24,768 INFO util.ShutdownHookManager: Shutdown hook called > 2021-03-05 20:09:24,769 INFO util.ShutdownHookManager: Deleting directory > /var/data/spark-67fa44df-e86c-463a-a149-25d95817ff8e/spark-a5476c14-c103-4108-b733-961400485d8a > 2021-03-05 20:09:24,772 INFO util.ShutdownHookManager: Deleting directory > /tmp/spark-9d6261f5-4394-472b-9c9a-e22bde877814 > 2021-03-05 20:09:24,778 INFO impl.MetricsSystemImpl: Stopping s3a-file-system > metrics system... > 2021-03-05 20:09:24,779 INFO impl.MetricsSystemImpl: s3a-file-system metrics > system stopped. > 2021-03-05 20:09:24,779 INFO impl.MetricsSystemImpl: s3a-file-system metrics > system shutdown complete. 
> {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41775) Implement training functions as input
[ https://issues.apache.org/jira/browse/SPARK-41775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687743#comment-17687743 ] Apache Spark commented on SPARK-41775: -- User 'rithwik-db' has created a pull request for this issue: https://github.com/apache/spark/pull/39987 > Implement training functions as input > - > > Key: SPARK-41775 > URL: https://issues.apache.org/jira/browse/SPARK-41775 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.4.0 >Reporter: Rithwik Ediga Lakhamsani >Assignee: Rithwik Ediga Lakhamsani >Priority: Major > Fix For: 3.4.0 > > > Sidenote: make formatting updates described in > https://github.com/apache/spark/pull/39188 > > Currently, `Distributor().run(...)` takes only files as input. Now we will > add in additional functionality to take in functions as well. This will > require us to go through the following process on each task in the executor > nodes: > 1. take the input function and args and pickle them > 2. Create a temp train.py file that looks like > {code:java} > import cloudpickle > import os > if _name_ == "_main_": > train, args = cloudpickle.load(f"{tempdir}/train_input.pkl") > output = train(*args) > if output and os.environ.get("RANK", "") == "0": # this is for > partitionId == 0 > cloudpickle.dump(f"{tempdir}/train_output.pkl") {code} > 3. Run that train.py file with `torchrun` > 4. Check if `train_output.pkl` has been created on process on partitionId == > 0, if it has, then deserialize it and return that output through `.collect()` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
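The four steps in the description above can be sketched end-to-end with the stdlib `pickle` module standing in for cloudpickle (cloudpickle additionally serializes lambdas and closures by value, which plain `pickle` cannot). The `train` function, its arguments, and the file names follow the description; the `RANK` default is an adaptation so the sketch runs standalone, and everything runs in one process rather than under `torchrun`.

```python
import os
import pickle
import tempfile

def train(lr, epochs):
    # Stand-in training function; returns a fake "model" dict.
    return {"lr": lr, "epochs": epochs, "loss": 0.1}

tempdir = tempfile.mkdtemp()

# Step 1: pickle the function and its args (cloudpickle in the real design).
with open(os.path.join(tempdir, "train_input.pkl"), "wb") as f:
    pickle.dump((train, (0.01, 5)), f)

# Steps 2-3: what the generated train.py does under torchrun, inlined here.
with open(os.path.join(tempdir, "train_input.pkl"), "rb") as f:
    fn, args = pickle.load(f)
output = fn(*args)
if output is not None and os.environ.get("RANK", "0") == "0":  # partition 0 only
    with open(os.path.join(tempdir, "train_output.pkl"), "wb") as f:
        pickle.dump(output, f)

# Step 4: the driver deserializes the rank-0 output (via .collect() in Spark).
with open(os.path.join(tempdir, "train_output.pkl"), "rb") as f:
    result = pickle.load(f)
```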
[jira] [Assigned] (SPARK-42323) Assign name to _LEGACY_ERROR_TEMP_2332
[ https://issues.apache.org/jira/browse/SPARK-42323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42323: Assignee: (was: Apache Spark) > Assign name to _LEGACY_ERROR_TEMP_2332 > -- > > Key: SPARK-42323 > URL: https://issues.apache.org/jira/browse/SPARK-42323 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major >
[jira] [Assigned] (SPARK-42323) Assign name to _LEGACY_ERROR_TEMP_2332
[ https://issues.apache.org/jira/browse/SPARK-42323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42323: Assignee: Apache Spark > Assign name to _LEGACY_ERROR_TEMP_2332 > -- > > Key: SPARK-42323 > URL: https://issues.apache.org/jira/browse/SPARK-42323 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Apache Spark >Priority: Major >
[jira] [Commented] (SPARK-42323) Assign name to _LEGACY_ERROR_TEMP_2332
[ https://issues.apache.org/jira/browse/SPARK-42323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687742#comment-17687742 ] Apache Spark commented on SPARK-42323: -- User 'itholic' has created a pull request for this issue: https://github.com/apache/spark/pull/39977 > Assign name to _LEGACY_ERROR_TEMP_2332 > -- > > Key: SPARK-42323 > URL: https://issues.apache.org/jira/browse/SPARK-42323 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major >
[jira] [Assigned] (SPARK-42416) Dataset.show() should not resolve the analyzed logical plan again
[ https://issues.apache.org/jira/browse/SPARK-42416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang reassigned SPARK-42416: -- Assignee: Gengliang Wang > Dataset.show() should not resolve the analyzed logical plan again > - > > Key: SPARK-42416 > URL: https://issues.apache.org/jira/browse/SPARK-42416 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > For the following query > > {code:java} > sql( > """ > |CREATE TABLE app_open ( > | uid STRING, > | st TIMESTAMP, > | ds INT > |) USING parquet PARTITIONED BY (ds); > |""".stripMargin) > sql( > """ > |create or replace temporary view group_by_error as WITH > new_app_open AS ( > | SELECT > |ao.* > | FROM > |app_open ao > |) > |SELECT > |uid, > |20230208 AS ds > | FROM > |new_app_open > | GROUP BY > |1, > |2 > |""".stripMargin) > sql( > """ > |select > | `uid` > |from > | group_by_error > |""".stripMargin).show(){code} > Spark will throw the following error > > > {code:java} > [GROUP_BY_POS_OUT_OF_RANGE] GROUP BY position 20230208 is not in select list > (valid range is [1, 2]).; line 9 pos 4 {code} > > > This is because the logical plan is not set as analyzed and it is analyzed > again. The analyzer rules about aggregation/sort ordinals are not idempotent.
[jira] [Created] (SPARK-42416) Dataset.show() should not resolve the analyzed logical plan again
Gengliang Wang created SPARK-42416: -- Summary: Dataset.show() should not resolve the analyzed logical plan again Key: SPARK-42416 URL: https://issues.apache.org/jira/browse/SPARK-42416 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.4.0 Reporter: Gengliang Wang For the following query {code:java} sql( """ |CREATE TABLE app_open ( | uid STRING, | st TIMESTAMP, | ds INT |) USING parquet PARTITIONED BY (ds); |""".stripMargin) sql( """ |create or replace temporary view group_by_error as WITH new_app_open AS ( | SELECT |ao.* | FROM |app_open ao |) |SELECT |uid, |20230208 AS ds | FROM |new_app_open | GROUP BY |1, |2 |""".stripMargin) sql( """ |select | `uid` |from | group_by_error |""".stripMargin).show(){code} Spark will throw the following error {code:java} [GROUP_BY_POS_OUT_OF_RANGE] GROUP BY position 20230208 is not in select list (valid range is [1, 2]).; line 9 pos 4 {code} This is because the logical plan is not set as analyzed and it is analyzed again. The analyzer rules about aggregation/sort ordinals are not idempotent.
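The non-idempotence can be seen with a toy model of the ordinal-resolution rule (a sketch, not Spark's actual analyzer code): a first pass replaces GROUP BY 1, 2 with the select-list expressions, and a second pass over the already-resolved plan then misreads the resolved literal 20230208 as an out-of-range ordinal.

```python
def resolve_group_by_ordinals(select_list, group_by):
    # Toy version of the analyzer rule: an integer in GROUP BY is treated
    # as a 1-based position into the select list.
    resolved = []
    for item in group_by:
        if isinstance(item, int):
            if not 1 <= item <= len(select_list):
                raise ValueError(
                    f"[GROUP_BY_POS_OUT_OF_RANGE] GROUP BY position {item} "
                    f"is not in select list (valid range is [1, {len(select_list)}])"
                )
            resolved.append(select_list[item - 1])
        else:
            resolved.append(item)
    return resolved

select_list = ["uid", 20230208]  # SELECT uid, 20230208 AS ds
once = resolve_group_by_ordinals(select_list, [1, 2])
# First pass: [1, 2] -> ["uid", 20230208], as intended.
# Applying the rule a second time to the resolved result treats the
# literal 20230208 as an ordinal and fails -- hence the plan must be
# marked analyzed so it is never resolved again.
```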
[jira] [Created] (SPARK-42415) The built-in dialects support OFFSET and paging query.
jiaan.geng created SPARK-42415: -- Summary: The built-in dialects support OFFSET and paging query. Key: SPARK-42415 URL: https://issues.apache.org/jira/browse/SPARK-42415 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.5.0 Reporter: jiaan.geng
[jira] [Resolved] (SPARK-42269) Support complex return types in DDL strings
[ https://issues.apache.org/jira/browse/SPARK-42269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-42269. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39964 [https://github.com/apache/spark/pull/39964] > Support complex return types in DDL strings > --- > > Key: SPARK-42269 > URL: https://issues.apache.org/jira/browse/SPARK-42269 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > Fix For: 3.4.0 > > > {code} > # Spark Connect > >>> spark.range(2).select(udf(lambda x: (x, x), "struct<x:integer, y:integer>")("id")) > ... > AssertionError: returnType should be singular > # vanilla PySpark > >>> spark.range(2).select(udf(lambda x: (x, x), "struct<x:integer, y:integer>")("id")) > DataFrame[<lambda>(id): struct<x:int,y:int>] > {code}
[jira] [Assigned] (SPARK-42269) Support complex return types in DDL strings
[ https://issues.apache.org/jira/browse/SPARK-42269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-42269: Assignee: Xinrong Meng > Support complex return types in DDL strings > --- > > Key: SPARK-42269 > URL: https://issues.apache.org/jira/browse/SPARK-42269 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Xinrong Meng >Priority: Major > > {code} > # Spark Connect > >>> spark.range(2).select(udf(lambda x: (x, x), "struct<x:integer, y:integer>")("id")) > ... > AssertionError: returnType should be singular > # vanilla PySpark > >>> spark.range(2).select(udf(lambda x: (x, x), "struct<x:integer, y:integer>")("id")) > DataFrame[<lambda>(id): struct<x:int,y:int>] > {code}
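One reason nested DDL return types like the one above need real parsing rather than naive string handling: splitting a struct's field list on commas must respect `<...>` nesting. A minimal sketch (illustrative only; the function name is hypothetical and this is not PySpark's actual type parser):

```python
def split_top_level_fields(field_list):
    # Split a DDL struct field list on commas that are NOT nested inside
    # angle brackets, so "b:struct<c:int, d:string>" stays in one piece.
    parts, cur, depth = [], [], 0
    for ch in field_list:
        if ch == "<":
            depth += 1
        elif ch == ">":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(cur).strip())
            cur = []
        else:
            cur.append(ch)
    parts.append("".join(cur).strip())
    return parts
```

A flat `split(",")` would break the nested struct into fragments; depth tracking is the minimum needed to treat `struct<x:integer, y:integer>` as a single (non-singular) return type.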
[jira] [Resolved] (SPARK-42034) QueryExecutionListener and Observation API, df.observe do not work with `foreach` action.
[ https://issues.apache.org/jira/browse/SPARK-42034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-42034. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 39976 [https://github.com/apache/spark/pull/39976] > QueryExecutionListener and Observation API, df.observe do not work with > `foreach` action. > - > > Key: SPARK-42034 > URL: https://issues.apache.org/jira/browse/SPARK-42034 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.3, 3.2.2, 3.3.1 > Environment: I test it locally and on YARN in cluster mode. > Spark 3.3.1 and 3.2.2 and 3.1.1. > Yarn 2.9.2 and 3.2.1. >Reporter: Nick Hryhoriev >Assignee: ming95 >Priority: Major > Labels: sql-api > Fix For: 3.5.0 > > > The Observation API, the {{observe}} dataframe transformation, and custom QueryExecutionListeners do not work with the {{foreach}} or {{foreachPartition}} actions. This is because QueryExecutionListener functions do not trigger on queries whose action is {{foreach}} or {{foreachPartition}}. The Spark GUI SQL tab, however, still treats such a query as a SQL query and shows its query plans. > Here is the code to reproduce it: > https://gist.github.com/GrigorievNick/e7cf9ec5584b417d9719e2812722e6d3
[jira] [Assigned] (SPARK-42034) QueryExecutionListener and Observation API, df.observe do not work with `foreach` action.
[ https://issues.apache.org/jira/browse/SPARK-42034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-42034: Assignee: ming95 > QueryExecutionListener and Observation API, df.observe do not work with > `foreach` action. > - > > Key: SPARK-42034 > URL: https://issues.apache.org/jira/browse/SPARK-42034 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.3, 3.2.2, 3.3.1 > Environment: I test it locally and on YARN in cluster mode. > Spark 3.3.1 and 3.2.2 and 3.1.1. > Yarn 2.9.2 and 3.2.1. >Reporter: Nick Hryhoriev >Assignee: ming95 >Priority: Major > Labels: sql-api > > The Observation API, the {{observe}} dataframe transformation, and custom QueryExecutionListeners do not work with the {{foreach}} or {{foreachPartition}} actions. This is because QueryExecutionListener functions do not trigger on queries whose action is {{foreach}} or {{foreachPartition}}. The Spark GUI SQL tab, however, still treats such a query as a SQL query and shows its query plans. > Here is the code to reproduce it: > https://gist.github.com/GrigorievNick/e7cf9ec5584b417d9719e2812722e6d3
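The reported gap can be modeled with a toy observer pattern (illustrative only; these class names are hypothetical and this is not Spark's implementation): collect-style actions notify registered listeners, while foreach-style actions execute but never fire the callback.

```python
class RecordingListener:
    # Stand-in for a QueryExecutionListener that records successful actions.
    def __init__(self):
        self.seen = []

    def on_success(self, action_name):
        self.seen.append(action_name)

class MiniDataFrame:
    """Toy DataFrame reproducing the behavior described in the issue."""
    def __init__(self, rows, listeners):
        self.rows = rows
        self.listeners = listeners

    def collect(self):
        result = list(self.rows)
        for listener in self.listeners:  # collect notifies listeners
            listener.on_success("collect")
        return result

    def foreach(self, fn):
        for row in self.rows:  # runs the query, but never notifies
            fn(row)            # listeners -- the bug this issue fixes
```

After one `collect()` and one `foreach()`, the listener has seen only `"collect"`, which is exactly why `df.observe` metrics silently go missing under `foreach`.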
[jira] [Resolved] (SPARK-42331) Fix metadata col can not been resolved
[ https://issues.apache.org/jira/browse/SPARK-42331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-42331. - Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39870 [https://github.com/apache/spark/pull/39870] > Fix metadata col can not been resolved > -- > > Key: SPARK-42331 > URL: https://issues.apache.org/jira/browse/SPARK-42331 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: XiDuo You >Assignee: Apache Spark >Priority: Major > Fix For: 3.4.0 > >
[jira] [Assigned] (SPARK-42331) Fix metadata col can not been resolved
[ https://issues.apache.org/jira/browse/SPARK-42331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-42331: --- Assignee: XiDuo You (was: Apache Spark) > Fix metadata col can not been resolved > -- > > Key: SPARK-42331 > URL: https://issues.apache.org/jira/browse/SPARK-42331 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Major > Fix For: 3.4.0 > >
[jira] [Assigned] (SPARK-42331) Fix metadata col can not been resolved
[ https://issues.apache.org/jira/browse/SPARK-42331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-42331: --- Assignee: Apache Spark > Fix metadata col can not been resolved > -- > > Key: SPARK-42331 > URL: https://issues.apache.org/jira/browse/SPARK-42331 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: XiDuo You >Assignee: Apache Spark >Priority: Major >
[jira] [Created] (SPARK-42414) Make suggestion in error message smarter
Haejoon Lee created SPARK-42414: --- Summary: Make suggestion in error message smarter Key: SPARK-42414 URL: https://issues.apache.org/jira/browse/SPARK-42414 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: Haejoon Lee The current suggestion message, e.g. `UNRESOLVED_COLUMN.WITH_SUGGESTION`, just dumps out all columns, so it is not very helpful for users. It would be better to show only the most closely related columns in the user-facing error message.
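One way to realize "most closely related columns only", sketched with stdlib difflib (the function name and cutoff value are illustrative assumptions, not Spark's actual implementation):

```python
import difflib

def suggest_columns(unresolved, columns, n=3):
    # Rank candidate columns by string similarity and surface only the
    # closest matches, instead of dumping every column into the error.
    return difflib.get_close_matches(unresolved, columns, n=n, cutoff=0.4)
```

For a typo like `user_idd` against a wide schema, this yields `user_id` first rather than an undifferentiated list of every column in the relation.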
[jira] [Assigned] (SPARK-42391) Close Live AppStore in the finally block for test cases
[ https://issues.apache.org/jira/browse/SPARK-42391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-42391: - Assignee: Yang Jie > Close Live AppStore in the finally block for test cases > --- > > Key: SPARK-42391 > URL: https://issues.apache.org/jira/browse/SPARK-42391 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > > AppStatusStore#createLiveStore returns a RocksDB-backed AppStatusStore when `LIVE_UI_LOCAL_STORE_DIR` is configured; it should be closed in a finally block to release resources in test cases
[jira] [Resolved] (SPARK-42391) Close Live AppStore in the finally block for test cases
[ https://issues.apache.org/jira/browse/SPARK-42391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-42391. --- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 39961 [https://github.com/apache/spark/pull/39961] > Close Live AppStore in the finally block for test cases > --- > > Key: SPARK-42391 > URL: https://issues.apache.org/jira/browse/SPARK-42391 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.5.0 > > > AppStatusStore#createLiveStore returns a RocksDB-backed AppStatusStore when `LIVE_UI_LOCAL_STORE_DIR` is configured; it should be closed in a finally block to release resources in test cases
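The cleanup pattern being applied, sketched in Python (the helper and store names are hypothetical; the actual change is in Spark's Scala test suites): the store must be closed even when the test body throws, so the RocksDB-backed store's file handles are always released.

```python
def with_store(create_store, test_body):
    # Open the store, run the test, and close the store in a finally
    # block so resources are released even if test_body raises.
    store = create_store()
    try:
        return test_body(store)
    finally:
        store.close()
```

Without the finally block, a failing assertion would leak the store and, with a RocksDB backend, leave its on-disk state locked for subsequent tests.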
[jira] [Commented] (SPARK-42410) Support Scala 2.12/2.13 tests in connect module
[ https://issues.apache.org/jira/browse/SPARK-42410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687679#comment-17687679 ] Apache Spark commented on SPARK-42410: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/39986 > Support Scala 2.12/2.13 tests in connect module > --- > > Key: SPARK-42410 > URL: https://issues.apache.org/jira/browse/SPARK-42410 > Project: Spark > Issue Type: Bug > Components: Connect, Tests >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.4.0 > > > {code} > $ build/sbt -Dscala.version=2.13.8 -Pscala-2.13 -Phadoop-3 assembly/package > "connect/test" > {code}
[jira] [Assigned] (SPARK-42263) Implement `spark.catalog.registerFunction`
[ https://issues.apache.org/jira/browse/SPARK-42263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42263: Assignee: (was: Apache Spark) > Implement `spark.catalog.registerFunction` > -- > > Key: SPARK-42263 > URL: https://issues.apache.org/jira/browse/SPARK-42263 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major >
[jira] [Assigned] (SPARK-42412) Initial prototype implementation for PySparkML
[ https://issues.apache.org/jira/browse/SPARK-42412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42412: Assignee: Apache Spark (was: Weichen Xu) > Initial prototype implementation for PySparkML > -- > > Key: SPARK-42412 > URL: https://issues.apache.org/jira/browse/SPARK-42412 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.5.0 >Reporter: Weichen Xu >Assignee: Apache Spark >Priority: Major >
[jira] [Assigned] (SPARK-42412) Initial prototype implementation for PySparkML
[ https://issues.apache.org/jira/browse/SPARK-42412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42412: Assignee: Weichen Xu (was: Apache Spark) > Initial prototype implementation for PySparkML > -- > > Key: SPARK-42412 > URL: https://issues.apache.org/jira/browse/SPARK-42412 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.5.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Major >
[jira] [Commented] (SPARK-42263) Implement `spark.catalog.registerFunction`
[ https://issues.apache.org/jira/browse/SPARK-42263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687678#comment-17687678 ] Apache Spark commented on SPARK-42263: -- User 'xinrong-meng' has created a pull request for this issue: https://github.com/apache/spark/pull/39984 > Implement `spark.catalog.registerFunction` > -- > > Key: SPARK-42263 > URL: https://issues.apache.org/jira/browse/SPARK-42263 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major >
[jira] [Assigned] (SPARK-42263) Implement `spark.catalog.registerFunction`
[ https://issues.apache.org/jira/browse/SPARK-42263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42263: Assignee: Apache Spark > Implement `spark.catalog.registerFunction` > -- > > Key: SPARK-42263 > URL: https://issues.apache.org/jira/browse/SPARK-42263 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Assignee: Apache Spark >Priority: Major >
[jira] [Commented] (SPARK-42412) Initial prototype implementation for PySparkML
[ https://issues.apache.org/jira/browse/SPARK-42412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687677#comment-17687677 ] Apache Spark commented on SPARK-42412: -- User 'WeichenXu123' has created a pull request for this issue: https://github.com/apache/spark/pull/39985 > Initial prototype implementation for PySparkML > -- > > Key: SPARK-42412 > URL: https://issues.apache.org/jira/browse/SPARK-42412 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.5.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Major >
[jira] [Resolved] (SPARK-42413) Upgrade zstd-jni from 1.5.2-5 to 1.5.4-1
[ https://issues.apache.org/jira/browse/SPARK-42413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BingKun Pan resolved SPARK-42413. - Resolution: Duplicate > Upgrade zstd-jni from 1.5.2-5 to 1.5.4-1 > > > Key: SPARK-42413 > URL: https://issues.apache.org/jira/browse/SPARK-42413 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Priority: Minor >
[jira] [Closed] (SPARK-42413) Upgrade zstd-jni from 1.5.2-5 to 1.5.4-1
[ https://issues.apache.org/jira/browse/SPARK-42413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BingKun Pan closed SPARK-42413. --- Duplicate > Upgrade zstd-jni from 1.5.2-5 to 1.5.4-1 > > > Key: SPARK-42413 > URL: https://issues.apache.org/jira/browse/SPARK-42413 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: BingKun Pan >Priority: Minor >
[jira] [Created] (SPARK-42413) Upgrade zstd-jni from 1.5.2-5 to 1.5.4-1
BingKun Pan created SPARK-42413: --- Summary: Upgrade zstd-jni from 1.5.2-5 to 1.5.4-1 Key: SPARK-42413 URL: https://issues.apache.org/jira/browse/SPARK-42413 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.5.0 Reporter: BingKun Pan
[jira] [Commented] (SPARK-42411) Better support for Istio service mesh while running Spark on Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-42411?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687664#comment-17687664 ] Holden Karau commented on SPARK-42411: -- The other option is something around `spark.network.crypto.enabled` > Better support for Istio service mesh while running Spark on Kubernetes > --- > > Key: SPARK-42411 > URL: https://issues.apache.org/jira/browse/SPARK-42411 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 3.2.3 >Reporter: Puneet >Priority: Major > > h3. Support for Strict MTLS > In strict MTLS Peer Authentication, Istio requires each pod to be associated with a service identity (as this allows listeners to use the correct cert and chain). Without the service identity, communication goes through the passthrough cluster, which is not permitted in strict mode. The community is still investigating communication through IPs with strict MTLS [https://github.com/istio/istio/issues/37431#issuecomment-1412831780]. Today the Spark backend creates a service record for the driver; however, executor pods register with the driver using their Pod IPs. In this model, therefore, the TLS handshake would fail between driver and executor and also between executors. As part of this Jira we want to similarly add service records for the executor pods as well. This can be achieved by adding an ExecutorServiceFeatureStep similar to the existing DriverServiceFeatureStep > h3. Allowing binding to all IPs > Before Istio 1.10 the istio-proxy sidecar was forwarding outside traffic to localhost of the pod. Thus if the application container is binding only to the Pod IP, the traffic would not be forwarded to it. This was addressed in 1.10 [https://istio.io/latest/blog/2021/upcoming-networking-changes]. However, the old behavior is still accessible through disabling the feature flag PILOT_ENABLE_INBOUND_PASSTHROUGH. The request to remove it has had some push back [https://github.com/istio/istio/issues/37642]. In the current implementation the Spark K8s backend does not allow passing a bind address for the driver [https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/DriverServiceFeatureStep.scala#L35]; however, as part of this Jira we want to allow passing a bind address even in Kubernetes mode so long as the bind address is 0.0.0.0. This lets the user choose the behavior depending on the state of PILOT_ENABLE_INBOUND_PASSTHROUGH in her Istio cluster. > h3. Better support for istio-proxy sidecar lifecycle management > In an Istio-enabled cluster, istio-proxy sidecars would be auto-injected into driver/executor pods. If the application is ephemeral, then driver and executor containers would exit; however, the istio-proxy container would continue to run. This causes driver/executor pods to enter the NotReady state. As part of this Jira we want the ability to run a post-stop cleanup after the driver/executor container is completed. Similarly, we also want to add support for a pre-startup script, which can ensure, for example, that the istio-sidecar is up before the executor/driver container gets started.
[jira] [Created] (SPARK-42412) Initial prototype implementation for PySparkML
Weichen Xu created SPARK-42412: -- Summary: Initial prototype implementation for PySparkML Key: SPARK-42412 URL: https://issues.apache.org/jira/browse/SPARK-42412 Project: Spark Issue Type: Sub-task Components: Connect, PySpark Affects Versions: 3.5.0 Reporter: Weichen Xu
[jira] [Updated] (SPARK-42263) Implement `spark.catalog.registerFunction`
[ https://issues.apache.org/jira/browse/SPARK-42263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-42263: - Parent: SPARK-41661 Issue Type: Sub-task (was: Improvement) > Implement `spark.catalog.registerFunction` > -- > > Key: SPARK-42263 > URL: https://issues.apache.org/jira/browse/SPARK-42263 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major >
[jira] [Assigned] (SPARK-42412) Initial prototype implementation for PySparkML
[ https://issues.apache.org/jira/browse/SPARK-42412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu reassigned SPARK-42412: -- Assignee: Weichen Xu > Initial prototype implementation for PySparkML > -- > > Key: SPARK-42412 > URL: https://issues.apache.org/jira/browse/SPARK-42412 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.5.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Major >
[jira] [Updated] (SPARK-42411) Better support for Istio service mesh while running Spark on Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-42411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Puneet updated SPARK-42411: --- Description: h3. Support for Strict MTLS In strict MTLS Peer Authentication Istio requires each pod to be associated with a service identity (as this allows listeners to use the correct cert and chain). Without the service identity communication goes through passthrough cluster which is not permitted in strict mode. Community is still investigating communication through IPs with strict MTLS [https://github.com/istio/istio/issues/37431#issuecomment-1412831780]. Today Spark backend creates a service record for driver however executor pods register with driver using their Pod IPs. In this model therefore, TLS handshake would fail between driver and executor and also between executors. As part of this Jira we want to similarly add service records for the executor pods as well. This can be achieved by adding a ExecutorServiceFeatureStep similar to existing DriverServiceFeatureStep h3. Allowing binding to all IPs Before Istio 1.10 the istio-proxy sidecar was forwarding outside traffic to localhost of the pod. Thus if the application container is binding only to Pod IP the traffic would not be forwarded to it. This was addressed in 1.10 [https://istio.io/latest/blog/2021/upcoming-networking-changes]. However the old behavior is still accessible through disabling the feature flag PILOT_ENABLE_INBOUND_PASSTHROUGH. Request to remove it has had some push back [https://github.com/istio/istio/issues/37642]. In current implementation Spark K8s backend does not allow to pass bind address for driver [https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/DriverServiceFeatureStep.scala#L35] however as part of this Jira we want to allow passing of bind address even in Kubernetes mode so long as the bind address is 0.0.0.0. 
This lets user choose the behavior depending on the state of PILOT_ENABLE_INBOUND_PASSTHROUGH in her Istio cluster. h3. Better support for istio-proxy sidecar lifecycle management In istio-enabled cluster istio-proxy sidecars would be auto-injected to driver/executor pods. If the application is ephemeral then driver and executor containers would exit, however istio-proxy container would continue to run. This causes driver/executor pods to enter NotReady state. As part of this jira we want ability to run a post stop cleanup after driver/executor container is completed. Similarly we also want to add support for a pre start up script, which can ensure for example that istio-sidecar is up before executor/driver container gets started. was: h3. Support for Strict MTLS In strict MTLS Peer Authentication Istio requires each pod to be associated with a service identity (as this allows listeners to use the correct cert and chain). Without the service identity communication goes through passthrough cluster which is not permitted in strict mode. Community is still investigating communication through IPs with strict MTLS https://github.com/istio/istio/issues/37431#issuecomment-1412831780. Today Spark backend creates a service record for driver however executor pods register with pod ip with driver. In this model therefore, TLS handshake would fail between driver and executor and also between executors. As part of this jira we want to similarly add service records for the executor pods as well. This can be achieved by adding a ExecutorServiceFeatureStep similar to existing DriverServiceFeatureStep h3. Allowing binding to all IPs Before Istio 1.10 the istio-proxy sidecar was forwarding outside traffic to localhost of the pod. Thus is the application container is binding only to Pod IP the traffic would not be forwarded to it. This was addressed in 1.10 https://istio.io/latest/blog/2021/upcoming-networking-changes. 
However the old behavior is still accessible through disabling the feature flag PILOT_ENABLE_INBOUND_PASSTHROUGH. Request to remove it has had some push back https://github.com/istio/istio/issues/37642. In current implementation Spark K8s backend does not allow to pass bind address for driver https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/DriverServiceFeatureStep.scala#L35 however as part of this jira we want to allow passing of bind address even in Kubernetes mode so long as the bind address is 0.0.0.0. This lets the user choose the behavior depending on the state of PILOT_ENABLE_INBOUND_PASSTHROUGH in her Istio cluster. h3. Better support for istio-proxy sidecar lifecycle management In istio-enabled cluster istio-proxy sidecars would be auto-injected to driver/executor pods. If the application is ephemeral then driver and executor containers would exit, however istio-proxy container would continue to run.
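The per-executor service idea in the description above can be sketched in isolation. The helper below derives the in-cluster DNS name a driver could use for an executor once a headless Service exists for it; the `-svc` naming convention and the function itself are illustrative assumptions, not the actual ExecutorServiceFeatureStep proposed in this Jira.

```python
# Sketch only: models the "service record per executor" idea from the
# description. The "-svc" suffix is an assumed naming convention; the real
# ExecutorServiceFeatureStep would live in Spark's Kubernetes backend.

def executor_service_dns(pod_name: str, namespace: str) -> str:
    """Return the in-cluster DNS name of a per-executor headless Service.

    With such a Service in place, Istio can tie the executor to a service
    identity instead of a bare Pod IP, which strict MTLS requires.
    """
    service_name = f"{pod_name}-svc"  # assumed convention
    return f"{service_name}.{namespace}.svc.cluster.local"


print(executor_service_dns("spark-exec-1", "spark-apps"))
# spark-exec-1-svc.spark-apps.svc.cluster.local
```

Registering executors under names like this, instead of their Pod IPs, is what would let the TLS handshake succeed between driver and executors in strict mode.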
[jira] [Resolved] (SPARK-42410) Support Scala 2.12/2.13 tests in connect module
[ https://issues.apache.org/jira/browse/SPARK-42410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-42410. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39982 [https://github.com/apache/spark/pull/39982] > Support Scala 2.12/2.13 tests in connect module > --- > > Key: SPARK-42410 > URL: https://issues.apache.org/jira/browse/SPARK-42410 > Project: Spark > Issue Type: Bug > Components: Connect, Tests >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.4.0 > > > {code} > $ build/sbt -Dscala.version=2.13.8 -Pscala-2.13 -Phadoop-3 assembly/package > "connect/test" > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42410) Support Scala 2.12/2.13 tests in connect module
[ https://issues.apache.org/jira/browse/SPARK-42410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-42410: - Assignee: Dongjoon Hyun > Support Scala 2.12/2.13 tests in connect module > --- > > Key: SPARK-42410 > URL: https://issues.apache.org/jira/browse/SPARK-42410 > Project: Spark > Issue Type: Bug > Components: Connect, Tests >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > > {code} > $ build/sbt -Dscala.version=2.13.8 -Pscala-2.13 -Phadoop-3 assembly/package > "connect/test" > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42411) Better support for Spark on Kubernetes while using Istio service mesh
[ https://issues.apache.org/jira/browse/SPARK-42411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Puneet updated SPARK-42411: --- Summary: Better support for Spark on Kubernetes while using Istio service mesh (was: Add support for istio in strict MTLS PeerAuthentication) > Better support for Spark on Kubernetes while using Istio service mesh > - > > Key: SPARK-42411 > URL: https://issues.apache.org/jira/browse/SPARK-42411 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 3.2.3 >Reporter: Puneet >Priority: Major > > h3. Support for Strict MTLS > In strict MTLS Peer Authentication Istio requires each pod to be associated > with a service identity (as this allows listeners to use the correct cert and > chain). Without the service identity communication goes through passthrough > cluster which is not permitted in strict mode. Community is still > investigating communication through IPs with strict MTLS > https://github.com/istio/istio/issues/37431#issuecomment-1412831780. Today > Spark backend creates a service record for driver however executor pods > register with pod ip with driver. In this model therefore, TLS handshake > would fail between driver and executor and also between executors. As part of > this jira we want to similarly add service records for the executor pods as > well. This can be achieved by adding an ExecutorServiceFeatureStep similar to > existing DriverServiceFeatureStep > h3. Allowing binding to all IPs > Before Istio 1.10 the istio-proxy sidecar was forwarding outside traffic to > localhost of the pod. Thus if the application container is binding only to > Pod IP the traffic would not be forwarded to it. This was addressed in 1.10 > https://istio.io/latest/blog/2021/upcoming-networking-changes. However the > old behavior is still accessible through disabling the feature flag > PILOT_ENABLE_INBOUND_PASSTHROUGH. 
Request to remove it has had some push back > https://github.com/istio/istio/issues/37642. In current implementation Spark > K8s backend does not allow to pass bind address for driver > https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/DriverServiceFeatureStep.scala#L35 > however as part of this jira we want to allow passing of bind address even > in Kubernetes mode so long as the bind address is 0.0.0.0. This lets user > choose the behavior depending on the state of PILOT_ENABLE_INBOUND_PASSTHROUGH > in her Istio cluster. > h3. Better support for istio-proxy sidecar lifecycle management > In istio-enabled cluster istio-proxy sidecars would be auto-injected to > driver/executor pods. If the application is ephemeral then driver and > executor containers would exit, however istio-proxy container would continue > to run. This causes driver/executor pods to enter NotReady state. As part of > this jira we want ability to run a post stop cleanup after driver/executor > container is completed. Similarly we also want to add support for a pre start > up script, which can ensure for example that istio-sidecar is up before > executor/driver container gets started. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42411) Better support for Istio service mesh while running Spark on Kubernetes
[ https://issues.apache.org/jira/browse/SPARK-42411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Puneet updated SPARK-42411: --- Summary: Better support for Istio service mesh while running Spark on Kubernetes (was: Better support for Spark on Kubernetes while using Istio service mesh) > Better support for Istio service mesh while running Spark on Kubernetes > --- > > Key: SPARK-42411 > URL: https://issues.apache.org/jira/browse/SPARK-42411 > Project: Spark > Issue Type: New Feature > Components: Kubernetes >Affects Versions: 3.2.3 >Reporter: Puneet >Priority: Major > > h3. Support for Strict MTLS > In strict MTLS Peer Authentication Istio requires each pod to be associated > with a service identity (as this allows listeners to use the correct cert and > chain). Without the service identity communication goes through passthrough > cluster which is not permitted in strict mode. Community is still > investigating communication through IPs with strict MTLS > https://github.com/istio/istio/issues/37431#issuecomment-1412831780. Today > Spark backend creates a service record for driver however executor pods > register with pod ip with driver. In this model therefore, TLS handshake > would fail between driver and executor and also between executors. As part of > this jira we want to similarly add service records for the executor pods as > well. This can be achieved by adding an ExecutorServiceFeatureStep similar to > existing DriverServiceFeatureStep > h3. Allowing binding to all IPs > Before Istio 1.10 the istio-proxy sidecar was forwarding outside traffic to > localhost of the pod. Thus if the application container is binding only to > Pod IP the traffic would not be forwarded to it. This was addressed in 1.10 > https://istio.io/latest/blog/2021/upcoming-networking-changes. However the > old behavior is still accessible through disabling the feature flag > PILOT_ENABLE_INBOUND_PASSTHROUGH. 
Request to remove it has had some push back > https://github.com/istio/istio/issues/37642. In current implementation Spark > K8s backend does not allow to pass bind address for driver > https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/DriverServiceFeatureStep.scala#L35 > however as part of this jira we want to allow passing of bind address even > in Kubernetes mode so long as the bind address is 0.0.0.0. This lets user > choose the behavior depending on the state of PILOT_ENABLE_INBOUND_PASSTHROUGH > in her Istio cluster. > h3. Better support for istio-proxy sidecar lifecycle management > In istio-enabled cluster istio-proxy sidecars would be auto-injected to > driver/executor pods. If the application is ephemeral then driver and > executor containers would exit, however istio-proxy container would continue > to run. This causes driver/executor pods to enter NotReady state. As part of > this jira we want ability to run a post stop cleanup after driver/executor > container is completed. Similarly we also want to add support for a pre start > up script, which can ensure for example that istio-sidecar is up before > executor/driver container gets started. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42411) Add support for istio in strict MTLS PeerAuthentication
Puneet created SPARK-42411: -- Summary: Add support for istio in strict MTLS PeerAuthentication Key: SPARK-42411 URL: https://issues.apache.org/jira/browse/SPARK-42411 Project: Spark Issue Type: New Feature Components: Kubernetes Affects Versions: 3.2.3 Reporter: Puneet h3. Support for Strict MTLS In strict MTLS Peer Authentication Istio requires each pod to be associated with a service identity (as this allows listeners to use the correct cert and chain). Without the service identity communication goes through passthrough cluster which is not permitted in strict mode. Community is still investigating communication through IPs with strict MTLS https://github.com/istio/istio/issues/37431#issuecomment-1412831780. Today Spark backend creates a service record for driver however executor pods register with pod ip with driver. In this model therefore, TLS handshake would fail between driver and executor and also between executors. As part of this jira we want to similarly add service records for the executor pods as well. This can be achieved by adding an ExecutorServiceFeatureStep similar to existing DriverServiceFeatureStep h3. Allowing binding to all IPs Before Istio 1.10 the istio-proxy sidecar was forwarding outside traffic to localhost of the pod. Thus if the application container is binding only to Pod IP the traffic would not be forwarded to it. This was addressed in 1.10 https://istio.io/latest/blog/2021/upcoming-networking-changes. However the old behavior is still accessible through disabling the feature flag PILOT_ENABLE_INBOUND_PASSTHROUGH. Request to remove it has had some push back https://github.com/istio/istio/issues/37642. 
In current implementation Spark K8s backend does not allow to pass bind address for driver https://github.com/apache/spark/blob/master/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/DriverServiceFeatureStep.scala#L35 however as part of this jira we want to allow passing of bind address even in Kubernetes mode so long as the bind address is 0.0.0.0. This lets the user choose the behavior depending on the state of PILOT_ENABLE_INBOUND_PASSTHROUGH in her Istio cluster. h3. Better support for istio-proxy sidecar lifecycle management In istio-enabled cluster istio-proxy sidecars would be auto-injected to driver/executor pods. If the application is ephemeral then driver and executor containers would exit, however istio-proxy container would continue to run. This causes driver/executor pods to enter NotReady state. As part of this jira we want ability to run a post stop cleanup after driver/executor container is completed. Similarly we also want to add support for a pre start up script, which can ensure for example that istio-sidecar is up before executor/driver container gets started. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
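The bind-address relaxation proposed in this issue — Kubernetes mode accepts a driver bind address only when it is the wildcard — amounts to a one-line validity check. The sketch below is a hypothetical illustration of that rule, not the actual DriverServiceFeatureStep code:

```python
from typing import Optional

# Hypothetical check modeling the proposed rule: in Kubernetes mode a
# driver bind address may be supplied only if it is 0.0.0.0 (all IPs).

def validate_driver_bind_address(bind_address: Optional[str]) -> Optional[str]:
    if bind_address is None:
        return None  # nothing requested; keep today's behavior
    if bind_address == "0.0.0.0":
        # Binding to all IPs keeps inbound traffic reachable even when
        # PILOT_ENABLE_INBOUND_PASSTHROUGH is disabled in the mesh.
        return bind_address
    raise ValueError(
        f"Kubernetes mode would only accept the wildcard bind address, got {bind_address!r}"
    )


print(validate_driver_bind_address("0.0.0.0"))
# 0.0.0.0
```

Rejecting anything other than the wildcard preserves the existing service-based addressing while still letting users opt in to the old Istio forwarding behavior.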
[jira] [Assigned] (SPARK-41963) Different exception message in DataFrame.unpivot
[ https://issues.apache.org/jira/browse/SPARK-41963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-41963: Assignee: Takuya Ueshin (was: Hyukjin Kwon) > Different exception message in DataFrame.unpivot > > > Key: SPARK-41963 > URL: https://issues.apache.org/jira/browse/SPARK-41963 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Takuya Ueshin >Priority: Major > Fix For: 3.4.0 > > > Running {{test_parity_dataframe DataFrameParityTests.test_unpivot_negative}} > fails as below: > {code} > with self.subTest(desc="with no value columns"): > for values in [[], ()]: > with self.subTest(values=values): > with self.assertRaisesRegex( > Exception, # (AnalysisException, > SparkConnectException) > r".*\[UNPIVOT_REQUIRES_VALUE_COLUMNS] At least one > value column " > r"needs to be specified for UNPIVOT, all columns > specified as ids.*", > ): > > df.unpivot("id", values, "var", "val").collect() > E AssertionError: ".*\[UNPIVOT_REQUIRES_VALUE_COLUMNS] > At least one value column needs to be specified for UNPIVOT, all columns > specified as ids.*" does not match "[UNPIVOT_VALUE_DATA_TYPE_MISMATCH] > Unpivot value columns must share a least common type, some types do not: > ["BIGINT" (`int`), "DOUBLE" (`double`), "STRING" (`str`)] > E Plan: 'Unpivot ArraySeq(id#2947L), > List(List(int#2948L), List(double#2949), List(str#2950)), var, [val] > E +- Project [id#2939L AS id#2947L, int#2940L AS > int#2948L, double#2941 AS double#2949, str#2942 AS str#2950] > E +- LocalRelation [id#2939L, int#2940L, > double#2941, str#2942] > E " > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
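The failure quoted in this issue is an error-class mismatch: the parity test expects UNPIVOT_REQUIRES_VALUE_COLUMNS while Spark Connect raises UNPIVOT_VALUE_DATA_TYPE_MISMATCH. The mismatch can be reproduced with plain `re` (both strings abbreviated from the test output above):

```python
import re

# Pattern the test passes to assertRaisesRegex (abbreviated).
expected = r"\[UNPIVOT_REQUIRES_VALUE_COLUMNS\] At least one value column"

# Message actually raised under Spark Connect (abbreviated).
actual = (
    "[UNPIVOT_VALUE_DATA_TYPE_MISMATCH] Unpivot value columns must share "
    "a least common type, some types do not"
)

# assertRaisesRegex matches with re.search; no match here, so the test fails.
print(re.search(expected, actual))
# None
```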
[jira] [Commented] (SPARK-41715) Catch specific exceptions for both Spark Connect and PySpark
[ https://issues.apache.org/jira/browse/SPARK-41715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687655#comment-17687655 ] Apache Spark commented on SPARK-41715: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/39983 > Catch specific exceptions for both Spark Connect and PySpark > > > Key: SPARK-41715 > URL: https://issues.apache.org/jira/browse/SPARK-41715 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 3.4.0 > > > In python/pyspark/sql/tests/test_catalog.py, we should catch more specific > exceptions such as AnalysisException. The test is shared in both Spark > Connect and PySpark so we should figure out a way to share it. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41963) Different exception message in DataFrame.unpivot
[ https://issues.apache.org/jira/browse/SPARK-41963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-41963: Assignee: Hyukjin Kwon > Different exception message in DataFrame.unpivot > > > Key: SPARK-41963 > URL: https://issues.apache.org/jira/browse/SPARK-41963 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > Running {{test_parity_dataframe DataFrameParityTests.test_unpivot_negative}} > fails as below: > {code} > with self.subTest(desc="with no value columns"): > for values in [[], ()]: > with self.subTest(values=values): > with self.assertRaisesRegex( > Exception, # (AnalysisException, > SparkConnectException) > r".*\[UNPIVOT_REQUIRES_VALUE_COLUMNS] At least one > value column " > r"needs to be specified for UNPIVOT, all columns > specified as ids.*", > ): > > df.unpivot("id", values, "var", "val").collect() > E AssertionError: ".*\[UNPIVOT_REQUIRES_VALUE_COLUMNS] > At least one value column needs to be specified for UNPIVOT, all columns > specified as ids.*" does not match "[UNPIVOT_VALUE_DATA_TYPE_MISMATCH] > Unpivot value columns must share a least common type, some types do not: > ["BIGINT" (`int`), "DOUBLE" (`double`), "STRING" (`str`)] > E Plan: 'Unpivot ArraySeq(id#2947L), > List(List(int#2948L), List(double#2949), List(str#2950)), var, [val] > E +- Project [id#2939L AS id#2947L, int#2940L AS > int#2948L, double#2941 AS double#2949, str#2942 AS str#2950] > E +- LocalRelation [id#2939L, int#2940L, > double#2941, str#2942] > E " > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41963) Different exception message in DataFrame.unpivot
[ https://issues.apache.org/jira/browse/SPARK-41963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-41963. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39960 [https://github.com/apache/spark/pull/39960] > Different exception message in DataFrame.unpivot > > > Key: SPARK-41963 > URL: https://issues.apache.org/jira/browse/SPARK-41963 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.4.0 > > > Running {{test_parity_dataframe DataFrameParityTests.test_unpivot_negative}} > fails as below: > {code} > with self.subTest(desc="with no value columns"): > for values in [[], ()]: > with self.subTest(values=values): > with self.assertRaisesRegex( > Exception, # (AnalysisException, > SparkConnectException) > r".*\[UNPIVOT_REQUIRES_VALUE_COLUMNS] At least one > value column " > r"needs to be specified for UNPIVOT, all columns > specified as ids.*", > ): > > df.unpivot("id", values, "var", "val").collect() > E AssertionError: ".*\[UNPIVOT_REQUIRES_VALUE_COLUMNS] > At least one value column needs to be specified for UNPIVOT, all columns > specified as ids.*" does not match "[UNPIVOT_VALUE_DATA_TYPE_MISMATCH] > Unpivot value columns must share a least common type, some types do not: > ["BIGINT" (`int`), "DOUBLE" (`double`), "STRING" (`str`)] > E Plan: 'Unpivot ArraySeq(id#2947L), > List(List(int#2948L), List(double#2949), List(str#2950)), var, [val] > E +- Project [id#2939L AS id#2947L, int#2940L AS > int#2948L, double#2941 AS double#2949, str#2942 AS str#2950] > E +- LocalRelation [id#2939L, int#2940L, > double#2941, str#2942] > E " > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41715) Catch specific exceptions for both Spark Connect and PySpark
[ https://issues.apache.org/jira/browse/SPARK-41715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687653#comment-17687653 ] Apache Spark commented on SPARK-41715: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/39983 > Catch specific exceptions for both Spark Connect and PySpark > > > Key: SPARK-41715 > URL: https://issues.apache.org/jira/browse/SPARK-41715 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 3.4.0 > > > In python/pyspark/sql/tests/test_catalog.py, we should catch more specific > exceptions such as AnalysisException. The test is shared in both Spark > Connect and PySpark so we should figure out a way to share it. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40453) Improve error handling for GRPC server
[ https://issues.apache.org/jira/browse/SPARK-40453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687651#comment-17687651 ] Apache Spark commented on SPARK-40453: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/39983 > Improve error handling for GRPC server > -- > > Key: SPARK-40453 > URL: https://issues.apache.org/jira/browse/SPARK-40453 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.2.2 >Reporter: Martin Grund >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.4.0 > > > Right now the errors are handled very rudimentarily and do not produce proper > GRPC errors. This issue addresses the work needed to return proper errors. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41715) Catch specific exceptions for both Spark Connect and PySpark
[ https://issues.apache.org/jira/browse/SPARK-41715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687652#comment-17687652 ] Apache Spark commented on SPARK-41715: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/39983 > Catch specific exceptions for both Spark Connect and PySpark > > > Key: SPARK-41715 > URL: https://issues.apache.org/jira/browse/SPARK-41715 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 3.4.0 > > > In python/pyspark/sql/tests/test_catalog.py, we should catch more specific > exceptions such as AnalysisException. The test is shared in both Spark > Connect and PySpark so we should figure out a way to share it. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40453) Improve error handling for GRPC server
[ https://issues.apache.org/jira/browse/SPARK-40453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687650#comment-17687650 ] Apache Spark commented on SPARK-40453: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/39983 > Improve error handling for GRPC server > -- > > Key: SPARK-40453 > URL: https://issues.apache.org/jira/browse/SPARK-40453 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.2.2 >Reporter: Martin Grund >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.4.0 > > > Right now the errors are handled very rudimentarily and do not produce proper > GRPC errors. This issue addresses the work needed to return proper errors. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42410) Support Scala 2.12/2.13 tests in connect module
[ https://issues.apache.org/jira/browse/SPARK-42410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687643#comment-17687643 ] Apache Spark commented on SPARK-42410: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/39982 > Support Scala 2.12/2.13 tests in connect module > --- > > Key: SPARK-42410 > URL: https://issues.apache.org/jira/browse/SPARK-42410 > Project: Spark > Issue Type: Bug > Components: Connect, Tests >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Major > > {code} > $ build/sbt -Dscala.version=2.13.8 -Pscala-2.13 -Phadoop-3 assembly/package > "connect/test" > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42410) Support Scala 2.12/2.13 tests in connect module
[ https://issues.apache.org/jira/browse/SPARK-42410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42410: Assignee: Apache Spark > Support Scala 2.12/2.13 tests in connect module > --- > > Key: SPARK-42410 > URL: https://issues.apache.org/jira/browse/SPARK-42410 > Project: Spark > Issue Type: Bug > Components: Connect, Tests >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major > > {code} > $ build/sbt -Dscala.version=2.13.8 -Pscala-2.13 -Phadoop-3 assembly/package > "connect/test" > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42410) Support Scala 2.12/2.13 tests in connect module
[ https://issues.apache.org/jira/browse/SPARK-42410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42410: Assignee: (was: Apache Spark) > Support Scala 2.12/2.13 tests in connect module > --- > > Key: SPARK-42410 > URL: https://issues.apache.org/jira/browse/SPARK-42410 > Project: Spark > Issue Type: Bug > Components: Connect, Tests >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Major > > {code} > $ build/sbt -Dscala.version=2.13.8 -Pscala-2.13 -Phadoop-3 assembly/package > "connect/test" > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42410) Support Scala 2.12/2.13 tests in connect module
[ https://issues.apache.org/jira/browse/SPARK-42410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687642#comment-17687642 ] Apache Spark commented on SPARK-42410: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/39982 > Support Scala 2.12/2.13 tests in connect module > --- > > Key: SPARK-42410 > URL: https://issues.apache.org/jira/browse/SPARK-42410 > Project: Spark > Issue Type: Bug > Components: Connect, Tests >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Major > > {code} > $ build/sbt -Dscala.version=2.13.8 -Pscala-2.13 -Phadoop-3 assembly/package > "connect/test" > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42410) Support Scala 2.12/2.13 tests in connect module
Dongjoon Hyun created SPARK-42410: - Summary: Support Scala 2.12/2.13 tests in connect module Key: SPARK-42410 URL: https://issues.apache.org/jira/browse/SPARK-42410 Project: Spark Issue Type: Bug Components: Connect, Tests Affects Versions: 3.4.0 Reporter: Dongjoon Hyun {code} $ build/sbt -Dscala.version=2.13.8 -Pscala-2.13 -Phadoop-3 assembly/package "connect/test" {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42405) Better documentation of array_insert function
[ https://issues.apache.org/jira/browse/SPARK-42405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42405: Assignee: (was: Apache Spark) > Better documentation of array_insert function > - > > Key: SPARK-42405 > URL: https://issues.apache.org/jira/browse/SPARK-42405 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 3.4.0 >Reporter: Daniel Davies >Priority: Trivial > > See the following thread for discussion: > https://github.com/apache/spark/pull/38867#discussion_r1097054656 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42405) Better documentation of array_insert function
[ https://issues.apache.org/jira/browse/SPARK-42405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687637#comment-17687637 ] Apache Spark commented on SPARK-42405: -- User 'Daniel-Davies' has created a pull request for this issue: https://github.com/apache/spark/pull/39975 > Better documentation of array_insert function > - > > Key: SPARK-42405 > URL: https://issues.apache.org/jira/browse/SPARK-42405 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 3.4.0 >Reporter: Daniel Davies >Priority: Trivial > > See the following thread for discussion: > https://github.com/apache/spark/pull/38867#discussion_r1097054656 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42405) Better documentation of array_insert function
[ https://issues.apache.org/jira/browse/SPARK-42405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42405: Assignee: Apache Spark > Better documentation of array_insert function > - > > Key: SPARK-42405 > URL: https://issues.apache.org/jira/browse/SPARK-42405 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 3.4.0 >Reporter: Daniel Davies >Assignee: Apache Spark >Priority: Trivial > > See the following thread for discussion: > https://github.com/apache/spark/pull/38867#discussion_r1097054656 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42409) Upgrade ZSTD-JNI to 1.5.4-1
[ https://issues.apache.org/jira/browse/SPARK-42409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-42409: - Assignee: Yang Jie > Upgrade ZSTD-JNI to 1.5.4-1 > --- > > Key: SPARK-42409 > URL: https://issues.apache.org/jira/browse/SPARK-42409 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42409) Upgrade ZSTD-JNI to 1.5.4-1
[ https://issues.apache.org/jira/browse/SPARK-42409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-42409. --- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 39981 [https://github.com/apache/spark/pull/39981] > Upgrade ZSTD-JNI to 1.5.4-1 > --- > > Key: SPARK-42409 > URL: https://issues.apache.org/jira/browse/SPARK-42409 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Fix For: 3.5.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42401) Incorrect results or NPE when inserting null value into array using array_insert/array_append
[ https://issues.apache.org/jira/browse/SPARK-42401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bruce Robbins updated SPARK-42401: -- Summary: Incorrect results or NPE when inserting null value into array using array_insert/array_append (was: Incorrect results or NPE when inserting null value using array_insert/array_append) > Incorrect results or NPE when inserting null value into array using > array_insert/array_append > - > > Key: SPARK-42401 > URL: https://issues.apache.org/jira/browse/SPARK-42401 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0, 3.5.0 >Reporter: Bruce Robbins >Priority: Major > Labels: correctness > > Example: > {noformat} > create or replace temp view v1 as > select * from values > (array(1, 2, 3, 4), 5, 5), > (array(1, 2, 3, 4), 5, null) > as v1(col1,col2,col3); > select array_insert(col1, col2, col3) from v1; > {noformat} > This produces an incorrect result: > {noformat} > [1,2,3,4,5] > [1,2,3,4,0] <== should be [1,2,3,4,null] > {noformat} > A more succinct example: > {noformat} > select array_insert(array(1, 2, 3, 4), 5, cast(null as int)); > {noformat} > This also produces an incorrect result: > {noformat} > [1,2,3,4,0] <== should be [1,2,3,4,null] > {noformat} > Another example: > {noformat} > create or replace temp view v1 as > select * from values > (array('1', '2', '3', '4'), 5, '5'), > (array('1', '2', '3', '4'), 5, null) > as v1(col1,col2,col3); > select array_insert(col1, col2, col3) from v1; > {noformat} > The above query throws a {{NullPointerException}}: > {noformat} > 23/02/10 11:08:05 ERROR SparkSQLDriver: Failed in [select array_insert(col1, > col2, col3) from v1] > java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > at > 
org.apache.spark.sql.execution.LocalTableScanExec.$anonfun$unsafeRows$1(LocalTableScanExec.scala:44) > {noformat} > {{array_append}} has the same issue: > {noformat} > spark-sql> select array_append(array(1, 2, 3, 4), cast(null as int)); > [1,2,3,4,0] <== should be [1,2,3,4,null] > Time taken: 3.679 seconds, Fetched 1 row(s) > spark-sql> select array_append(array('1', '2', '3', '4'), cast(null as > string)); > 23/02/10 11:13:36 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1) > java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.codegen.UnsafeWriter.write(UnsafeWriter.java:110) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.project_doConsume_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > {noformat}
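The behaviour the report calls for — an inserted or appended null must survive as a null element rather than be coerced to the type's zero value — can be modelled in plain Python. This is a hypothetical sketch of the intended semantics (positive, 1-based positions only), not Spark's implementation:

```python
def array_insert(arr, pos, value):
    """Minimal model of SQL array_insert with 1-based positive positions.

    A None value must be preserved as None, never replaced by 0."""
    if pos <= 0:
        raise ValueError("this sketch models positive 1-based positions only")
    out = list(arr)
    idx = pos - 1
    while len(out) < idx:  # pad with nulls when inserting past the end
        out.append(None)
    out.insert(idx, value)
    return out


def array_append(arr, value):
    """Minimal model of SQL array_append: an appended null stays null."""
    return list(arr) + [value]


print(array_insert([1, 2, 3, 4], 5, None))  # [1, 2, 3, 4, None], not [1, 2, 3, 4, 0]
print(array_append(["1", "2", "3", "4"], None))
```

The examples in the report correspond to the first call above: the expected result carries the null through instead of substituting the element type's default.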
[jira] [Resolved] (SPARK-42400) Code clean up in org.apache.spark.storage
[ https://issues.apache.org/jira/browse/SPARK-42400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-42400. -- Fix Version/s: 3.5.0 Assignee: Khalid Mammadov Resolution: Fixed Resolved by https://github.com/apache/spark/pull/39932 > Code clean up in org.apache.spark.storage > - > > Key: SPARK-42400 > URL: https://issues.apache.org/jira/browse/SPARK-42400 > Project: Spark > Issue Type: Improvement > Components: Block Manager >Affects Versions: 3.4.0 >Reporter: Khalid Mammadov >Assignee: Khalid Mammadov >Priority: Trivial > Fix For: 3.5.0 > >
[jira] [Updated] (SPARK-42400) Code clean up in org.apache.spark.storage
[ https://issues.apache.org/jira/browse/SPARK-42400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-42400: - Priority: Trivial (was: Major) > Code clean up in org.apache.spark.storage > - > > Key: SPARK-42400 > URL: https://issues.apache.org/jira/browse/SPARK-42400 > Project: Spark > Issue Type: Improvement > Components: Block Manager >Affects Versions: 3.4.0 >Reporter: Khalid Mammadov >Priority: Trivial >
[jira] [Resolved] (SPARK-42312) Assign name to _LEGACY_ERROR_TEMP_0042
[ https://issues.apache.org/jira/browse/SPARK-42312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-42312. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39951 [https://github.com/apache/spark/pull/39951] > Assign name to _LEGACY_ERROR_TEMP_0042 > -- > > Key: SPARK-42312 > URL: https://issues.apache.org/jira/browse/SPARK-42312 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > Fix For: 3.4.0 > >
[jira] [Assigned] (SPARK-42312) Assign name to _LEGACY_ERROR_TEMP_0042
[ https://issues.apache.org/jira/browse/SPARK-42312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-42312: Assignee: Haejoon Lee > Assign name to _LEGACY_ERROR_TEMP_0042 > -- > > Key: SPARK-42312 > URL: https://issues.apache.org/jira/browse/SPARK-42312 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major >
[jira] [Commented] (SPARK-40678) JSON conversion of ArrayType is not properly supported in Spark 3.2/2.13
[ https://issues.apache.org/jira/browse/SPARK-40678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17687573#comment-17687573 ] Wei Guo commented on SPARK-40678: - Fixed by PR 38154 https://github.com/apache/spark/pull/38154 > JSON conversion of ArrayType is not properly supported in Spark 3.2/2.13 > > > Key: SPARK-40678 > URL: https://issues.apache.org/jira/browse/SPARK-40678 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 3.2.0 >Reporter: Cédric Chantepie >Priority: Major > > In Spark 3.2 (Scala 2.13), values with {{ArrayType}} are no longer properly > supported with JSON; e.g. > {noformat} > import org.apache.spark.sql.SparkSession > case class KeyValue(key: String, value: Array[Byte]) > val spark = > SparkSession.builder().master("local[1]").appName("test").getOrCreate() > import spark.implicits._ > val df = Seq(Array(KeyValue("foo", "bar".getBytes))).toDF() > df.foreach(r => println(r.json)) > {noformat} > Expected: > {noformat} > [{foo, bar}] > {noformat} > Encountered: > {noformat} > java.lang.IllegalArgumentException: Failed to convert value > ArraySeq([foo,[B@dcdb68f]) (class of class > scala.collection.mutable.ArraySeq$ofRef}) with the type of > ArrayType(Seq(StructField(key,StringType,false), > StructField(value,BinaryType,false)),true) to JSON. > at org.apache.spark.sql.Row.toJson$1(Row.scala:604) > at org.apache.spark.sql.Row.jsonValue(Row.scala:613) > at org.apache.spark.sql.Row.jsonValue$(Row.scala:552) > at > org.apache.spark.sql.catalyst.expressions.GenericRow.jsonValue(rows.scala:166) > at org.apache.spark.sql.Row.json(Row.scala:535) > at org.apache.spark.sql.Row.json$(Row.scala:535) > at > org.apache.spark.sql.catalyst.expressions.GenericRow.json(rows.scala:166) > {noformat}
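The failure above is likely a type-dispatch problem: in Scala 2.13 the default `Seq` alias is `scala.collection.immutable.Seq`, which `mutable.ArraySeq` does not extend, so array values fell through to the error case; matching the broader `scala.collection.Seq` covers both. The general idea can be sketched in Python (a hypothetical analogue, not Spark's code): dispatch on the abstract `Sequence` interface rather than one concrete type:

```python
import json
from collections.abc import Sequence

def to_json_value(value):
    """Recursively convert a row value into a JSON-serialisable structure.

    Checking isinstance(value, Sequence) -- the abstract interface -- instead
    of isinstance(value, list) mirrors matching the broader collection type:
    tuples, arrays, and other sequence flavours all take the same branch."""
    if isinstance(value, bytes):
        return value.decode("utf-8")  # render binary as text for this example
    if isinstance(value, dict):       # struct-like values
        return {k: to_json_value(v) for k, v in value.items()}
    if isinstance(value, Sequence) and not isinstance(value, str):
        return [to_json_value(v) for v in value]
    return value

# A tuple stands in for the ArraySeq that a concrete-type check would miss.
row_value = ({"key": "foo", "value": b"bar"},)
print(json.dumps(to_json_value(row_value)))  # [{"key": "foo", "value": "bar"}]
```

Had the dispatch tested only `isinstance(value, list)`, the tuple would fall through to the fallback case, analogous to the `IllegalArgumentException` in the report.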
[jira] [Assigned] (SPARK-42409) Upgrade ZSTD-JNI to 1.5.4-1
[ https://issues.apache.org/jira/browse/SPARK-42409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42409: Assignee: (was: Apache Spark) > Upgrade ZSTD-JNI to 1.5.4-1 > --- > > Key: SPARK-42409 > URL: https://issues.apache.org/jira/browse/SPARK-42409 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Yang Jie >Priority: Major >