[jira] [Comment Edited] (SPARK-33826) InsertIntoHiveTable generate HDFS file with invalid user

2024-04-23 Thread Shawn Lavelle (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840268#comment-17840268
 ] 

Shawn Lavelle edited comment on SPARK-33826 at 4/24/24 2:41 AM:


[~angerszhuuu]  RIK - Replacement In Kind. The ODBC / Hive Proxy User should 
have remained.  I use it in a custom data source to avoid elevated-permission 
violations. Will your PR restore that?  I hope to test it out in a few days 
here.


was (Author: azeroth2b):
[~angerszhuuu]  RIK - Replacement In Kind. The ODBC / Hive Proxy User should 
have remained accessible from a data source. Will your PR restore that?  I hope 
to test it out in a few days here.

> InsertIntoHiveTable generate HDFS file with invalid user
> 
>
> Key: SPARK-33826
> URL: https://issues.apache.org/jira/browse/SPARK-33826
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2, 3.0.0
>Reporter: Zhang Jianguo
>Priority: Minor
>
> *Arch:* Hive on Spark.
>  
> *Version:* Spark 2.3.2
>  
> *Conf:*
> Enable user impersonation
> hive.server2.enable.doAs=true
>  
> *Scenario:*
> Thriftserver is running as login user A, and tasks run as user A too.
> A client executes SQL as user B.
>  
> Data generated by the SQL "insert into TABLE  \[tbl\] select XXX from ..." is 
> written to HDFS on the executor, and the executor doesn't know about user B.
>  
> *{color:#de350b}So the owner of the file written to HDFS will be user A when 
> it should be user B.{color}*
>  
> I also checked the implementation in Spark 3.0.0; it could have the same issue.
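
For context on the impersonation gap described above, here is a minimal sketch 
(illustrative only, not the proposed fix) of how an HDFS write can be performed 
under the end user's identity with Hadoop's proxy-user API; the user names and 
output path are placeholders taken from the scenario:

{code:scala}
import java.security.PrivilegedExceptionAction

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.security.UserGroupInformation

// Run the write as proxy user "B" on top of the service's login user "A",
// so the file created on HDFS is owned by B rather than A.
val proxyUgi = UserGroupInformation.createProxyUser("B", UserGroupInformation.getLoginUser)
proxyUgi.doAs(new PrivilegedExceptionAction[Unit] {
  override def run(): Unit = {
    val fs = FileSystem.get(new Configuration())
    val out = fs.create(new Path("/tmp/tbl/part-00000"))  // placeholder path
    try out.write("example row".getBytes("UTF-8")) finally out.close()
  }
})
{code}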



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33826) InsertIntoHiveTable generate HDFS file with invalid user

2024-04-23 Thread Shawn Lavelle (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840268#comment-17840268
 ] 

Shawn Lavelle commented on SPARK-33826:
---

[~angerszhuuu]  RIK - Replacement In Kind. The ODBC / Hive Proxy User should 
have remained accessible from a data source. Will your PR restore that?  I hope 
to test it out in a few days here.

> InsertIntoHiveTable generate HDFS file with invalid user
> 
>
> Key: SPARK-33826
> URL: https://issues.apache.org/jira/browse/SPARK-33826
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2, 3.0.0
>Reporter: Zhang Jianguo
>Priority: Minor
>
> *Arch:* Hive on Spark.
>  
> *Version:* Spark 2.3.2
>  
> *Conf:*
> Enable user impersonation
> hive.server2.enable.doAs=true
>  
> *Scenario:*
> Thriftserver is running as login user A, and tasks run as user A too.
> A client executes SQL as user B.
>  
> Data generated by the SQL "insert into TABLE  \[tbl\] select XXX from ..." is 
> written to HDFS on the executor, and the executor doesn't know about user B.
>  
> *{color:#de350b}So the owner of the file written to HDFS will be user A when 
> it should be user B.{color}*
>  
> I also checked the implementation in Spark 3.0.0; it could have the same issue.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-47952) Support retrieving the real SparkConnectService GRPC address and port programmatically when running on Yarn

2024-04-23 Thread TakawaAkirayo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840265#comment-17840265
 ] 

TakawaAkirayo commented on SPARK-47952:
---

Hi [~gurwls223]

I'd like to understand the design approach of SparkConnect and whether there 
are existing community solutions that implement the same requirements as mine.

As I understand it, end users need to start the SparkConnect server themselves 
before using the SparkConnect client to connect. If I'm correct, users 
(individuals or organizations) need to decide where and how to start the 
SparkConnect server, either by themselves or through centrally provisioned 
servers within the enterprise. 

If that's the case, does my approach in the PR align with the expected usage 
pattern if I want to leverage the Hadoop cluster for scaling the SparkConnect 
Server?

> Support retrieving the real SparkConnectService GRPC address and port 
> programmatically when running on Yarn
> ---
>
> Key: SPARK-47952
> URL: https://issues.apache.org/jira/browse/SPARK-47952
> Project: Spark
>  Issue Type: Story
>  Components: Connect
>Affects Versions: 4.0.0
>Reporter: TakawaAkirayo
>Priority: Minor
>  Labels: pull-request-available
>
> 1. User Story:
> Our data analysts and data scientists use Jupyter notebooks provisioned on 
> Kubernetes (k8s) with limited CPU/memory resources to run spark-shell/pyspark 
> in the terminal via Yarn client mode. However, Yarn client mode consumes 
> significant local memory if the job is heavy, and the total resource pool of 
> k8s for notebooks is limited. To leverage the abundant resources of our 
> Hadoop cluster for scalability purposes, we aim to utilize SparkConnect. This 
> allows the driver to run on Yarn with the SparkConnectService started, and a 
> SparkConnect client to connect to that remote driver.
> To provide a seamless experience with one-command startup for both server and 
> client, we've wrapped the following processes in one script:
> 1) Start a local coordinator server (implemented by us, not in this PR) with 
> a specified port.
> 2) Start the SparkConnectServer by spark-submit via Yarn cluster mode with 
> user-supplied Spark configurations and the local coordinator server's address 
> and port. Append an additional listener class in the configuration so that the 
> SparkConnectService calls back to the coordinator server with its actual 
> address and port on Yarn.
> 3) Wait for the coordinator server to receive the address callback from the 
> SparkConnectService on Yarn and export the real address.
> 4) Start the client (pyspark --remote) with the remote address.
> Finally, a remote SparkConnect server is started on Yarn with a local 
> SparkConnect client connected. Users no longer need to start the server 
> beforehand and then connect to the remote server after manually looking up its 
> address on Yarn.
> 2. Problem statement of this change:
> 1) The specified port for the SparkConnectService GRPC server might be 
> occupied on the node of the Hadoop cluster. To increase the success rate of 
> startup, it needs to retry on conflicts rather than fail directly (a minimal 
> retry sketch follows below).
> 2) Because the final bound port could be uncertain due to #1 and the 
> remote address is unpredictable on Yarn, we need to retrieve the address and 
> port programmatically and inject it automatically at the start of `pyspark 
> --remote`. The SparkConnectService needs to communicate its location back to 
> the launcher side.
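
To illustrate the port-conflict retry mentioned in problem 1) above, here is a 
minimal sketch using a plain ServerSocket (not the actual SparkConnectService 
GRPC bootstrap; the helper name, attempt count, and example port are assumptions):

{code:scala}
import java.net.{BindException, ServerSocket}

// Try the preferred port first and fall back to the next ports on bind conflicts,
// instead of failing the whole startup on the first occupied port.
def bindWithRetry(preferredPort: Int, maxAttempts: Int = 10): ServerSocket = {
  var bound: Option[ServerSocket] = None
  var attempt = 0
  while (bound.isEmpty) {
    try {
      bound = Some(new ServerSocket(preferredPort + attempt))
    } catch {
      case e: BindException =>
        attempt += 1
        if (attempt >= maxAttempts) throw e  // give up after maxAttempts conflicts
    }
  }
  bound.get
}

val socket = bindWithRetry(15002)  // 15002 is only an example port
println(s"Bound on port ${socket.getLocalPort}; report this back to the coordinator")
{code}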



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47965) Avoid orNull in TypedConfigBuilder

2024-04-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-47965:
-
Issue Type: Improvement  (was: Bug)

> Avoid orNull in TypedConfigBuilder
> --
>
> Key: SPARK-47965
> URL: https://issues.apache.org/jira/browse/SPARK-47965
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> Configuration values/keys cannot be null. We should fix:
> {code}
> diff --git 
> a/core/src/main/scala/org/apache/spark/internal/config/ConfigBuilder.scala 
> b/core/src/main/scala/org/apache/spark/internal/config/ConfigBuilder.scala
> index 1f19e9444d38..d06535722625 100644
> --- a/core/src/main/scala/org/apache/spark/internal/config/ConfigBuilder.scala
> +++ b/core/src/main/scala/org/apache/spark/internal/config/ConfigBuilder.scala
> @@ -94,7 +94,7 @@ private[spark] class TypedConfigBuilder[T](
>import ConfigHelpers._
>def this(parent: ConfigBuilder, converter: String => T) = {
> -this(parent, converter, Option(_).map(_.toString).orNull)
> +this(parent, converter, { v: T => v.toString })
>}
>/** Apply a transformation to the user-provided values of the config 
> entry. */
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47965) Avoid orNull in TypedConfigBuilder and OptionalConfigEntry

2024-04-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-47965:
-
Summary: Avoid orNull in TypedConfigBuilder and OptionalConfigEntry  (was: 
Avoid orNull in TypedConfigBuilder)

> Avoid orNull in TypedConfigBuilder and OptionalConfigEntry
> --
>
> Key: SPARK-47965
> URL: https://issues.apache.org/jira/browse/SPARK-47965
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>  Labels: pull-request-available
>
> Configuration values/keys cannot be null. We should fix:
> {code}
> diff --git 
> a/core/src/main/scala/org/apache/spark/internal/config/ConfigBuilder.scala 
> b/core/src/main/scala/org/apache/spark/internal/config/ConfigBuilder.scala
> index 1f19e9444d38..d06535722625 100644
> --- a/core/src/main/scala/org/apache/spark/internal/config/ConfigBuilder.scala
> +++ b/core/src/main/scala/org/apache/spark/internal/config/ConfigBuilder.scala
> @@ -94,7 +94,7 @@ private[spark] class TypedConfigBuilder[T](
>import ConfigHelpers._
>def this(parent: ConfigBuilder, converter: String => T) = {
> -this(parent, converter, Option(_).map(_.toString).orNull)
> +this(parent, converter, { v: T => v.toString })
>}
>/** Apply a transformation to the user-provided values of the config 
> entry. */
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47965) Avoid orNull in TypedConfigBuilder

2024-04-23 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-47965:
-
Priority: Minor  (was: Major)

> Avoid orNull in TypedConfigBuilder
> --
>
> Key: SPARK-47965
> URL: https://issues.apache.org/jira/browse/SPARK-47965
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> Configuration values/keys cannot be null. We should fix:
> {code}
> diff --git 
> a/core/src/main/scala/org/apache/spark/internal/config/ConfigBuilder.scala 
> b/core/src/main/scala/org/apache/spark/internal/config/ConfigBuilder.scala
> index 1f19e9444d38..d06535722625 100644
> --- a/core/src/main/scala/org/apache/spark/internal/config/ConfigBuilder.scala
> +++ b/core/src/main/scala/org/apache/spark/internal/config/ConfigBuilder.scala
> @@ -94,7 +94,7 @@ private[spark] class TypedConfigBuilder[T](
>import ConfigHelpers._
>def this(parent: ConfigBuilder, converter: String => T) = {
> -this(parent, converter, Option(_).map(_.toString).orNull)
> +this(parent, converter, { v: T => v.toString })
>}
>/** Apply a transformation to the user-provided values of the config 
> entry. */
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47965) Avoid orNull in TypedConfigBuilder

2024-04-23 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-47965:


 Summary: Avoid orNull in TypedConfigBuilder
 Key: SPARK-47965
 URL: https://issues.apache.org/jira/browse/SPARK-47965
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: Hyukjin Kwon


Configuration values/keys cannot be null. We should fix:

{code}
diff --git 
a/core/src/main/scala/org/apache/spark/internal/config/ConfigBuilder.scala 
b/core/src/main/scala/org/apache/spark/internal/config/ConfigBuilder.scala
index 1f19e9444d38..d06535722625 100644
--- a/core/src/main/scala/org/apache/spark/internal/config/ConfigBuilder.scala
+++ b/core/src/main/scala/org/apache/spark/internal/config/ConfigBuilder.scala
@@ -94,7 +94,7 @@ private[spark] class TypedConfigBuilder[T](
   import ConfigHelpers._

   def this(parent: ConfigBuilder, converter: String => T) = {
-this(parent, converter, Option(_).map(_.toString).orNull)
+this(parent, converter, { v: T => v.toString })
   }

   /** Apply a transformation to the user-provided values of the config entry. 
*/
{code}
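
As a small, purely illustrative sketch (not Spark code; the value names are made 
up) of why the orNull converter is undesirable: it silently maps a null input to 
a null string, whereas the replacement assumes the value is never null:

{code:scala}
// Old behaviour: a null slips through the stringifier unnoticed.
val oldStringify: Any => String = v => Option(v).map(_.toString).orNull
// New behaviour: the value is assumed to be non-null, matching the invariant
// that configuration values/keys cannot be null.
val newStringify: Int => String = v => v.toString

assert(oldStringify(null) == null)  // a null value would propagate silently
assert(oldStringify(42) == "42")
assert(newStringify(42) == "42")
{code}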



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47965) Avoid orNull in TypedConfigBuilder

2024-04-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47965:
---
Labels: pull-request-available  (was: )

> Avoid orNull in TypedConfigBuilder
> --
>
> Key: SPARK-47965
> URL: https://issues.apache.org/jira/browse/SPARK-47965
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>  Labels: pull-request-available
>
> Configuration values/keys cannot be null. We should fix:
> {code}
> diff --git 
> a/core/src/main/scala/org/apache/spark/internal/config/ConfigBuilder.scala 
> b/core/src/main/scala/org/apache/spark/internal/config/ConfigBuilder.scala
> index 1f19e9444d38..d06535722625 100644
> --- a/core/src/main/scala/org/apache/spark/internal/config/ConfigBuilder.scala
> +++ b/core/src/main/scala/org/apache/spark/internal/config/ConfigBuilder.scala
> @@ -94,7 +94,7 @@ private[spark] class TypedConfigBuilder[T](
>import ConfigHelpers._
>def this(parent: ConfigBuilder, converter: String => T) = {
> -this(parent, converter, Option(_).map(_.toString).orNull)
> +this(parent, converter, { v: T => v.toString })
>}
>/** Apply a transformation to the user-provided values of the config 
> entry. */
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47946) Nested field's nullable value could be invalid after extracted using GetStructField

2024-04-23 Thread Junyoung Cho (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junyoung Cho updated SPARK-47946:
-
Description: 
I got an error when appending to a table using DataFrameWriterV2.

The error occurred in TableOutputResolver.checkNullability. This error 
occurs when the data type of the schema is the same, but the order of the 
fields is different.

I found that GetStructField.nullable returns an unexpected result.
{code:java}
override def nullable: Boolean = child.nullable || 
childSchema(ordinal).nullable {code}
Even if the nested field is not nullable, it returns true when the parent 
struct is nullable.
||Parent nullability||Child nullability||Result||
|true|true|true|
|{color:#ff}true{color}|{color:#ff}false{color}|{color:#ff}true{color}|
|{color:#172b4d}false{color}|{color:#172b4d}true{color}|{color:#172b4d}true{color}|
|false|false|false|

I think the logic should be changed to use just the child's nullability, because 
both the parent and the child should be nullable for the field to be considered 
nullable.
{code:java}
override def nullable: Boolean = childSchema(ordinal).nullable  {code}

I want to check whether the current logic is reasonable, or whether my suggestion 
could cause other side effects.

  was:
I got an error when appending to a table using DataFrameWriterV2.

The error occurred in TableOutputResolver.checkNullability. This error 
occurs when the data type of the schema is the same, but the order of the 
fields is different.

I found that GetStructField.nullable returns an unexpected result.
{code:java}
override def nullable: Boolean = child.nullable || 
childSchema(ordinal).nullable {code}
Even if the nested field is not nullable, it returns true when the parent 
struct is nullable.
||Parent nullability||Child nullability||Result||
|true|true|true|
|{color:#ff}true{color}|{color:#ff}false{color}|{color:#ff}true{color}|
|{color:#ff}false{color}|{color:#ff}true{color}|{color:#ff}true{color}|
|false|false|false|

I think the logic should be changed to an AND operation, because both the parent 
and the child should be nullable for the field to be considered nullable.
{code:java}
override def nullable: Boolean = child.nullable || 
childSchema(ordinal).nullable  {code}

I want to check whether the current logic is reasonable, or whether my suggestion 
could cause other side effects.


> Nested field's nullable value could be invalid after extracted using 
> GetStructField
> ---
>
> Key: SPARK-47946
> URL: https://issues.apache.org/jira/browse/SPARK-47946
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 3.4.2
>Reporter: Junyoung Cho
>Priority: Major
>
> I got an error when appending to a table using DataFrameWriterV2.
> The error occurred in TableOutputResolver.checkNullability. This error 
> occurs when the data type of the schema is the same, but the order of the 
> fields is different.
> I found that GetStructField.nullable returns an unexpected result.
> {code:java}
> override def nullable: Boolean = child.nullable || 
> childSchema(ordinal).nullable {code}
> Even if the nested field is not nullable, it returns true when the 
> parent struct is nullable.
> ||Parent nullability||Child nullability||Result||
> |true|true|true|
> |{color:#ff}true{color}|{color:#ff}false{color}|{color:#ff}true{color}|
> |{color:#172b4d}false{color}|{color:#172b4d}true{color}|{color:#172b4d}true{color}|
> |false|false|false|
>  
> I think the logic should be changed to use just the child's nullability, because 
> both the parent and the child should be nullable for the field to be considered 
> nullable.
>  
> {code:java}
> override def nullable: Boolean = childSchema(ordinal).nullable  {code}
>  
> I want to check whether the current logic is reasonable, or whether my 
> suggestion could cause other side effects.
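
To make the nullability table above concrete, here is a minimal sketch of the 
reported behaviour, assuming a local SparkSession (illustrative only; the schema 
and column names are made up for the example):

{code:scala}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

val spark = SparkSession.builder().master("local[1]").appName("nullable-check").getOrCreate()

// Nullable parent struct containing a non-nullable child field (row 2 of the table).
val schema = StructType(Seq(
  StructField("parent", StructType(Seq(
    StructField("child", IntegerType, nullable = false)
  )), nullable = true)
))
val df = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)

// Extracting the nested field goes through GetStructField; on the affected versions
// the extracted column reports nullable = true even though "child" is declared
// non-nullable, because the parent struct is nullable.
println(df.select("parent.child").schema.head.nullable)

spark.stop()
{code}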



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47946) Nested field's nullable value could be invalid after extracted using GetStructField

2024-04-23 Thread Junyoung Cho (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junyoung Cho updated SPARK-47946:
-
Description: 
I got an error when appending to a table using DataFrameWriterV2.

The error occurred in TableOutputResolver.checkNullability. This error 
occurs when the data type of the schema is the same, but the order of the 
fields is different.

I found that GetStructField.nullable returns an unexpected result.
{code:java}
override def nullable: Boolean = child.nullable || 
childSchema(ordinal).nullable {code}
Even if the nested field is not nullable, it returns true when the parent 
struct is nullable.
||Parent nullability||Child nullability||Result||
|true|true|true|
|{color:#ff}true{color}|{color:#ff}false{color}|{color:#ff}true{color}|
|{color:#ff}false{color}|{color:#ff}true{color}|{color:#ff}true{color}|
|false|false|false|

I think the logic should be changed to an AND operation, because both the parent 
and the child should be nullable for the field to be considered nullable.
{code:java}
override def nullable: Boolean = child.nullable || 
childSchema(ordinal).nullable  {code}

I want to check whether the current logic is reasonable, or whether my suggestion 
could cause other side effects.

  was:
I got an error when appending to a table using DataFrameWriterV2.

The error occurred in TableOutputResolver.checkNullability. This error 
occurs when the data type of the schema is the same, but the order of the 
fields is different.

I found that GetStructField.nullable returns an unexpected result.
{code:java}
override def nullable: Boolean = child.nullable || 
childSchema(ordinal).nullable {code}
Even if the nested field is not nullable, it returns true when the parent 
struct is nullable.
||Parent nullability||Child nullability||Result||
|true|true|true|
|{color:#FF}true{color}|{color:#FF}false{color}|{color:#FF}true{color}|
|{color:#FF}false{color}|{color:#FF}true{color}|{color:#FF}true{color}|
|false|false|false|

I think the logic should be changed to an AND operation, because both the parent 
and the child should be nullable for the field to be considered nullable.

I want to check whether the current logic is reasonable, or whether my suggestion 
could cause other side effects.


> Nested field's nullable value could be invalid after extracted using 
> GetStructField
> ---
>
> Key: SPARK-47946
> URL: https://issues.apache.org/jira/browse/SPARK-47946
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 3.4.2
>Reporter: Junyoung Cho
>Priority: Major
>
> I got an error when appending to a table using DataFrameWriterV2.
> The error occurred in TableOutputResolver.checkNullability. This error 
> occurs when the data type of the schema is the same, but the order of the 
> fields is different.
> I found that GetStructField.nullable returns an unexpected result.
> {code:java}
> override def nullable: Boolean = child.nullable || 
> childSchema(ordinal).nullable {code}
> Even if the nested field is not nullable, it returns true when the 
> parent struct is nullable.
> ||Parent nullability||Child nullability||Result||
> |true|true|true|
> |{color:#ff}true{color}|{color:#ff}false{color}|{color:#ff}true{color}|
> |{color:#ff}false{color}|{color:#ff}true{color}|{color:#ff}true{color}|
> |false|false|false|
>  
> I think the logic should be changed to an AND operation, because both the parent 
> and the child should be nullable for the field to be considered nullable.
>  
> {code:java}
> override def nullable: Boolean = child.nullable || 
> childSchema(ordinal).nullable  {code}
>  
> I want to check whether the current logic is reasonable, or whether my 
> suggestion could cause other side effects.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47583) SQL core: Migrate logError with variables to structured logging framework

2024-04-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47583?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47583:
---
Labels: pull-request-available  (was: )

> SQL core: Migrate logError with variables to structured logging framework
> -
>
> Key: SPARK-47583
> URL: https://issues.apache.org/jira/browse/SPARK-47583
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Gengliang Wang
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-47597) Streaming: Migrate logInfo with variables to structured logging framework

2024-04-23 Thread Daniel (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840213#comment-17840213
 ] 

Daniel edited comment on SPARK-47597 at 4/23/24 9:44 PM:
-

spark$ grep logInfo sql/core/src/main/* -R | cut -d ':' -f 1 | sort -u

Under sql/core/src/main/scala/org/apache/spark/sql/execution/streaming:

(X) AsyncProgressTrackingMicroBatchExecution.scala
(X) CheckpointFileManager.scala
(X) CompactibleFileStreamLog.scala
(X) FileStreamSink.scala
(X) FileStreamSinkLog.scala
(X) FileStreamSource.scala
(X) HDFSMetadataLog.scala
(X) IncrementalExecution.scala
(X) ManifestFileCommitProtocol.scala
(X) MetadataLogFileIndex.scala
(X) MicroBatchExecution.scala
(X) ProgressReporter.scala
(X) ResolveWriteToStream.scala
(X) StreamExecution.scala
(X) WatermarkTracker.scala
(X) continuous/ContinuousExecution.scala
(X) continuous/ContinuousQueuedDataReader.scala
(X) continuous/ContinuousWriteRDD.scala
(X) continuous/EpochCoordinator.scala
(X) continuous/WriteToContinuousDataSourceExec.scala
(X) sources/RateStreamMicroBatchStream.scala
(X) state/HDFSBackedStateStoreProvider.scala
(X) state/RocksDB.scala
(X) state/RocksDBFileManager.scala
(X) state/RocksDBLoader.scala
(X) state/RocksDBMemoryManager.scala
(X) state/RocksDBStateStoreProvider.scala
(X) state/StateSchemaCompatibilityChecker.scala
(X) state/StateStore.scala
(X) state/StateStoreChangelog.scala
(X) state/StateStoreCoordinator.scala
(X) state/StreamingSessionWindowStateManager.scala
(X) state/SymmetricHashJoinStateManager.scala


was (Author: JIRAUSER285772):
spark$ grep logInfo sql/core/src/main/* -R | cut -d ':' -f 1 | sort -u

Under sql/core/src/main/scala/org/apache/spark/sql/execution/streaming:

(X) AsyncProgressTrackingMicroBatchExecution.scala
(X) CheckpointFileManager.scala
(X) CompactibleFileStreamLog.scala
(X) FileStreamSink.scala
(X) FileStreamSinkLog.scala
(X) FileStreamSource.scala
(X) HDFSMetadataLog.scala
(X) IncrementalExecution.scala
(X) ManifestFileCommitProtocol.scala
(X) MetadataLogFileIndex.scala
(X) MicroBatchExecution.scala
(X) ProgressReporter.scala
(X) ResolveWriteToStream.scala
(X) StreamExecution.scala
(X) WatermarkTracker.scala
(X) continuous/ContinuousExecution.scala
(X) continuous/ContinuousQueuedDataReader.scala
(X) continuous/ContinuousWriteRDD.scala
(X) continuous/EpochCoordinator.scala
(X) continuous/WriteToContinuousDataSourceExec.scala
(X) sources/RateStreamMicroBatchStream.scala
(X) state/HDFSBackedStateStoreProvider.scala
(X) state/RocksDB.scala
state/RocksDBFileManager.scala
state/RocksDBLoader.scala
state/RocksDBMemoryManager.scala
state/RocksDBStateStoreProvider.scala
state/StateSchemaCompatibilityChecker.scala
state/StateStore.scala
state/StateStoreChangelog.scala
state/StateStoreCoordinator.scala
state/StreamingSessionWindowStateManager.scala
state/SymmetricHashJoinStateManager.scala

 

> Streaming: Migrate logInfo with variables to structured logging framework
> -
>
> Key: SPARK-47597
> URL: https://issues.apache.org/jira/browse/SPARK-47597
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Gengliang Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-47597) Streaming: Migrate logInfo with variables to structured logging framework

2024-04-23 Thread Daniel (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840213#comment-17840213
 ] 

Daniel edited comment on SPARK-47597 at 4/23/24 9:14 PM:
-

spark$ grep logInfo sql/core/src/main/* -R | cut -d ':' -f 1 | sort -u

Under sql/core/src/main/scala/org/apache/spark/sql/execution/streaming:

(X) AsyncProgressTrackingMicroBatchExecution.scala
(X) CheckpointFileManager.scala
(X) CompactibleFileStreamLog.scala
(X) FileStreamSink.scala
(X) FileStreamSinkLog.scala
(X) FileStreamSource.scala
(X) HDFSMetadataLog.scala
(X) IncrementalExecution.scala
(X) ManifestFileCommitProtocol.scala
(X) MetadataLogFileIndex.scala
(X) MicroBatchExecution.scala
(X) ProgressReporter.scala
(X) ResolveWriteToStream.scala
(X) StreamExecution.scala
(X) WatermarkTracker.scala
(X) continuous/ContinuousExecution.scala
(X) continuous/ContinuousQueuedDataReader.scala
(X) continuous/ContinuousWriteRDD.scala
(X) continuous/EpochCoordinator.scala
(X) continuous/WriteToContinuousDataSourceExec.scala
(X) sources/RateStreamMicroBatchStream.scala
(X) state/HDFSBackedStateStoreProvider.scala
(X) state/RocksDB.scala
state/RocksDBFileManager.scala
state/RocksDBLoader.scala
state/RocksDBMemoryManager.scala
state/RocksDBStateStoreProvider.scala
state/StateSchemaCompatibilityChecker.scala
state/StateStore.scala
state/StateStoreChangelog.scala
state/StateStoreCoordinator.scala
state/StreamingSessionWindowStateManager.scala
state/SymmetricHashJoinStateManager.scala

 


was (Author: JIRAUSER285772):
spark$ grep logInfo sql/core/src/main/* -R | cut -d ':' -f 1 | sort -u

Under sql/core/src/main/scala/org/apache/spark/sql/execution/streaming:

(X) AsyncProgressTrackingMicroBatchExecution.scala
(X) CheckpointFileManager.scala
(X) CompactibleFileStreamLog.scala
(X) FileStreamSink.scala
(X) FileStreamSinkLog.scala
(X) FileStreamSource.scala
(X) HDFSMetadataLog.scala
(X) IncrementalExecution.scala
(X) ManifestFileCommitProtocol.scala
(X) MetadataLogFileIndex.scala
(X) MicroBatchExecution.scala
(X) ProgressReporter.scala
(X) ResolveWriteToStream.scala
(X) StreamExecution.scala
(X) WatermarkTracker.scala
(X) continuous/ContinuousExecution.scala
(X) continuous/ContinuousQueuedDataReader.scala
(X) continuous/ContinuousWriteRDD.scala
(X) continuous/EpochCoordinator.scala
(X) continuous/WriteToContinuousDataSourceExec.scala
(X) sources/RateStreamMicroBatchStream.scala
(X) state/HDFSBackedStateStoreProvider.scala
state/RocksDB.scala
state/RocksDBFileManager.scala
state/RocksDBLoader.scala
state/RocksDBMemoryManager.scala
state/RocksDBStateStoreProvider.scala
state/StateSchemaCompatibilityChecker.scala
state/StateStore.scala
state/StateStoreChangelog.scala
state/StateStoreCoordinator.scala
state/StreamingSessionWindowStateManager.scala
state/SymmetricHashJoinStateManager.scala

 

> Streaming: Migrate logInfo with variables to structured logging framework
> -
>
> Key: SPARK-47597
> URL: https://issues.apache.org/jira/browse/SPARK-47597
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Gengliang Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-47597) Streaming: Migrate logInfo with variables to structured logging framework

2024-04-23 Thread Daniel (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840213#comment-17840213
 ] 

Daniel edited comment on SPARK-47597 at 4/23/24 9:04 PM:
-

spark$ grep logInfo sql/core/src/main/* -R | cut -d ':' -f 1 | sort -u

Under sql/core/src/main/scala/org/apache/spark/sql/execution/streaming:

(X) AsyncProgressTrackingMicroBatchExecution.scala
(X) CheckpointFileManager.scala
(X) CompactibleFileStreamLog.scala
(X) FileStreamSink.scala
(X) FileStreamSinkLog.scala
(X) FileStreamSource.scala
(X) HDFSMetadataLog.scala
(X) IncrementalExecution.scala
(X) ManifestFileCommitProtocol.scala
(X) MetadataLogFileIndex.scala
(X) MicroBatchExecution.scala
(X) ProgressReporter.scala
(X) ResolveWriteToStream.scala
(X) StreamExecution.scala
(X) WatermarkTracker.scala
(X) continuous/ContinuousExecution.scala
(X) continuous/ContinuousQueuedDataReader.scala
(X) continuous/ContinuousWriteRDD.scala
(X) continuous/EpochCoordinator.scala
(X) continuous/WriteToContinuousDataSourceExec.scala
(X) sources/RateStreamMicroBatchStream.scala
(X) state/HDFSBackedStateStoreProvider.scala
state/RocksDB.scala
state/RocksDBFileManager.scala
state/RocksDBLoader.scala
state/RocksDBMemoryManager.scala
state/RocksDBStateStoreProvider.scala
state/StateSchemaCompatibilityChecker.scala
state/StateStore.scala
state/StateStoreChangelog.scala
state/StateStoreCoordinator.scala
state/StreamingSessionWindowStateManager.scala
state/SymmetricHashJoinStateManager.scala

 


was (Author: JIRAUSER285772):
spark$ grep logInfo sql/core/src/main/* -R | cut -d ':' -f 1 | sort -u

Under sql/core/src/main/scala/org/apache/spark/sql/execution/streaming:

(X) AsyncProgressTrackingMicroBatchExecution.scala
(X) CheckpointFileManager.scala
(X) CompactibleFileStreamLog.scala
(X) FileStreamSink.scala
(X) FileStreamSinkLog.scala
(X) FileStreamSource.scala
(X) HDFSMetadataLog.scala
(X) IncrementalExecution.scala
(X) ManifestFileCommitProtocol.scala
(X) MetadataLogFileIndex.scala
(X) MicroBatchExecution.scala
(X) ProgressReporter.scala
(X) ResolveWriteToStream.scala
(X) StreamExecution.scala
(X) WatermarkTracker.scala
continuous/ContinuousExecution.scala
continuous/ContinuousQueuedDataReader.scala
continuous/ContinuousWriteRDD.scala
continuous/EpochCoordinator.scala
continuous/WriteToContinuousDataSourceExec.scala
sources/RateStreamMicroBatchStream.scala
state/HDFSBackedStateStoreProvider.scala
state/RocksDB.scala
state/RocksDBFileManager.scala
state/RocksDBLoader.scala
state/RocksDBMemoryManager.scala
state/RocksDBStateStoreProvider.scala
state/StateSchemaCompatibilityChecker.scala
state/StateStore.scala
state/StateStoreChangelog.scala
state/StateStoreCoordinator.scala
state/StreamingSessionWindowStateManager.scala
state/SymmetricHashJoinStateManager.scala

 

> Streaming: Migrate logInfo with variables to structured logging framework
> -
>
> Key: SPARK-47597
> URL: https://issues.apache.org/jira/browse/SPARK-47597
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Gengliang Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-47597) Streaming: Migrate logInfo with variables to structured logging framework

2024-04-23 Thread Daniel (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840213#comment-17840213
 ] 

Daniel edited comment on SPARK-47597 at 4/23/24 8:31 PM:
-

spark$ grep logInfo sql/core/src/main/* -R | cut -d ':' -f 1 | sort -u

Under sql/core/src/main/scala/org/apache/spark/sql/execution/streaming:

(X) AsyncProgressTrackingMicroBatchExecution.scala
(X) CheckpointFileManager.scala
(X) CompactibleFileStreamLog.scala
(X) FileStreamSink.scala
(X) FileStreamSinkLog.scala
(X) FileStreamSource.scala
(X) HDFSMetadataLog.scala
(X) IncrementalExecution.scala
(X) ManifestFileCommitProtocol.scala
(X) MetadataLogFileIndex.scala
(X) MicroBatchExecution.scala
(X) ProgressReporter.scala
(X) ResolveWriteToStream.scala
(X) StreamExecution.scala
(X) WatermarkTracker.scala
continuous/ContinuousExecution.scala
continuous/ContinuousQueuedDataReader.scala
continuous/ContinuousWriteRDD.scala
continuous/EpochCoordinator.scala
continuous/WriteToContinuousDataSourceExec.scala
sources/RateStreamMicroBatchStream.scala
state/HDFSBackedStateStoreProvider.scala
state/RocksDB.scala
state/RocksDBFileManager.scala
state/RocksDBLoader.scala
state/RocksDBMemoryManager.scala
state/RocksDBStateStoreProvider.scala
state/StateSchemaCompatibilityChecker.scala
state/StateStore.scala
state/StateStoreChangelog.scala
state/StateStoreCoordinator.scala
state/StreamingSessionWindowStateManager.scala
state/SymmetricHashJoinStateManager.scala

 


was (Author: JIRAUSER285772):
spark$ grep logInfo sql/core/src/main/* -R | cut -d ':' -f 1 | sort -u

Under sql/core/src/main/scala/org/apache/spark/sql/execution/streaming:

(X) AsyncProgressTrackingMicroBatchExecution.scala
(X) CheckpointFileManager.scala
(X) CompactibleFileStreamLog.scala
(X) FileStreamSink.scala
(X) FileStreamSinkLog.scala
(X) FileStreamSource.scala
(X) HDFSMetadataLog.scala
(X) IncrementalExecution.scala
(X) ManifestFileCommitProtocol.scala
(X) MetadataLogFileIndex.scala
(X) MicroBatchExecution.scala
(X) ProgressReporter.scala
(X) ResolveWriteToStream.scala
StreamExecution.scala
WatermarkTracker.scala
continuous/ContinuousExecution.scala
continuous/ContinuousQueuedDataReader.scala
continuous/ContinuousWriteRDD.scala
continuous/EpochCoordinator.scala
continuous/WriteToContinuousDataSourceExec.scala
sources/RateStreamMicroBatchStream.scala
state/HDFSBackedStateStoreProvider.scala
state/RocksDB.scala
state/RocksDBFileManager.scala
state/RocksDBLoader.scala
state/RocksDBMemoryManager.scala
state/RocksDBStateStoreProvider.scala
state/StateSchemaCompatibilityChecker.scala
state/StateStore.scala
state/StateStoreChangelog.scala
state/StateStoreCoordinator.scala
state/StreamingSessionWindowStateManager.scala
state/SymmetricHashJoinStateManager.scala

 

> Streaming: Migrate logInfo with variables to structured logging framework
> -
>
> Key: SPARK-47597
> URL: https://issues.apache.org/jira/browse/SPARK-47597
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Gengliang Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-47597) Streaming: Migrate logInfo with variables to structured logging framework

2024-04-23 Thread Daniel (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840213#comment-17840213
 ] 

Daniel edited comment on SPARK-47597 at 4/23/24 8:27 PM:
-

spark$ grep logInfo sql/core/src/main/* -R | cut -d ':' -f 1 | sort -u

Under sql/core/src/main/scala/org/apache/spark/sql/execution/streaming:

(X) AsyncProgressTrackingMicroBatchExecution.scala
(X) CheckpointFileManager.scala
(X) CompactibleFileStreamLog.scala
(X) FileStreamSink.scala
(X) FileStreamSinkLog.scala
(X) FileStreamSource.scala
(X) HDFSMetadataLog.scala
(X) IncrementalExecution.scala
(X) ManifestFileCommitProtocol.scala
(X) MetadataLogFileIndex.scala
(X) MicroBatchExecution.scala
(X) ProgressReporter.scala
(X) ResolveWriteToStream.scala
StreamExecution.scala
WatermarkTracker.scala
continuous/ContinuousExecution.scala
continuous/ContinuousQueuedDataReader.scala
continuous/ContinuousWriteRDD.scala
continuous/EpochCoordinator.scala
continuous/WriteToContinuousDataSourceExec.scala
sources/RateStreamMicroBatchStream.scala
state/HDFSBackedStateStoreProvider.scala
state/RocksDB.scala
state/RocksDBFileManager.scala
state/RocksDBLoader.scala
state/RocksDBMemoryManager.scala
state/RocksDBStateStoreProvider.scala
state/StateSchemaCompatibilityChecker.scala
state/StateStore.scala
state/StateStoreChangelog.scala
state/StateStoreCoordinator.scala
state/StreamingSessionWindowStateManager.scala
state/SymmetricHashJoinStateManager.scala

 


was (Author: JIRAUSER285772):
spark$ grep logInfo sql/core/src/main/* -R | cut -d ':' -f 1 | sort -u

Under sql/core/src/main/scala/org/apache/spark/sql/execution/streaming:

AsyncProgressTrackingMicroBatchExecution.scala
CheckpointFileManager.scala
CompactibleFileStreamLog.scala
FileStreamSink.scala
FileStreamSinkLog.scala
FileStreamSource.scala
HDFSMetadataLog.scala
IncrementalExecution.scala
ManifestFileCommitProtocol.scala
MetadataLogFileIndex.scala
MicroBatchExecution.scala
ProgressReporter.scala
ResolveWriteToStream.scala
StreamExecution.scala
WatermarkTracker.scala
continuous/ContinuousExecution.scala
continuous/ContinuousQueuedDataReader.scala
continuous/ContinuousWriteRDD.scala
continuous/EpochCoordinator.scala
continuous/WriteToContinuousDataSourceExec.scala
sources/RateStreamMicroBatchStream.scala
state/HDFSBackedStateStoreProvider.scala
state/RocksDB.scala
state/RocksDBFileManager.scala
state/RocksDBLoader.scala
state/RocksDBMemoryManager.scala
state/RocksDBStateStoreProvider.scala
state/StateSchemaCompatibilityChecker.scala
state/StateStore.scala
state/StateStoreChangelog.scala
state/StateStoreCoordinator.scala
state/StreamingSessionWindowStateManager.scala
state/SymmetricHashJoinStateManager.scala

 

> Streaming: Migrate logInfo with variables to structured logging framework
> -
>
> Key: SPARK-47597
> URL: https://issues.apache.org/jira/browse/SPARK-47597
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Gengliang Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-47597) Streaming: Migrate logInfo with variables to structured logging framework

2024-04-23 Thread Daniel (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17840213#comment-17840213
 ] 

Daniel commented on SPARK-47597:


spark$ grep logInfo sql/core/src/main/* -R | cut -d ':' -f 1 | sort -u

Under sql/core/src/main/scala/org/apache/spark/sql/execution/streaming:

AsyncProgressTrackingMicroBatchExecution.scala
CheckpointFileManager.scala
CompactibleFileStreamLog.scala
FileStreamSink.scala
FileStreamSinkLog.scala
FileStreamSource.scala
HDFSMetadataLog.scala
IncrementalExecution.scala
ManifestFileCommitProtocol.scala
MetadataLogFileIndex.scala
MicroBatchExecution.scala
ProgressReporter.scala
ResolveWriteToStream.scala
StreamExecution.scala
WatermarkTracker.scala
continuous/ContinuousExecution.scala
continuous/ContinuousQueuedDataReader.scala
continuous/ContinuousWriteRDD.scala
continuous/EpochCoordinator.scala
continuous/WriteToContinuousDataSourceExec.scala
sources/RateStreamMicroBatchStream.scala
state/HDFSBackedStateStoreProvider.scala
state/RocksDB.scala
state/RocksDBFileManager.scala
state/RocksDBLoader.scala
state/RocksDBMemoryManager.scala
state/RocksDBStateStoreProvider.scala
state/StateSchemaCompatibilityChecker.scala
state/StateStore.scala
state/StateStoreChangelog.scala
state/StateStoreCoordinator.scala
state/StreamingSessionWindowStateManager.scala
state/SymmetricHashJoinStateManager.scala

 

> Streaming: Migrate logInfo with variables to structured logging framework
> -
>
> Key: SPARK-47597
> URL: https://issues.apache.org/jira/browse/SPARK-47597
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Gengliang Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47604) Resource managers: Migrate logInfo with variables to structured logging framework

2024-04-23 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-47604.

Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46130
[https://github.com/apache/spark/pull/46130]

> Resource managers: Migrate logInfo with variables to structured logging 
> framework
> -
>
> Key: SPARK-47604
> URL: https://issues.apache.org/jira/browse/SPARK-47604
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Gengliang Wang
>Assignee: BingKun Pan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47960) Support Chaining Stateful Operators in TransformWithState

2024-04-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47960:
---
Labels: pull-request-available  (was: )

> Support Chaining Stateful Operators in TransformWithState
> -
>
> Key: SPARK-47960
> URL: https://issues.apache.org/jira/browse/SPARK-47960
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Bhuwan Sahni
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> This issue tracks adding support to chain stateful operators after the 
> Arbitrary State API, transformWithState. In order to support chaining, we 
> need to allow the user to specify the new eventTimeColumn in the output from 
> StatefulProcessor. Any watermark evaluation expressions downstream after 
> transformWithState would use the user-specified eventTimeColumn.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47961) CREATE TABLE AS SELECT changes behaviour in SPARK 3.4.0

2024-04-23 Thread Eugen Stoianovici (Jira)
Eugen Stoianovici created SPARK-47961:
-

 Summary: CREATE TABLE AS SELECT changes behaviour in SPARK 3.4.0
 Key: SPARK-47961
 URL: https://issues.apache.org/jira/browse/SPARK-47961
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.4.0
Reporter: Eugen Stoianovici


SPARK-41859 changes the behaviour of `CREATE TABLE AS SELECT ...` from 
OVERWRITE to APPEND when {{spark.sql.legacy.allowNonEmptyLocationInCTAS}} is 
set to {{true}}:

{{drop table if exists test_table;}}
{{create table test_table location '/tmp/test_table' stored as parquet as 
select 1 as col union all select 2 as col;}}
{{drop table if exists test_table;}}
{{create table test_table location '/tmp/test_table' stored as parquet as 
select 3 as col union all select 4 as col;}}
{{select * from test_table;}}

This produces {3, 4} in Spark < 3.4.0 and {1, 2, 3, 4} in Spark 3.4.0 and later. 
This is a silent change in {{spark.sql.legacy.allowNonEmptyLocationInCTAS}} 
behaviour which introduces wrong results in the user application.
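
For completeness, the reproduction above as a self-contained sketch (assuming a 
Hive-enabled local SparkSession and the legacy flag set to {{true}}; the table 
location and values are exactly those from the example):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[1]")
  .enableHiveSupport()
  .getOrCreate()
spark.conf.set("spark.sql.legacy.allowNonEmptyLocationInCTAS", "true")

spark.sql("drop table if exists test_table")
spark.sql("create table test_table location '/tmp/test_table' stored as parquet as " +
  "select 1 as col union all select 2 as col")
spark.sql("drop table if exists test_table")
spark.sql("create table test_table location '/tmp/test_table' stored as parquet as " +
  "select 3 as col union all select 4 as col")

// Spark < 3.4.0 returns {3, 4}; Spark 3.4.0 and later returns {1, 2, 3, 4}.
spark.sql("select * from test_table").show()
{code}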

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47960) Support Chaining Stateful Operators in TransformWithState

2024-04-23 Thread Bhuwan Sahni (Jira)
Bhuwan Sahni created SPARK-47960:


 Summary: Support Chaining Stateful Operators in TransformWithState
 Key: SPARK-47960
 URL: https://issues.apache.org/jira/browse/SPARK-47960
 Project: Spark
  Issue Type: New Feature
  Components: Structured Streaming
Affects Versions: 4.0.0
Reporter: Bhuwan Sahni
 Fix For: 4.0.0


This issue tracks adding support to chain stateful operators after the 
Arbitrary State API, transformWithState. In order to support chaining, we need 
to allow the user to specify the new eventTimeColumn in the output from 
StatefulProcessor. Any watermark evaluation expressions downstream after 
transformWithState would use the user-specified eventTimeColumn.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47959) Improve GET_JSON_OBJECT performance on executors running multiple tasks

2024-04-23 Thread Zheng Shao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated SPARK-47959:
---
Description: 
We have a Spark executor that is running 32 workers in parallel.  The query is 
a simple SELECT with several `GET_JSON_OBJECT` UDF calls.

We noticed that 80+% of the worker threads are blocked on the following stack 
trace:

 
{code:java}
com.fasterxml.jackson.core.util.InternCache.intern(InternCache.java:50) - 
blocked on java.lang.Object@7529fde1 
com.fasterxml.jackson.core.sym.ByteQuadsCanonicalizer.addName(ByteQuadsCanonicalizer.java:947)
 
com.fasterxml.jackson.core.json.UTF8StreamJsonParser.addName(UTF8StreamJsonParser.java:2482)
 
com.fasterxml.jackson.core.json.UTF8StreamJsonParser.findName(UTF8StreamJsonParser.java:2339)
 
com.fasterxml.jackson.core.json.UTF8StreamJsonParser.parseMediumName(UTF8StreamJsonParser.java:1870)
 
com.fasterxml.jackson.core.json.UTF8StreamJsonParser._parseName(UTF8StreamJsonParser.java:1825)
 
com.fasterxml.jackson.core.json.UTF8StreamJsonParser.nextToken(UTF8StreamJsonParser.java:798)
 
com.fasterxml.jackson.core.base.ParserMinimalBase.skipChildren(ParserMinimalBase.java:240)
 
org.apache.spark.sql.catalyst.expressions.GetJsonObjectBase.evaluatePath(jsonExpressions.scala:383)
 
org.apache.spark.sql.catalyst.expressions.GetJsonObjectBase.evaluatePath(jsonExpressions.scala:287)
 
org.apache.spark.sql.catalyst.expressions.GetJsonObjectBase.$anonfun$eval$4(jsonExpressions.scala:198)
 
org.apache.spark.sql.catalyst.expressions.GetJsonObjectBase.$anonfun$eval$4$adapted(jsonExpressions.scala:196)
 
org.apache.spark.sql.catalyst.expressions.GetJsonObjectBase$$Lambda$8585/1316745697.apply(Unknown
 Source)
...
{code}
 

Apparently jackson-core has had this performance bug from versions 2.3 through 
2.15, and it is not fixed until version 2.18 (unreleased): 
[https://github.com/FasterXML/jackson-core/blob/fc51d1e13f4ba62a25a739f26be9e05aaad88c3e/src/main/java/com/fasterxml/jackson/core/util/InternCache.java#L50]
 
{code:java}
            synchronized (lock) {
                if (size() >= MAX_ENTRIES) {
                    clear();
                }
            }
{code}
 
instead of 
[https://github.com/FasterXML/jackson-core/blob/8b87cc1a96f649a7e7872c5baa8cf97909cabf6b/src/main/java/com/fasterxml/jackson/core/util/InternCache.java#L59]
 
{code:java}
            /* As of 2.18, the limit is not strictly enforced, but we do try to
             * clear entries if we have reached the limit. We do not expect to
             * go too much over the limit, and if we do, it's not a huge problem.
             * If some other thread has the lock, we will not clear but the lock should
             * not be held for long, so another thread should be able to clear in the near future.
             */
            if (lock.tryLock()) {
                try {
                    if (size() >= DEFAULT_MAX_ENTRIES) {
                        clear();
                    }
                } finally {
                    lock.unlock();
                }
            }   {code}
 

Potential fixes:
 # Upgrade to Jackson-core 2.18 when it's released;
 # Follow [https://github.com/FasterXML/jackson-core/issues/998] - I don't 
totally understand the options suggested by this thread yet (a small illustrative 
sketch follows below).
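
As an additional, purely illustrative sketch (not one of the fixes above): 
field-name interning is what routes parsing through InternCache, and it can be 
disabled on the JsonFactory, trading the shared synchronized cache for per-parse 
String allocations:

{code:scala}
import com.fasterxml.jackson.core.{JsonFactory, JsonToken}

// With INTERN_FIELD_NAMES disabled, field names no longer go through the
// synchronized InternCache that the worker threads above are blocked on.
val factory = new JsonFactory()
  .disable(JsonFactory.Feature.INTERN_FIELD_NAMES)

val parser = factory.createParser("""{"a": 1, "b": {"c": 2}}""")
var token: JsonToken = parser.nextToken()
while (token != null) {
  token = parser.nextToken()  // drain the token stream as GET_JSON_OBJECT would
}
parser.close()
{code}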

  was:
We have a Spark executor that is running 32 workers in parallel.  The query is 
a simple SELECT with several `GET_JSON_OBJECT` UDF calls.

We noticed that 80+% of the worker threads are blocked on the following stack 
trace:

 
{code:java}
com.fasterxml.jackson.core.util.InternCache.intern(InternCache.java:50) - 
blocked on java.lang.Object@7529fde1 
com.fasterxml.jackson.core.sym.ByteQuadsCanonicalizer.addName(ByteQuadsCanonicalizer.java:947)
 
com.fasterxml.jackson.core.json.UTF8StreamJsonParser.addName(UTF8StreamJsonParser.java:2482)
 
com.fasterxml.jackson.core.json.UTF8StreamJsonParser.findName(UTF8StreamJsonParser.java:2339)
 
com.fasterxml.jackson.core.json.UTF8StreamJsonParser.parseMediumName(UTF8StreamJsonParser.java:1870)
 
com.fasterxml.jackson.core.json.UTF8StreamJsonParser._parseName(UTF8StreamJsonParser.java:1825)
 
com.fasterxml.jackson.core.json.UTF8StreamJsonParser.nextToken(UTF8StreamJsonParser.java:798)
 
com.fasterxml.jackson.core.base.ParserMinimalBase.skipChildren(ParserMinimalBase.java:240)
 
org.apache.spark.sql.catalyst.expressions.GetJsonObjectBase.evaluatePath(jsonExpressions.scala:383)
 
org.apache.spark.sql.catalyst.expressions.GetJsonObjectBase.evaluatePath(jsonExpressions.scala:287)
 
org.apache.spark.sql.catalyst.expressions.GetJsonObjectBase.$anonfun$eval$4(jsonExpressions.scala:198)
 
org.apache.spark.sql.catalyst.expressions.GetJsonObjectBase.$anonfun$eval$4$adapted(jsonExpressions.scala:196)
 
org.apache.spark.sql.catalyst.expressions.GetJsonObjectBase$$Lambda$8585/1316745697.apply(Unknown
 Source)
...
{code}
 

Apparently jackson-core has such a 

[jira] [Updated] (SPARK-47959) Improve GET_JSON_OBJECT performance on executors running multiple tasks

2024-04-23 Thread Zheng Shao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated SPARK-47959:
---
Description: 
We have a Spark executor that is running 32 workers in parallel.  The query is 
a simple SELECT with several `GET_JSON_OBJECT` UDF calls.

We noticed that 80+% of the worker threads are blocked on the following stack 
trace:

 
{code:java}
com.fasterxml.jackson.core.util.InternCache.intern(InternCache.java:50) - 
blocked on java.lang.Object@7529fde1 
com.fasterxml.jackson.core.sym.ByteQuadsCanonicalizer.addName(ByteQuadsCanonicalizer.java:947)
 
com.fasterxml.jackson.core.json.UTF8StreamJsonParser.addName(UTF8StreamJsonParser.java:2482)
 
com.fasterxml.jackson.core.json.UTF8StreamJsonParser.findName(UTF8StreamJsonParser.java:2339)
 
com.fasterxml.jackson.core.json.UTF8StreamJsonParser.parseMediumName(UTF8StreamJsonParser.java:1870)
 
com.fasterxml.jackson.core.json.UTF8StreamJsonParser._parseName(UTF8StreamJsonParser.java:1825)
 
com.fasterxml.jackson.core.json.UTF8StreamJsonParser.nextToken(UTF8StreamJsonParser.java:798)
 
com.fasterxml.jackson.core.base.ParserMinimalBase.skipChildren(ParserMinimalBase.java:240)
 
org.apache.spark.sql.catalyst.expressions.GetJsonObjectBase.evaluatePath(jsonExpressions.scala:383)
 
org.apache.spark.sql.catalyst.expressions.GetJsonObjectBase.evaluatePath(jsonExpressions.scala:287)
 
org.apache.spark.sql.catalyst.expressions.GetJsonObjectBase.$anonfun$eval$4(jsonExpressions.scala:198)
 
org.apache.spark.sql.catalyst.expressions.GetJsonObjectBase.$anonfun$eval$4$adapted(jsonExpressions.scala:196)
 
org.apache.spark.sql.catalyst.expressions.GetJsonObjectBase$$Lambda$8585/1316745697.apply(Unknown
 Source)
...
{code}
 

Apparently jackson-core has had this performance bug from versions 2.3 through 
2.15, and it is not fixed until version 2.18 (unreleased): 
[https://github.com/FasterXML/jackson-core/blob/fc51d1e13f4ba62a25a739f26be9e05aaad88c3e/src/main/java/com/fasterxml/jackson/core/util/InternCache.java#L50]
 
{code:java}
            synchronized (lock) {
                if (size() >= MAX_ENTRIES) {
                    clear();
                }
            }
{code}
 
instead of 
[https://github.com/FasterXML/jackson-core/blob/8b87cc1a96f649a7e7872c5baa8cf97909cabf6b/src/main/java/com/fasterxml/jackson/core/util/InternCache.java#L59]
 
{code:java}
            /* As of 2.18, the limit is not strictly enforced, but we do try to
             * clear entries if we have reached the limit. We do not expect to
             * go too much over the limit, and if we do, it's not a huge problem.
             * If some other thread has the lock, we will not clear but the lock should
             * not be held for long, so another thread should be able to clear in the near future.
             */
            if (lock.tryLock()) {
                try {
                    if (size() >= DEFAULT_MAX_ENTRIES) {
                        clear();
                    }
                } finally {
                    lock.unlock();
                }
            }   {code}
 

  was:
We have a Spark executor that is running 32 workers in parallel.  The query is 
a simple SELECT with several `GET_JSON_OBJECT` UDF calls.

We noticed that 80+% of the worker threads are blocked on the following stack 
trace:

 
{code:java}
com.fasterxml.jackson.core.util.InternCache.intern(InternCache.java:50) - 
blocked on java.lang.Object@7529fde1 
com.fasterxml.jackson.core.sym.ByteQuadsCanonicalizer.addName(ByteQuadsCanonicalizer.java:947)
 
com.fasterxml.jackson.core.json.UTF8StreamJsonParser.addName(UTF8StreamJsonParser.java:2482)
 
com.fasterxml.jackson.core.json.UTF8StreamJsonParser.findName(UTF8StreamJsonParser.java:2339)
 
com.fasterxml.jackson.core.json.UTF8StreamJsonParser.parseMediumName(UTF8StreamJsonParser.java:1870)
 
com.fasterxml.jackson.core.json.UTF8StreamJsonParser._parseName(UTF8StreamJsonParser.java:1825)
 
com.fasterxml.jackson.core.json.UTF8StreamJsonParser.nextToken(UTF8StreamJsonParser.java:798)
 
com.fasterxml.jackson.core.base.ParserMinimalBase.skipChildren(ParserMinimalBase.java:240)
 
org.apache.spark.sql.catalyst.expressions.GetJsonObjectBase.evaluatePath(jsonExpressions.scala:383)
 
org.apache.spark.sql.catalyst.expressions.GetJsonObjectBase.evaluatePath(jsonExpressions.scala:287)
 
org.apache.spark.sql.catalyst.expressions.GetJsonObjectBase.$anonfun$eval$4(jsonExpressions.scala:198)
 
org.apache.spark.sql.catalyst.expressions.GetJsonObjectBase.$anonfun$eval$4$adapted(jsonExpressions.scala:196)
 
org.apache.spark.sql.catalyst.expressions.GetJsonObjectBase$$Lambda$8585/1316745697.apply(Unknown
 Source)
...
{code}
 

Apparently jackson-core has such a performance bug from version 2.3 - 2.15, and 
not fixed until version 2.18 (unreleased): 

[jira] [Updated] (SPARK-47959) Improve GET_JSON_OBJECT performance on executors running multiple tasks

2024-04-23 Thread Zheng Shao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated SPARK-47959:
---
Description: 
We have a Spark executor that is running 32 workers in parallel.  The query is 
a simple SELECT with several `GET_JSON_OBJECT` UDF calls.

We noticed that 80+% of the stacktrace of the worker threads are blocked on the 
following stacktrace:

 
{code:java}
com.fasterxml.jackson.core.util.InternCache.intern(InternCache.java:50) - 
blocked on java.lang.Object@7529fde1 
com.fasterxml.jackson.core.sym.ByteQuadsCanonicalizer.addName(ByteQuadsCanonicalizer.java:947)
 
com.fasterxml.jackson.core.json.UTF8StreamJsonParser.addName(UTF8StreamJsonParser.java:2482)
 
com.fasterxml.jackson.core.json.UTF8StreamJsonParser.findName(UTF8StreamJsonParser.java:2339)
 
com.fasterxml.jackson.core.json.UTF8StreamJsonParser.parseMediumName(UTF8StreamJsonParser.java:1870)
 
com.fasterxml.jackson.core.json.UTF8StreamJsonParser._parseName(UTF8StreamJsonParser.java:1825)
 
com.fasterxml.jackson.core.json.UTF8StreamJsonParser.nextToken(UTF8StreamJsonParser.java:798)
 
com.fasterxml.jackson.core.base.ParserMinimalBase.skipChildren(ParserMinimalBase.java:240)
 
org.apache.spark.sql.catalyst.expressions.GetJsonObjectBase.evaluatePath(jsonExpressions.scala:383)
 
org.apache.spark.sql.catalyst.expressions.GetJsonObjectBase.evaluatePath(jsonExpressions.scala:287)
 
org.apache.spark.sql.catalyst.expressions.GetJsonObjectBase.$anonfun$eval$4(jsonExpressions.scala:198)
 
org.apache.spark.sql.catalyst.expressions.GetJsonObjectBase.$anonfun$eval$4$adapted(jsonExpressions.scala:196)
 
org.apache.spark.sql.catalyst.expressions.GetJsonObjectBase$$Lambda$8585/1316745697.apply(Unknown
 Source)
...
{code}
 

Apparently jackson-core has such a performance bug from version 2.3 - 2.15, and 
not fixed until version 2.18 (unreleased): 
[https://github.com/FasterXML/jackson-core/blob/fc51d1e13f4ba62a25a739f26be9e05aaad88c3e/src/main/java/com/fasterxml/jackson/core/util/InternCache.java#L50]
 


{code:java}
            synchronized (lock) {
                if (size() >= MAX_ENTRIES) {
                    clear();
                }
            }
{code}
 
instead of 
[https://github.com/FasterXML/jackson-core/blob/8b87cc1a96f649a7e7872c5baa8cf97909cabf6b/src/main/java/com/fasterxml/jackson/core/util/InternCache.java#L59]
 
{code:java}
        /* 18-Sep-2013, tatu: We used to use LinkedHashMap, which has simple LRU
         *   method. No such functionality exists with CHM; and let's use simplest
         *   possible limitation: just clear all contents. This because otherwise
         *   we are simply likely to keep on clearing same, commonly used entries.
         */
        if (size() >= MAX_ENTRIES) {
            /* Not incorrect wrt well-known double-locking anti-pattern because
             * underlying storage gives close enough answer to real one here; and
             * we are more concerned with flooding than starvation.
             */
            synchronized (lock) {
                if (size() >= MAX_ENTRIES) {
                    clear();
                }
            }
        }
{code}
 

  was:
We have a Spark executor that is running 32 workers in parallel.  The query is 
a simple SELECT with several `GET_JSON_OBJECT` UDF calls.

We noticed that 80+% of the stacktrace of the worker threads are blocked on the 
following stacktrace:

 
{code:java}
com.fasterxml.jackson.core.util.InternCache.intern(InternCache.java:50) - 
blocked on java.lang.Object@7529fde1 
com.fasterxml.jackson.core.sym.ByteQuadsCanonicalizer.addName(ByteQuadsCanonicalizer.java:947)
 
com.fasterxml.jackson.core.json.UTF8StreamJsonParser.addName(UTF8StreamJsonParser.java:2482)
 
com.fasterxml.jackson.core.json.UTF8StreamJsonParser.findName(UTF8StreamJsonParser.java:2339)
 
com.fasterxml.jackson.core.json.UTF8StreamJsonParser.parseMediumName(UTF8StreamJsonParser.java:1870)
 
com.fasterxml.jackson.core.json.UTF8StreamJsonParser._parseName(UTF8StreamJsonParser.java:1825)
 
com.fasterxml.jackson.core.json.UTF8StreamJsonParser.nextToken(UTF8StreamJsonParser.java:798)
 
com.fasterxml.jackson.core.base.ParserMinimalBase.skipChildren(ParserMinimalBase.java:240)
 
org.apache.spark.sql.catalyst.expressions.GetJsonObjectBase.evaluatePath(jsonExpressions.scala:383)
 
org.apache.spark.sql.catalyst.expressions.GetJsonObjectBase.evaluatePath(jsonExpressions.scala:287)
 
org.apache.spark.sql.catalyst.expressions.GetJsonObjectBase.$anonfun$eval$4(jsonExpressions.scala:198)
 
org.apache.spark.sql.catalyst.expressions.GetJsonObjectBase.$anonfun$eval$4$adapted(jsonExpressions.scala:196)
 
org.apache.spark.sql.catalyst.expressions.GetJsonObjectBase$$Lambda$8585/1316745697.apply(Unknown
 Source)
...
{code}
 

Apparently jackson-core has such a performance bug from version 2.3 - 2.15, and 
not fixed until version 2.18 (unreleased):

[jira] [Updated] (SPARK-47959) Improve GET_JSON_OBJECT performance on executors running multiple tasks

2024-04-23 Thread Zheng Shao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated SPARK-47959:
---
Description: 
We have a Spark executor that is running 32 workers in parallel.  The query is 
a simple SELECT with several `GET_JSON_OBJECT` UDF calls.

We noticed that 80+% of the stacktrace of the worker threads are blocked on the 
following stacktrace:

 
{code:java}
com.fasterxml.jackson.core.util.InternCache.intern(InternCache.java:50) - 
blocked on java.lang.Object@7529fde1 
com.fasterxml.jackson.core.sym.ByteQuadsCanonicalizer.addName(ByteQuadsCanonicalizer.java:947)
 
com.fasterxml.jackson.core.json.UTF8StreamJsonParser.addName(UTF8StreamJsonParser.java:2482)
 
com.fasterxml.jackson.core.json.UTF8StreamJsonParser.findName(UTF8StreamJsonParser.java:2339)
 
com.fasterxml.jackson.core.json.UTF8StreamJsonParser.parseMediumName(UTF8StreamJsonParser.java:1870)
 
com.fasterxml.jackson.core.json.UTF8StreamJsonParser._parseName(UTF8StreamJsonParser.java:1825)
 
com.fasterxml.jackson.core.json.UTF8StreamJsonParser.nextToken(UTF8StreamJsonParser.java:798)
 
com.fasterxml.jackson.core.base.ParserMinimalBase.skipChildren(ParserMinimalBase.java:240)
 
org.apache.spark.sql.catalyst.expressions.GetJsonObjectBase.evaluatePath(jsonExpressions.scala:383)
 
org.apache.spark.sql.catalyst.expressions.GetJsonObjectBase.evaluatePath(jsonExpressions.scala:287)
 
org.apache.spark.sql.catalyst.expressions.GetJsonObjectBase.$anonfun$eval$4(jsonExpressions.scala:198)
 
org.apache.spark.sql.catalyst.expressions.GetJsonObjectBase.$anonfun$eval$4$adapted(jsonExpressions.scala:196)
 
org.apache.spark.sql.catalyst.expressions.GetJsonObjectBase$$Lambda$8585/1316745697.apply(Unknown
 Source)
...
{code}
 

Apparently jackson-core has such a performance bug from version 2.3 - 2.15, and 
not fixed until version 2.18 (unreleased):
[https://github.com/FasterXML/jackson-core/blob/fc51d1e13f4ba62a25a739f26be9e05aaad88c3e/src/main/java/com/fasterxml/jackson/core/util/InternCache.java#L50]
{code:java}
            synchronized (lock) {
                if (size() >= MAX_ENTRIES) {
                    clear();
                }
            }
{code}

 
instead of 
[https://github.com/FasterXML/jackson-core/blob/8b87cc1a96f649a7e7872c5baa8cf97909cabf6b/src/main/java/com/fasterxml/jackson/core/util/InternCache.java#L59]
 
{code:java}
            /* As of 2.18, the limit is not strictly enforced, but we do try to
             * clear entries if we have reached the limit. We do not expect to
             * go too much over the limit, and if we do, it's not a huge problem.
             * If some other thread has the lock, we will not clear but the lock
             * should not be held for long, so another thread should be able to
             * clear in the near future.
             */
            if (lock.tryLock()) {
                try {
                    if (size() >= DEFAULT_MAX_ENTRIES) {
                        clear();
                    }
                } finally {
                    lock.unlock();
                }
            }
{code}
 

  was:
We have a Spark executor that is running 32 workers in parallel.  The query is 
a simple SELECT with several `GET_JSON_OBJECT` UDF calls.

We noticed that 80+% of the stacktrace of the worker threads are blocked on the 
following stacktrace:

```

com.fasterxml.jackson.core.util.InternCache.intern(InternCache.java:50) - 
blocked on java.lang.Object@7529fde1 
com.fasterxml.jackson.core.sym.ByteQuadsCanonicalizer.addName(ByteQuadsCanonicalizer.java:947)
 
com.fasterxml.jackson.core.json.UTF8StreamJsonParser.addName(UTF8StreamJsonParser.java:2482)
 
com.fasterxml.jackson.core.json.UTF8StreamJsonParser.findName(UTF8StreamJsonParser.java:2339)
 
com.fasterxml.jackson.core.json.UTF8StreamJsonParser.parseMediumName(UTF8StreamJsonParser.java:1870)
 
com.fasterxml.jackson.core.json.UTF8StreamJsonParser._parseName(UTF8StreamJsonParser.java:1825)
 
com.fasterxml.jackson.core.json.UTF8StreamJsonParser.nextToken(UTF8StreamJsonParser.java:798)
 
com.fasterxml.jackson.core.base.ParserMinimalBase.skipChildren(ParserMinimalBase.java:240)
 
org.apache.spark.sql.catalyst.expressions.GetJsonObjectBase.evaluatePath(jsonExpressions.scala:383)
 
org.apache.spark.sql.catalyst.expressions.GetJsonObjectBase.evaluatePath(jsonExpressions.scala:287)
 
org.apache.spark.sql.catalyst.expressions.GetJsonObjectBase.$anonfun$eval$4(jsonExpressions.scala:198)
 
org.apache.spark.sql.catalyst.expressions.GetJsonObjectBase.$anonfun$eval$4$adapted(jsonExpressions.scala:196)
 
org.apache.spark.sql.catalyst.expressions.GetJsonObjectBase$$Lambda$8585/1316745697.apply(Unknown
 Source) ...

```

 

Apparently jackson-core has such a performance bug from version 2.3 - 2.15, and 
not fixed until version 2.18 (unreleased):

[jira] [Created] (SPARK-47959) Improve GET_JSON_OBJECT performance on executors running multiple tasks

2024-04-23 Thread Zheng Shao (Jira)
Zheng Shao created SPARK-47959:
--

 Summary: Improve GET_JSON_OBJECT performance on executors running 
multiple tasks
 Key: SPARK-47959
 URL: https://issues.apache.org/jira/browse/SPARK-47959
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.5.1
Reporter: Zheng Shao


We have a Spark executor that is running 32 workers in parallel.  The query is 
a simple SELECT with several `GET_JSON_OBJECT` UDF calls.

We noticed that 80+% of the stacktrace of the worker threads are blocked on the 
following stacktrace:

```

com.fasterxml.jackson.core.util.InternCache.intern(InternCache.java:50) - 
blocked on java.lang.Object@7529fde1 
com.fasterxml.jackson.core.sym.ByteQuadsCanonicalizer.addName(ByteQuadsCanonicalizer.java:947)
 
com.fasterxml.jackson.core.json.UTF8StreamJsonParser.addName(UTF8StreamJsonParser.java:2482)
 
com.fasterxml.jackson.core.json.UTF8StreamJsonParser.findName(UTF8StreamJsonParser.java:2339)
 
com.fasterxml.jackson.core.json.UTF8StreamJsonParser.parseMediumName(UTF8StreamJsonParser.java:1870)
 
com.fasterxml.jackson.core.json.UTF8StreamJsonParser._parseName(UTF8StreamJsonParser.java:1825)
 
com.fasterxml.jackson.core.json.UTF8StreamJsonParser.nextToken(UTF8StreamJsonParser.java:798)
 
com.fasterxml.jackson.core.base.ParserMinimalBase.skipChildren(ParserMinimalBase.java:240)
 
org.apache.spark.sql.catalyst.expressions.GetJsonObjectBase.evaluatePath(jsonExpressions.scala:383)
 
org.apache.spark.sql.catalyst.expressions.GetJsonObjectBase.evaluatePath(jsonExpressions.scala:287)
 
org.apache.spark.sql.catalyst.expressions.GetJsonObjectBase.$anonfun$eval$4(jsonExpressions.scala:198)
 
org.apache.spark.sql.catalyst.expressions.GetJsonObjectBase.$anonfun$eval$4$adapted(jsonExpressions.scala:196)
 
org.apache.spark.sql.catalyst.expressions.GetJsonObjectBase$$Lambda$8585/1316745697.apply(Unknown
 Source) ...

```

 

Apparently jackson-core has such a performance bug from version 2.3 - 2.15, and 
not fixed until version 2.18 (unreleased):
[https://github.com/FasterXML/jackson-core/blob/fc51d1e13f4ba62a25a739f26be9e05aaad88c3e/src/main/java/com/fasterxml/jackson/core/util/InternCache.java#L50]

```
            synchronized (lock) {
                if (size() >= MAX_ENTRIES) {
                    clear();
                }
            }
```
 
instead of 
[https://github.com/FasterXML/jackson-core/blob/8b87cc1a96f649a7e7872c5baa8cf97909cabf6b/src/main/java/com/fasterxml/jackson/core/util/InternCache.java#L59]

```
            /* As of 2.18, the limit is not strictly enforced, but we do try to
             * clear entries if we have reached the limit. We do not expect to
             * go too much over the limit, and if we do, it's not a huge problem.
             * If some other thread has the lock, we will not clear but the lock
             * should not be held for long, so another thread should be able to
             * clear in the near future.
             */
            if (lock.tryLock()) {
                try {
                    if (size() >= DEFAULT_MAX_ENTRIES) {
                        clear();
                    }
                } finally {
                    lock.unlock();
                }
            }
```
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47958) Task Scheduler may not know about executor when using LocalSchedulerBackend

2024-04-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47958:
---
Labels: pull-request-available  (was: )

> Task Scheduler may not know about executor when using LocalSchedulerBackend
> ---
>
> Key: SPARK-47958
> URL: https://issues.apache.org/jira/browse/SPARK-47958
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 4.0.0
>Reporter: Davin Tjong
>Priority: Major
>  Labels: pull-request-available
>
> When using LocalSchedulerBackend, the task scheduler will not know about the 
> executor until a task is run, which can lead to unexpected behavior in tests.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47958) Task Scheduler may not know about executor when using LocalSchedulerBackend

2024-04-23 Thread Davin Tjong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davin Tjong updated SPARK-47958:

Component/s: Tests
 (was: Spark Core)

> Task Scheduler may not know about executor when using LocalSchedulerBackend
> ---
>
> Key: SPARK-47958
> URL: https://issues.apache.org/jira/browse/SPARK-47958
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 4.0.0
>Reporter: Davin Tjong
>Priority: Major
>
> When using LocalSchedulerBackend, the task scheduler will not know about the 
> executor until a task is run, which can lead to unexpected behavior in tests.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47958) Task Scheduler may not know about executor when using LocalSchedulerBackend

2024-04-23 Thread Davin Tjong (Jira)
Davin Tjong created SPARK-47958:
---

 Summary: Task Scheduler may not know about executor when using 
LocalSchedulerBackend
 Key: SPARK-47958
 URL: https://issues.apache.org/jira/browse/SPARK-47958
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: Davin Tjong


When using LocalSchedulerBackend, the task scheduler will not know about the 
executor until a task is run, which can lead to unexpected behavior in tests.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47956) sanity check for unresolved LCA reference

2024-04-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-47956.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46185
[https://github.com/apache/spark/pull/46185]

> sanity check for unresolved LCA reference
> -
>
> Key: SPARK-47956
> URL: https://issues.apache.org/jira/browse/SPARK-47956
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47956) sanity check for unresolved LCA reference

2024-04-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-47956:
-

Assignee: Wenchen Fan

> sanity check for unresolved LCA reference
> -
>
> Key: SPARK-47956
> URL: https://issues.apache.org/jira/browse/SPARK-47956
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47948) Upgrade the minimum Pandas version to 2.0.0

2024-04-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-47948:
-

Assignee: Haejoon Lee

> Upgrade the minimum Pandas version to 2.0.0
> ---
>
> Key: SPARK-47948
> URL: https://issues.apache.org/jira/browse/SPARK-47948
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>  Labels: pull-request-available
>
> Bump up the minimum version of Pandas from 1.4.4 to 2.0.0 to support Pandas 
> API on Spark from Apache Spark 4.0.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47948) Upgrade the minimum Pandas version to 2.0.0

2024-04-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-47948.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46175
[https://github.com/apache/spark/pull/46175]

> Upgrade the minimum Pandas version to 2.0.0
> ---
>
> Key: SPARK-47948
> URL: https://issues.apache.org/jira/browse/SPARK-47948
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Bump up the minimum version of Pandas from 1.4.4 to 2.0.0 to support Pandas 
> API on Spark from Apache Spark 4.0.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47957) pyspark.pandas.read_excel can't get _metadata because it causes a "max iterations reached for batch Resolution" error

2024-04-23 Thread Christos Karras (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christos Karras updated SPARK-47957:

Priority: Minor  (was: Major)

> pyspark.pandas.read_excel can't get _metadata because it causes a "max 
> iterations reached for batch Resolution" error
> -
>
> Key: SPARK-47957
> URL: https://issues.apache.org/jira/browse/SPARK-47957
> Project: Spark
>  Issue Type: Bug
>  Components: Pandas API on Spark
>Affects Versions: 3.5.1
>Reporter: Christos Karras
>Priority: Minor
>
> I'm trying to add _metadata.file_path to a Spark DataFrame that was read from 
> Excel files using ps.read_excel, but it causes this error: "Max iterations 
> (100) reached for batch Resolution, please set 
> 'spark.sql.analyzer.maxIterations' to a larger value". Increasing 
> spark.sql.analyzer.maxIterations to larger values does not resolve the error, 
> it just increases the execution time more and more as I try larger values.
>  
> Excel files are fairly simple (1 sheet, 8 columns, 2000 rows) and there are 
> only a few files (5)
>  
> Sample code to reproduce:
> ```python
> import pyspark.pandas as ps
> from pyspark.sql import DataFrame
> from pyspark.sql.functions import col, lit
>  
> adls_full_path: str = 
> f"abfss://contai...@azurestorageaccountname.dfs.core.windows.net/path/2024-04-01/filenamewithwildcards*.xlsx"
>  
> input_df: DataFrame = (
>     ps
>     .read_excel(adls_full_path)
>     .to_spark()
>     .withColumn("metadata_file_path", col("_metadata.file_path"))
> )
> ```
>  
> This code will raise the following error on .withColumn("metadata_file_path", 
> col("_metadata.file_path")):
> ``
> Py4JJavaError: An error occurred while calling o1835.withColumn. : 
> java.lang.RuntimeException: Max iterations (100) reached for batch 
> Resolution, please set 'spark.sql.analyzer.maxIterations' to a larger value. 
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:352)
>  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at 
> com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:94) at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.executeBatch$1(RuleExecutor.scala:289)
>  at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$9(RuleExecutor.scala:382)
>  at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$9$adapted(RuleExecutor.scala:382)
>  at scala.collection.immutable.List.foreach(List.scala:431) at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:382)
>  at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:94) 
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:256)
>  at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.executeSameContext(Analyzer.scala:415)
>  at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$execute$1(Analyzer.scala:408)
>  at 
> org.apache.spark.sql.catalyst.analysis.AnalysisContext$.withNewAnalysisContext(Analyzer.scala:322)
>  at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:408) 
> at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:341) 
> at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:248)
>  at 
> org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:166)
>  at 
> org.apache.spark.sql.catalyst.rules.RuleExecutor.executeAndTrack(RuleExecutor.scala:248)
>  at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:393)
>  at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:407)
>  at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:392)
>  at 
> org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:244)
>  at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:94) 
> at 
> org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:394)
>  at 
> org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$4(QueryExecution.scala:573)
>  at 
> org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:1079)
>  at 
> org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$2(QueryExecution.scala:573)
>  at 
> com.databricks.util.LexicalThreadLocal$Handle.runWith(LexicalThreadLocal.scala:63)
>  at 
> org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:569)
>  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:1175) at 
> 

[jira] [Updated] (SPARK-47948) Upgrade the minimum Pandas version to 2.0.0

2024-04-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-47948:
--
Summary: Upgrade the minimum Pandas version to 2.0.0  (was: Bump Pandas to 
2.0.0)

> Upgrade the minimum Pandas version to 2.0.0
> ---
>
> Key: SPARK-47948
> URL: https://issues.apache.org/jira/browse/SPARK-47948
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Haejoon Lee
>Priority: Major
>  Labels: pull-request-available
>
> Bump up the minimum version of Pandas from 1.4.4 to 2.0.0 to support Pandas 
> API on Spark from Apache Spark 4.0.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47957) pyspark.pandas.read_excel can't get _metadata because it causes a "max iterations reached for batch Resolution" error

2024-04-23 Thread Christos Karras (Jira)
Christos Karras created SPARK-47957:
---

 Summary: pyspark.pandas.read_excel can't get _metadata because it 
causes a "max iterations reached for batch Resolution" error
 Key: SPARK-47957
 URL: https://issues.apache.org/jira/browse/SPARK-47957
 Project: Spark
  Issue Type: Bug
  Components: Pandas API on Spark
Affects Versions: 3.5.1
Reporter: Christos Karras


I'm trying to add _metadata.file_path to a Spark DataFrame that was read from 
Excel files using ps.read_excel, but it causes this error: "Max iterations 
(100) reached for batch Resolution, please set 
'spark.sql.analyzer.maxIterations' to a larger value". Increasing 
spark.sql.analyzer.maxIterations to larger values does not resolve the error, 
it just increases the execution time more and more as I try larger values.
 
Excel files are fairly simple (1 sheet, 8 columns, 2000 rows) and there are 
only a few files (5)

 
Sample code to reproduce:
```python
import pyspark.pandas as ps
from pyspark.sql import DataFrame
from pyspark.sql.functions import col, lit
 
adls_full_path: str = 
f"abfss://contai...@azurestorageaccountname.dfs.core.windows.net/path/2024-04-01/filenamewithwildcards*.xlsx"
 
input_df: DataFrame = (
    ps
    .read_excel(adls_full_path)
    .to_spark()
    .withColumn("metadata_file_path", col("_metadata.file_path"))
)
```

 

This code will raise the following error on .withColumn("metadata_file_path", 
col("_metadata.file_path")):

``

Py4JJavaError: An error occurred while calling o1835.withColumn. : 
java.lang.RuntimeException: Max iterations (100) reached for batch Resolution, 
please set 'spark.sql.analyzer.maxIterations' to a larger value. at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:352)
 at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at 
com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:94) at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.executeBatch$1(RuleExecutor.scala:289)
 at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$9(RuleExecutor.scala:382)
 at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$9$adapted(RuleExecutor.scala:382)
 at scala.collection.immutable.List.foreach(List.scala:431) at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:382)
 at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:94) at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:256)
 at 
org.apache.spark.sql.catalyst.analysis.Analyzer.executeSameContext(Analyzer.scala:415)
 at 
org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$execute$1(Analyzer.scala:408)
 at 
org.apache.spark.sql.catalyst.analysis.AnalysisContext$.withNewAnalysisContext(Analyzer.scala:322)
 at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:408) 
at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:341) 
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:248)
 at 
org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:166)
 at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.executeAndTrack(RuleExecutor.scala:248)
 at 
org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:393)
 at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:407)
 at 
org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:392)
 at 
org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:244)
 at com.databricks.spark.util.FrameProfiler$.record(FrameProfiler.scala:94) at 
org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:394)
 at 
org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$4(QueryExecution.scala:573)
 at 
org.apache.spark.sql.execution.QueryExecution$.withInternalError(QueryExecution.scala:1079)
 at 
org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$2(QueryExecution.scala:573)
 at 
com.databricks.util.LexicalThreadLocal$Handle.runWith(LexicalThreadLocal.scala:63)
 at 
org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:569)
 at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:1175) at 
org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:569)
 at 
org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:238)
 at 
org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:237)
 at 
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:219)
 at org.apache.spark.sql.Dataset$.$anonfun$ofRows$1(Dataset.scala:102) at 

[jira] [Resolved] (SPARK-47949) MsSQLServer: Bump up docker image version to2022-CU12-GDR1-ubuntu-22.04

2024-04-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-47949.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46176
[https://github.com/apache/spark/pull/46176]

> MsSQLServer: Bump up docker image version to2022-CU12-GDR1-ubuntu-22.04
> ---
>
> Key: SPARK-47949
> URL: https://issues.apache.org/jira/browse/SPARK-47949
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Docker
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> https://mcr.microsoft.com/en-us/product/mssql/server/tags



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47949) MsSQLServer: Bump up docker image version to2022-CU12-GDR1-ubuntu-22.04

2024-04-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-47949:
-

Assignee: Kent Yao

> MsSQLServer: Bump up docker image version to2022-CU12-GDR1-ubuntu-22.04
> ---
>
> Key: SPARK-47949
> URL: https://issues.apache.org/jira/browse/SPARK-47949
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Docker
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>
> https://mcr.microsoft.com/en-us/product/mssql/server/tags



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47953) MsSQLServer: Document Mapping Spark SQL Data Types to Microsoft SQL Server

2024-04-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-47953:
-

Assignee: Kent Yao

> MsSQLServer: Document Mapping Spark SQL Data Types to Microsoft SQL Server
> --
>
> Key: SPARK-47953
> URL: https://issues.apache.org/jira/browse/SPARK-47953
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47953) MsSQLServer: Document Mapping Spark SQL Data Types to Microsoft SQL Server

2024-04-23 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-47953.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46177
[https://github.com/apache/spark/pull/46177]

> MsSQLServer: Document Mapping Spark SQL Data Types to Microsoft SQL Server
> --
>
> Key: SPARK-47953
> URL: https://issues.apache.org/jira/browse/SPARK-47953
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47873) Write collated strings to hive as regular strings

2024-04-23 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-47873:
---

Assignee: Stefan Kandic

> Write collated strings to hive as regular strings
> -
>
> Key: SPARK-47873
> URL: https://issues.apache.org/jira/browse/SPARK-47873
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 4.0.0
>Reporter: Stefan Kandic
>Assignee: Stefan Kandic
>Priority: Major
>  Labels: pull-request-available
>
> As hive doesn't support collations we should write collated strings with a 
> regular string type but keep the collation in table metadata to properly read 
> them back.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47873) Write collated strings to hive as regular strings

2024-04-23 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-47873.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46083
[https://github.com/apache/spark/pull/46083]

> Write collated strings to hive as regular strings
> -
>
> Key: SPARK-47873
> URL: https://issues.apache.org/jira/browse/SPARK-47873
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 4.0.0
>Reporter: Stefan Kandic
>Assignee: Stefan Kandic
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> As hive doesn't support collations we should write collated strings with a 
> regular string type but keep the collation in table metadata to properly read 
> them back.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47956) sanity check for unresolved LCA reference

2024-04-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47956:
---
Labels: pull-request-available  (was: )

> sanity check for unresolved LCA reference
> -
>
> Key: SPARK-47956
> URL: https://issues.apache.org/jira/browse/SPARK-47956
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Wenchen Fan
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47956) sanity check for unresolved LCA reference

2024-04-23 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-47956:
---

 Summary: sanity check for unresolved LCA reference
 Key: SPARK-47956
 URL: https://issues.apache.org/jira/browse/SPARK-47956
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Wenchen Fan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47954) Support creating ingress entry for external UI access

2024-04-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47954:
---
Labels: pull-request-available  (was: )

> Support creating ingress entry for external UI access
> -
>
> Key: SPARK-47954
> URL: https://issues.apache.org/jira/browse/SPARK-47954
> Project: Spark
>  Issue Type: Test
>  Components: Kubernetes
>Affects Versions: 4.0.0
>Reporter: Cheng Pan
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47954) Support creating ingress entry for external UI access

2024-04-23 Thread Cheng Pan (Jira)
Cheng Pan created SPARK-47954:
-

 Summary: Support creating ingress entry for external UI access
 Key: SPARK-47954
 URL: https://issues.apache.org/jira/browse/SPARK-47954
 Project: Spark
  Issue Type: Test
  Components: Kubernetes
Affects Versions: 4.0.0
Reporter: Cheng Pan






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47952) Support retrieving the real SparkConnectService GRPC address and port programmatically when running on Yarn

2024-04-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47952:
---
Labels: pull-request-available  (was: )

> Support retrieving the real SparkConnectService GRPC address and port 
> programmatically when running on Yarn
> ---
>
> Key: SPARK-47952
> URL: https://issues.apache.org/jira/browse/SPARK-47952
> Project: Spark
>  Issue Type: Story
>  Components: Connect
>Affects Versions: 4.0.0
>Reporter: TakawaAkirayo
>Priority: Minor
>  Labels: pull-request-available
>
> 1.User Story:
> Our data analysts and data scientists use Jupyter notebooks provisioned on 
> Kubernetes (k8s) with limited CPU/memory resources to run Spark-shell/pyspark 
> in the terminal via Yarn Client mode. However, Yarn Client mode consumes 
> significant local memory if the job is heavy, and the total resource pool of 
> k8s for notebooks is limited. To leverage the abundant resources of our 
> Hadoop cluster for scalability purposes, we aim to utilize SparkConnect. This 
> allows the driver on Yarn with SparkConnectService started and uses 
> SparkConnect client to connect to the remote driver.
> To provide a seamless experience with one command startup for both server and 
> client, we've wrapped the following processes in one script:
> 1) Start a local coordinator server (implemented by us, not in this PR) with 
> a specified port.
> 2) Start SparkConnectServer by spark-submit via Yarn Cluster mode with 
> user-input Spark configurations and the local coordinator server's address 
> and port. Append an additional listener class in the configuration for 
> SparkConnectService callback with the actual address and port on Yarn to the 
> coordinator server.
> 3) Wait for the coordinator server to receive the address callback from the 
> SparkConnectService on Yarn and export the real address.
> 4) Start the client (pyspark --remote) with the remote address.
> Finally, a remote SparkConnect Server is started on Yarn with a local 
> SparkConnect client connected. Users no longer need to start the server 
> beforehand and connect to the remote server after they manually explore the 
> address on Yarn.
> 2.Problem statement of this change:
> 1) The specified port for the SparkConnectService GRPC server might be 
> occupied on the node of the Hadoop Cluster. To increase the success rate of 
> startup, it needs to retry on conflicts rather than fail directly.
> 2) Because the final binding port could be uncertain based on #1 and the 
> remote address is unpredictable on Yarn, we need to retrieve the address and 
> port programmatically and inject it automatically on the start of `pyspark 
> --remote`. The SparkConnectService needs to communicate its location back to 
> the launcher side.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47352) Fix Upper, Lower, InitCap collation awareness

2024-04-23 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-47352.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46104
[https://github.com/apache/spark/pull/46104]

> Fix Upper, Lower, InitCap collation awareness
> -
>
> Key: SPARK-47352
> URL: https://issues.apache.org/jira/browse/SPARK-47352
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Assignee: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-46841) Language support for collations

2024-04-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46841?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-46841:
---
Labels: pull-request-available  (was: )

> Language support for collations
> ---
>
> Key: SPARK-46841
> URL: https://issues.apache.org/jira/browse/SPARK-46841
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: Aleksandar Tomic
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47952) Support retrieving the real SparkConnectService GRPC address and port programmatically when running on Yarn

2024-04-23 Thread TakawaAkirayo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

TakawaAkirayo updated SPARK-47952:
--
Description: 
1.User Story:
Our data analysts and data scientists use Jupyter notebooks provisioned on 
Kubernetes (k8s) with limited CPU/memory resources to run Spark-shell/pyspark 
in the terminal via Yarn Client mode. However, Yarn Client mode consumes 
significant local memory if the job is heavy, and the total resource pool of 
k8s for notebooks is limited. To leverage the abundant resources of our Hadoop 
cluster for scalability purposes, we aim to utilize SparkConnect. This allows 
the driver to run on Yarn with SparkConnectService started, while a SparkConnect 
client connects to that remote driver.

To provide a seamless experience with one command startup for both server and 
client, we've wrapped the following processes in one script:

1) Start a local coordinator server (implemented by us, not in this PR) with a 
specified port.
2) Start SparkConnectServer by spark-submit via Yarn Cluster mode with 
user-input Spark configurations and the local coordinator server's address and 
port. Append an additional listener class in the configuration for 
SparkConnectService callback with the actual address and port on Yarn to the 
coordinator server.
3) Wait for the coordinator server to receive the address callback from the 
SparkConnectService on Yarn and export the real address.
4) Start the client (pyspark --remote) with the remote address.

Finally, a remote SparkConnect Server is started on Yarn with a local 
SparkConnect client connected. Users no longer need to start the server 
beforehand and then connect to it after manually looking up its address on Yarn.

2.Problem statement of this change:
1) The specified port for the SparkConnectService GRPC server might be occupied 
on the node of the Hadoop Cluster. To increase the success rate of startup, it 
needs to retry on conflicts rather than fail directly.
2) Because the final binding port could be uncertain based on #1 and the remote 
address is unpredictable on Yarn, we need to retrieve the address and port 
programmatically and inject it automatically on the start of `pyspark 
--remote`. The SparkConnectService needs to communicate its location back to 
the launcher side.
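
For illustration only, a minimal JVM-level sketch of the retry-on-conflict idea in point 1) above. The names here (PortRetrySketch, findBindablePort, the 15002 starting port, the sequential port range) are hypothetical and are not the actual SparkConnectService code; a real server would bind directly rather than probe, but the retry shape is the same.
{code:java}
import java.io.IOException;
import java.net.ServerSocket;

public class PortRetrySketch {
    // Probe ports starting at preferredPort and return the first one that can be bound,
    // instead of failing outright when the preferred port is already occupied.
    static int findBindablePort(int preferredPort, int maxAttempts) throws IOException {
        IOException last = null;
        for (int i = 0; i < maxAttempts; i++) {
            int candidate = preferredPort + i;
            try (ServerSocket probe = new ServerSocket(candidate)) {
                return candidate;            // port is free; the GRPC server would use it
            } catch (IOException e) {
                last = e;                    // port already taken, try the next one
            }
        }
        throw last != null ? last : new IOException("no ports attempted");
    }

    public static void main(String[] args) throws IOException {
        int port = findBindablePort(15002, 10);
        // The launcher-side coordinator would then be told this real port via the callback listener.
        System.out.println("bound port candidate: " + port);
    }
}
{code}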

  was:
1.User Story:
Our data analysts and data scientists use Jupyter notebooks provisioned on 
Kubernetes (k8s) with limited CPU/memory resources to run Spark-shell/pyspark 
in the terminal via Yarn Client mode. However, Yarn Client mode consumes 
significant local memory if the job is heavy, and the total resource pool of 
k8s for notebooks is limited. To leverage the abundant resources of our Hadoop 
cluster for scalability purposes, we aim to utilize SparkConnect. This allows 
the driver on Yarn with SparkConnectService started and uses SparkConnect 
client to connect to the remote driver.

To provide a seamless experience with one command startup for both server and 
client, we've wrapped the following processes in one script:

1) Start a local coordinator server (implemented by us, not in this PR) with a 
specified port.
2) Start SparkConnectServer by spark-submit via Yarn Cluster mode with 
user-input Spark configurations and the local coordinator server's address and 
port. Append an additional listener class in the configuration for 
SparkConnectService callback with the actual address and port on Yarn to the 
coordinator server.
3) Wait for the coordinator server to receive the address callback from the 
SparkConnectService on Yarn and export the real address.
4) Start the client (pyspark --remote) with the remote address.

2.Problem statement of this change:
1) The specified port for the SparkConnectService GRPC server might be occupied 
on the node of the Hadoop Cluster. To increase the success rate of startup, it 
needs to retry on conflicts rather than fail directly.
2) Because the final binding port could be uncertain based on #1 and the remote 
address is unpredictable on Yarn, we need to retrieve the address and port 
programmatically and inject it automatically on the start of `pyspark 
--remote`. The SparkConnectService needs to communicate its location back to 
the launcher side.


> Support retrieving the real SparkConnectService GRPC address and port 
> programmatically when running on Yarn
> ---
>
> Key: SPARK-47952
> URL: https://issues.apache.org/jira/browse/SPARK-47952
> Project: Spark
>  Issue Type: Story
>  Components: Connect
>Affects Versions: 4.0.0
>Reporter: TakawaAkirayo
>Priority: Minor
>
> 1.User Story:
> Our data analysts and data scientists use Jupyter notebooks provisioned on 
> Kubernetes (k8s) with limited CPU/memory resources to run 

[jira] [Resolved] (SPARK-47805) [Arbitrary State Support] State TTL support - MapState

2024-04-23 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-47805.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 45991
[https://github.com/apache/spark/pull/45991]

> [Arbitrary State Support] State TTL support - MapState
> --
>
> Key: SPARK-47805
> URL: https://issues.apache.org/jira/browse/SPARK-47805
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 4.0.0
>Reporter: Eric Marnadi
>Assignee: Eric Marnadi
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Add support for expiring state value based on ttl for Map State in 
> transformWithState operator.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47952) Support retrieving the real SparkConnectService GRPC address and port programmatically when running on Yarn

2024-04-23 Thread TakawaAkirayo (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

TakawaAkirayo updated SPARK-47952:
--
Description: 
1.User Story:
Our data analysts and data scientists use Jupyter notebooks provisioned on 
Kubernetes (k8s) with limited CPU/memory resources to run Spark-shell/pyspark 
in the terminal via Yarn Client mode. However, Yarn Client mode consumes 
significant local memory if the job is heavy, and the total resource pool of 
k8s for notebooks is limited. To leverage the abundant resources of our Hadoop 
cluster for scalability purposes, we aim to utilize SparkConnect. This allows 
the driver on Yarn with SparkConnectService started and uses SparkConnect 
client to connect to the remote driver.

To provide a seamless experience with one command startup for both server and 
client, we've wrapped the following processes in one script:

1) Start a local coordinator server (implemented by us, not in this PR) with a 
specified port.
2) Start SparkConnectServer by spark-submit via Yarn Cluster mode with 
user-input Spark configurations and the local coordinator server's address and 
port. Append an additional listener class in the configuration for 
SparkConnectService callback with the actual address and port on Yarn to the 
coordinator server.
3) Wait for the coordinator server to receive the address callback from the 
SparkConnectService on Yarn and export the real address.
4) Start the client (pyspark --remote) with the remote address.

2.Problem statement of this change:
1) The specified port for the SparkConnectService GRPC server might be occupied 
on the node of the Hadoop Cluster. To increase the success rate of startup, it 
needs to retry on conflicts rather than fail directly.
2) Because the final binding port could be uncertain based on #1 and the remote 
address is unpredictable on Yarn, we need to retrieve the address and port 
programmatically and inject it automatically on the start of `pyspark 
--remote`. The SparkConnectService needs to communicate its location back to 
the launcher side.

  was:
User Story:
Our data analysts and data scientists use Jupyter notebooks provisioned on 
Kubernetes (k8s) with limited CPU/memory resources to run Spark-shell/pyspark 
in the terminal via Yarn Client mode. However, Yarn Client mode consumes 
significant local memory if the job is heavy, and the total resource pool of 
k8s for notebooks is limited. To leverage the abundant resources of our Hadoop 
cluster for scalability purposes, we aim to utilize SparkConnect. This allows 
the driver on Yarn with SparkConnectService started and uses SparkConnect 
client to connect to the remote driver.

To provide a seamless experience with one command startup for both server and 
client, we've wrapped the following processes in one script:

1. Start a local coordinator server (implemented by us, not in this PR) with a 
specified port.
2. Start SparkConnectServer by spark-submit via Yarn Cluster mode with 
user-input Spark configurations and the local coordinator server's address and 
port. Append an additional listener class in the configuration for 
SparkConnectService callback with the actual address and port on Yarn to the 
coordinator server.
3. Wait for the coordinator server to receive the address callback from the 
SparkConnectService on Yarn and export the real address.
4. Start the client (pyspark --remote) with the remote address.

Problem statement of this change:
1. The specified port for the SparkConnectService GRPC server might be occupied 
on the node of the Hadoop Cluster. To increase the success rate of startup, it 
needs to retry on conflicts rather than fail directly.
2. Because the final binding port could be uncertain based on #1 and the remote 
address is unpredictable on Yarn, we need to retrieve the address and port 
programmatically and inject it automatically on the start of `pyspark 
--remote`. The SparkConnectService needs to communicate its location back to 
the launcher side.


> Support retrieving the real SparkConnectService GRPC address and port 
> programmatically when running on Yarn
> ---
>
> Key: SPARK-47952
> URL: https://issues.apache.org/jira/browse/SPARK-47952
> Project: Spark
>  Issue Type: Story
>  Components: Connect
>Affects Versions: 4.0.0
>Reporter: TakawaAkirayo
>Priority: Minor
>
> 1.User Story:
> Our data analysts and data scientists use Jupyter notebooks provisioned on 
> Kubernetes (k8s) with limited CPU/memory resources to run Spark-shell/pyspark 
> in the terminal via Yarn Client mode. However, Yarn Client mode consumes 
> significant local memory if the job is heavy, and the total resource pool of 
> k8s for notebooks is limited. To leverage the abundant resources of 

[jira] [Updated] (SPARK-47953) MsSQLServer: Document Mapping Spark SQL Data Types to Microsoft SQL Server

2024-04-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47953:
---
Labels: pull-request-available  (was: )

> MsSQLServer: Document Mapping Spark SQL Data Types to Microsoft SQL Server
> --
>
> Key: SPARK-47953
> URL: https://issues.apache.org/jira/browse/SPARK-47953
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47953) MsSQLServer: Document Mapping Spark SQL Data Types to Microsoft SQL Server

2024-04-23 Thread Kent Yao (Jira)
Kent Yao created SPARK-47953:


 Summary: MsSQLServer: Document Mapping Spark SQL Data Types to 
Microsoft SQL Server
 Key: SPARK-47953
 URL: https://issues.apache.org/jira/browse/SPARK-47953
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation
Affects Versions: 4.0.0
Reporter: Kent Yao






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47952) Support retrieving the real SparkConnectService GRPC address and port programmatically when running on Yarn

2024-04-23 Thread TakawaAkirayo (Jira)
TakawaAkirayo created SPARK-47952:
-

 Summary: Support retrieving the real SparkConnectService GRPC 
address and port programmatically when running on Yarn
 Key: SPARK-47952
 URL: https://issues.apache.org/jira/browse/SPARK-47952
 Project: Spark
  Issue Type: Story
  Components: Connect
Affects Versions: 4.0.0
Reporter: TakawaAkirayo


User Story:
Our data analysts and data scientists use Jupyter notebooks provisioned on 
Kubernetes (k8s) with limited CPU/memory resources to run Spark-shell/pyspark 
in the terminal via Yarn Client mode. However, Yarn Client mode consumes 
significant local memory if the job is heavy, and the total resource pool of 
k8s for notebooks is limited. To leverage the abundant resources of our Hadoop 
cluster for scalability purposes, we aim to utilize SparkConnect. This allows 
the driver to run on Yarn with SparkConnectService started, while a SparkConnect 
client connects to the remote driver.

To provide a seamless experience with one command startup for both server and 
client, we've wrapped the following processes in one script:

1. Start a local coordinator server (implemented by us, not in this PR) with a 
specified port.
2. Start SparkConnectServer by spark-submit via Yarn Cluster mode with 
user-input Spark configurations and the local coordinator server's address and 
port. Append an additional listener class in the configuration for 
SparkConnectService callback with the actual address and port on Yarn to the 
coordinator server.
3. Wait for the coordinator server to receive the address callback from the 
SparkConnectService on Yarn and export the real address.
4. Start the client (pyspark --remote) with the remote address.

Problem statement of this change:
1. The specified port for the SparkConnectService GRPC server might be occupied 
on the node of the Hadoop Cluster. To increase the success rate of startup, it 
needs to retry on conflicts rather than fail directly.
2. Because the final binding port could be uncertain based on #1 and the remote 
address is unpredictable on Yarn, we need to retrieve the address and port 
programmatically and inject it automatically on the start of `pyspark 
--remote`. The SparkConnectService needs to communicate its location back to 
the launcher side.
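
For point 1 of the problem statement, the retry-on-conflict idea can be sketched with plain Python sockets. This is a minimal illustration only; the port range, retry count, and function name are assumptions, not the actual SparkConnectService implementation:

{code:python}
import socket

def bind_with_retry(host: str, preferred_port: int, max_retries: int = 16) -> socket.socket:
    """Try the preferred port; on 'address already in use', probe the next ports."""
    last_error = None
    for offset in range(max_retries):
        candidate = preferred_port + offset
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, candidate))
            sock.listen()
            return sock  # the port actually bound is sock.getsockname()[1]
        except OSError as err:
            last_error = err
            sock.close()
    raise RuntimeError(f"no free port after {max_retries} attempts") from last_error

# The launcher would then report the real host:port back to the client side.
server = bind_with_retry("0.0.0.0", 15002)
print("GRPC server (stand-in) bound to", server.getsockname())
{code}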



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-47952) Support retrieving the real SparkConnectService GRPC address and port programmatically when running on Yarn

2024-04-23 Thread TakawaAkirayo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-47952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17839980#comment-17839980
 ] 

TakawaAkirayo commented on SPARK-47952:
---

I'm working on it

> Support retrieving the real SparkConnectService GRPC address and port 
> programmatically when running on Yarn
> ---
>
> Key: SPARK-47952
> URL: https://issues.apache.org/jira/browse/SPARK-47952
> Project: Spark
>  Issue Type: Story
>  Components: Connect
>Affects Versions: 4.0.0
>Reporter: TakawaAkirayo
>Priority: Minor
>
> User Story:
> Our data analysts and data scientists use Jupyter notebooks provisioned on 
> Kubernetes (k8s) with limited CPU/memory resources to run Spark-shell/pyspark 
> in the terminal via Yarn Client mode. However, Yarn Client mode consumes 
> significant local memory if the job is heavy, and the total resource pool of 
> k8s for notebooks is limited. To leverage the abundant resources of our 
> Hadoop cluster for scalability purposes, we aim to utilize SparkConnect. This 
> allows the driver to run on Yarn with SparkConnectService started, while a 
> SparkConnect client connects to the remote driver.
> To provide a seamless experience with one command startup for both server and 
> client, we've wrapped the following processes in one script:
> 1. Start a local coordinator server (implemented by us, not in this PR) with 
> a specified port.
> 2. Start SparkConnectServer by spark-submit via Yarn Cluster mode with 
> user-input Spark configurations and the local coordinator server's address 
> and port. Append an additional listener class in the configuration for 
> SparkConnectService callback with the actual address and port on Yarn to the 
> coordinator server.
> 3. Wait for the coordinator server to receive the address callback from the 
> SparkConnectService on Yarn and export the real address.
> 4. Start the client (pyspark --remote) with the remote address.
> Problem statement of this change:
> 1. The specified port for the SparkConnectService GRPC server might be 
> occupied on the node of the Hadoop Cluster. To increase the success rate of 
> startup, it needs to retry on conflicts rather than fail directly.
> 2. Because the final binding port could be uncertain based on #1 and the 
> remote address is unpredictable on Yarn, we need to retrieve the address and 
> port programmatically and inject it automatically on the start of `pyspark 
> --remote`. The SparkConnectService needs to communicate its location back to 
> the launcher side.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47950) Add Java API Module for Spark Operator

2024-04-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47950:
---
Labels: pull-request-available  (was: )

> Add Java API Module for Spark Operator
> --
>
> Key: SPARK-47950
> URL: https://issues.apache.org/jira/browse/SPARK-47950
> Project: Spark
>  Issue Type: Sub-task
>  Components: k8s
>Affects Versions: kubernetes-operator-0.1.0
>Reporter: Zhou JIANG
>Priority: Major
>  Labels: pull-request-available
>
> Spark Operator API refers to the 
> [CustomResourceDefinition|https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/]
> that represents the spec for a Spark Application in k8s.
> This aims to add a Java API library for Spark Operator, with the ability to 
> generate the YAML spec.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47951) Support retrieving the real SparkConnectService GRPC address and port programmatically when running on Yarn

2024-04-23 Thread TakawaAkirayo (Jira)
TakawaAkirayo created SPARK-47951:
-

 Summary: Support retrieving the real SparkConnectService GRPC 
address and port programmatically when running on Yarn
 Key: SPARK-47951
 URL: https://issues.apache.org/jira/browse/SPARK-47951
 Project: Spark
  Issue Type: Story
  Components: Connect
Affects Versions: 4.0.0
Reporter: TakawaAkirayo


1. *User Story*:
Our data analysts and data scientists use Jupyter notebooks provisioned on 
Kubernetes (k8s) with limited CPU/memory resources to run Spark-shell/pyspark 
in the terminal via Yarn Client mode.
However, Yarn Client mode consumes significant local memory if the job is 
heavy, and the total resource pool of k8s for notebooks is limited.
To leverage the abundant resources of our Hadoop cluster for scalability 
purposes, we aim to utilize SparkConnect.
This allows the driver to run on Yarn with SparkConnectService started, while a 
SparkConnect client connects to the remote driver.

To provide a seamless experience with one command startup for both server and 
client, we've wrapped the following processes in one script:

1). Start a local coordinator server (implemented by us internally, not in this 
PR) on the host of the Jupyter notebook.
2). Start SparkConnectServer by spark-submit via Yarn Cluster mode with 
user-input Spark configurations and the local coordinator server's address and 
port.
    Append an additional listener class in the configuration for 
SparkConnectService callback with the actual address and port on Yarn to the 
coordinator server.
3). Wait for the coordinator server to receive the address callback from the 
SparkConnectService on Yarn and export the real address.
4). Start the client (pyspark --remote $callback_address) with the remote 
address.

2. *Problem statement of this change*:
1). The specified port for the SparkConnectService GRPC server might be 
occupied on the node of the Hadoop Cluster.
    To increase the success rate of startup, it needs to retry on conflicts 
rather than fail directly.
2). Because the final binding port could be uncertain based on #1 when retrying, 
and the remote address is unpredictable on Yarn,
    we need to retrieve the address and port programmatically and inject them 
automatically on the start of `pyspark --remote`.
    To get the address of SparkConnectService on Yarn programmatically, the 
SparkConnectService needs to communicate its location back to the launcher side.
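
The four launcher steps above could be wired together roughly as in the following sketch. This is a hedged illustration only: the coordinator protocol, the listener class `com.example.ConnectAddressReporter`, the conf key `spark.example.coordinator.address`, and the wrapper script name are all hypothetical; only `spark.extraListeners` and `pyspark --remote sc://host:port` are existing Spark conventions.

{code:python}
import socket
import subprocess

def wait_for_callback(listen_port: int, timeout_s: float = 600.0) -> str:
    """Steps 1 and 3: a toy coordinator that waits for one 'host:port' callback line."""
    with socket.create_server(("0.0.0.0", listen_port)) as srv:
        srv.settimeout(timeout_s)
        conn, _ = srv.accept()
        with conn:
            return conn.recv(256).decode().strip()  # e.g. "yarn-node-123:15002"

coordinator_port = 19999  # hypothetical fixed port; must be reachable from the Yarn node

# Step 2: submit the Spark Connect server to Yarn in cluster mode, handing it the
# coordinator address so the (illustrative) listener can report back the real address.
subprocess.Popen([
    "spark-submit",
    "--master", "yarn",
    "--deploy-mode", "cluster",
    "--conf", "spark.extraListeners=com.example.ConnectAddressReporter",  # illustrative class
    "--conf", f"spark.example.coordinator.address=localhost:{coordinator_port}",  # hypothetical key
    "connect_server_launcher.py",  # hypothetical wrapper that starts SparkConnectService
])

# Step 3: block until the service on Yarn reports where it actually bound.
callback_address = wait_for_callback(coordinator_port)

# Step 4: hand the real address to the Spark Connect client.
subprocess.run(["pyspark", "--remote", f"sc://{callback_address}"])
{code}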



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47945) MsSQLServer: Document Mapping Spark SQL Data Types from Microsoft SQL Server and add tests

2024-04-23 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-47945.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46173
[https://github.com/apache/spark/pull/46173]

> MsSQLServer: Document Mapping Spark SQL Data Types from Microsoft SQL Server 
> and add tests
> --
>
> Key: SPARK-47945
> URL: https://issues.apache.org/jira/browse/SPARK-47945
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47945) MsSQLServer: Document Mapping Spark SQL Data Types from Microsoft SQL Server and add tests

2024-04-23 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao reassigned SPARK-47945:


Assignee: Kent Yao

> MsSQLServer: Document Mapping Spark SQL Data Types from Microsoft SQL Server 
> and add tests
> --
>
> Key: SPARK-47945
> URL: https://issues.apache.org/jira/browse/SPARK-47945
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47949) MsSQLServer: Bump up docker image version to 2022-CU12-GDR1-ubuntu-22.04

2024-04-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47949:
---
Labels: pull-request-available  (was: )

> MsSQLServer: Bump up docker image version to 2022-CU12-GDR1-ubuntu-22.04
> ---
>
> Key: SPARK-47949
> URL: https://issues.apache.org/jira/browse/SPARK-47949
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Docker
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>
> https://mcr.microsoft.com/en-us/product/mssql/server/tags



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47950) Add Java API Module for Spark Operator

2024-04-23 Thread Zhou JIANG (Jira)
Zhou JIANG created SPARK-47950:
--

 Summary: Add Java API Module for Spark Operator
 Key: SPARK-47950
 URL: https://issues.apache.org/jira/browse/SPARK-47950
 Project: Spark
  Issue Type: Sub-task
  Components: k8s
Affects Versions: kubernetes-operator-0.1.0
Reporter: Zhou JIANG


Spark Operator API refers to the 
[CustomResourceDefinition|https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/]
that represents the spec for a Spark Application in k8s.

This aims to add a Java API library for Spark Operator, with the ability to 
generate the YAML spec.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47948) Bump Pandas to 2.0.0

2024-04-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47948:
---
Labels: pull-request-available  (was: )

> Bump Pandas to 2.0.0
> 
>
> Key: SPARK-47948
> URL: https://issues.apache.org/jira/browse/SPARK-47948
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Haejoon Lee
>Priority: Major
>  Labels: pull-request-available
>
> Bump up the minimum version of Pandas from 1.4.4 to 2.0.0 to support the Pandas 
> API on Spark starting from Apache Spark 4.0.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47949) MsSQLServer: Bump up docker image version to 2022-CU12-GDR1-ubuntu-22.04

2024-04-23 Thread Kent Yao (Jira)
Kent Yao created SPARK-47949:


 Summary: MsSQLServer: Bump up docker image version 
to 2022-CU12-GDR1-ubuntu-22.04
 Key: SPARK-47949
 URL: https://issues.apache.org/jira/browse/SPARK-47949
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Docker
Affects Versions: 4.0.0
Reporter: Kent Yao


https://mcr.microsoft.com/en-us/product/mssql/server/tags



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-47948) Bump Pandas to 2.0.0

2024-04-23 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-47948:
---

 Summary: Bump Pandas to 2.0.0
 Key: SPARK-47948
 URL: https://issues.apache.org/jira/browse/SPARK-47948
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Haejoon Lee


Bump up the minimum version of Pandas from 1.4.4 to 2.0.0 to support the Pandas API 
on Spark starting from Apache Spark 4.0.0.
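
For illustration, bumping the minimum version amounts to enforcing a floor like the following at import time. This is a hedged sketch; the exact check PySpark performs is not reproduced here, and 2.0.0 is simply the floor stated above.

{code:python}
# Minimal sketch of a minimum-version guard, with 2.0.0 as the assumed new floor.
from packaging.version import Version

import pandas as pd

MINIMUM_PANDAS_VERSION = "2.0.0"

if Version(pd.__version__) < Version(MINIMUM_PANDAS_VERSION):
    raise ImportError(
        f"Pandas >= {MINIMUM_PANDAS_VERSION} must be installed; "
        f"found {pd.__version__}"
    )
{code}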



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-47351) StringToMap & Mask (all collations)

2024-04-23 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-47351:
---
Labels: pull-request-available  (was: )

> StringToMap & Mask (all collations)
> ---
>
> Key: SPARK-47351
> URL: https://issues.apache.org/jira/browse/SPARK-47351
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-47412) StringLPad, StringRPad (all collations)

2024-04-23 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-47412.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46041
[https://github.com/apache/spark/pull/46041]

> StringLPad, StringRPad (all collations)
> ---
>
> Key: SPARK-47412
> URL: https://issues.apache.org/jira/browse/SPARK-47412
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Assignee: Gideon P
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Enable collation support for the *StringLPad* & *StringRPad* built-in string 
> functions in Spark. First confirm what is the expected behaviour for these 
> functions when given collated strings, then move on to the implementation 
> that would enable handling strings of all collation types. Implement the 
> corresponding unit tests (CollationStringExpressionsSuite) and E2E tests 
> (CollationSuite) to reflect how this function should be used with collation 
> in SparkSQL, and feel free to use your chosen Spark SQL Editor to experiment 
> with the existing functions to learn more about how they work. In addition, 
> look into the possible use-cases and implementation of similar functions 
> within other open-source DBMSs, such as 
> [PostgreSQL|https://www.postgresql.org/docs/].
>  
> The goal for this Jira ticket is to implement the *StringLPad* & *StringRPad* 
> functions so that they support all collation types currently supported in 
> Spark. To understand what changes were introduced in order to enable full 
> collation support for other existing functions in Spark, take a look at the 
> Spark PRs and Jira tickets for completed tasks in this parent (for example: 
> Contains, StartsWith, EndsWith).
>  
> Read more about ICU [Collation Concepts|http://example.com/] and 
> [Collator|http://example.com/] class. Also, refer to the Unicode Technical 
> Standard for 
> [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].
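
For context, a minimal PySpark sketch of how these functions might be exercised against collated strings follows. The collation name UTF8_LCASE and the session setup are illustrative assumptions, not taken from this ticket; check the collations actually available in your Spark build.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("collated-lpad-rpad-demo").getOrCreate()

# COLLATE attaches a collation to the string literal; the padding behaviour of
# lpad/rpad should then be verified against strings of that collation type.
spark.sql("""
    SELECT lpad('abc' COLLATE UTF8_LCASE, 6, '*') AS padded_left,
           rpad('abc' COLLATE UTF8_LCASE, 6, '*') AS padded_right
""").show()

spark.stop()
{code}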



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-47412) StringLPad, StringRPad (all collations)

2024-04-23 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-47412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-47412:
---

Assignee: Gideon P

> StringLPad, StringRPad (all collations)
> ---
>
> Key: SPARK-47412
> URL: https://issues.apache.org/jira/browse/SPARK-47412
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Uroš Bojanić
>Assignee: Gideon P
>Priority: Major
>  Labels: pull-request-available
>
> Enable collation support for the *StringLPad* & *StringRPad* built-in string 
> functions in Spark. First confirm what is the expected behaviour for these 
> functions when given collated strings, then move on to the implementation 
> that would enable handling strings of all collation types. Implement the 
> corresponding unit tests (CollationStringExpressionsSuite) and E2E tests 
> (CollationSuite) to reflect how this function should be used with collation 
> in SparkSQL, and feel free to use your chosen Spark SQL Editor to experiment 
> with the existing functions to learn more about how they work. In addition, 
> look into the possible use-cases and implementation of similar functions 
> within other open-source DBMSs, such as 
> [PostgreSQL|https://www.postgresql.org/docs/].
>  
> The goal for this Jira ticket is to implement the *StringLPad* & *StringRPad* 
> functions so that they support all collation types currently supported in 
> Spark. To understand what changes were introduced in order to enable full 
> collation support for other existing functions in Spark, take a look at the 
> Spark PRs and Jira tickets for completed tasks in this parent (for example: 
> Contains, StartsWith, EndsWith).
>  
> Read more about ICU [Collation Concepts|http://example.com/] and 
> [Collator|http://example.com/] class. Also, refer to the Unicode Technical 
> Standard for 
> [collation|https://www.unicode.org/reports/tr35/tr35-collation.html#Collation_Type_Fallback].



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org