[jira] [Reopened] (SPARK-38341) Spark sql: 3.2.1 - Function of add_ Months returns an incorrect date

2023-10-09 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reopened SPARK-38341:
-

> Spark sql: 3.2.1 - Function of add_ Months returns an incorrect date
> 
>
> Key: SPARK-38341
> URL: https://issues.apache.org/jira/browse/SPARK-38341
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: davon.cao
>Priority: Major
>
> Steps to reproduce:
> Version of Spark SQL: 3.2.1 (latest version in the Maven repository)
> Run SQL:
> spark.sql("""SELECT ADD_MONTHS(last_day('2020-06-30'), -1)""").toPandas()
> expected: 2020-05-31
> actual: 2020-05-30 (x)
>  
> Version of Spark SQL: 2.4.3
> spark.sql("""SELECT ADD_MONTHS(last_day('2020-06-30'), -1)""").toPandas()
> expected: 2020-05-31
> actual: 2020-05-31 (/)
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38341) Spark sql: 3.2.1 - Function of add_ Months returns an incorrect date

2023-10-09 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17773146#comment-17773146
 ] 

Wenchen Fan commented on SPARK-38341:
-

It seems to be common behavior to preserve the end-of-month information; see 
also https://issues.apache.org/jira/browse/SPARK-38341

[~maxgekk] can you take a look? This is probably caused by the switch to the 
Java 8 time API.

> Spark sql: 3.2.1 - Function of add_ Months returns an incorrect date
> 
>
> Key: SPARK-38341
> URL: https://issues.apache.org/jira/browse/SPARK-38341
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: davon.cao
>Priority: Major
>
> Steps to reproduce:
> Version of Spark SQL: 3.2.1 (latest version in the Maven repository)
> Run SQL:
> spark.sql("""SELECT ADD_MONTHS(last_day('2020-06-30'), -1)""").toPandas()
> expected: 2020-05-31
> actual: 2020-05-30 (x)
>  
> Version of Spark SQL: 2.4.3
> spark.sql("""SELECT ADD_MONTHS(last_day('2020-06-30'), -1)""").toPandas()
> expected: 2020-05-31
> actual: 2020-05-31 (/)
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38341) Spark sql: 3.2.1 - Function of add_ Months returns an incorrect date

2023-10-09 Thread Max Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17773153#comment-17773153
 ] 

Max Gekk commented on SPARK-38341:
--

The SQL migration guide documents the behaviour change at 
https://spark.apache.org/docs/latest/sql-migration-guide.html#upgrading-from-spark-sql-24-to-30:
"In Spark 3.0, the add_months function does not adjust the resulting date to a 
last day of month if the original date is a last day of months. For example, 
select add_months(DATE'2019-02-28', 1) results 2019-03-28. In Spark version 2.4 
and below, the resulting date is adjusted when the original date is a last day 
of months. For example, adding a month to 2019-02-28 results in 2019-03-31."
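
For reference, a minimal spark-shell sketch of the documented behavior (the output
values are taken from the migration guide quote above and from this report, not from
a new test run):

{code:java}
// Spark 3.x: add_months no longer pins the result to the last day of the month.
spark.sql("SELECT add_months(DATE'2019-02-28', 1)").show()         // 2019-03-28
spark.sql("SELECT add_months(last_day('2020-06-30'), -1)").show()  // 2020-05-30
// Spark 2.4 and below adjusted to the month's last day: 2019-03-31 and 2020-05-31.
{code}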

> Spark sql: 3.2.1 - Function of add_ Months returns an incorrect date
> 
>
> Key: SPARK-38341
> URL: https://issues.apache.org/jira/browse/SPARK-38341
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
>Reporter: davon.cao
>Priority: Major
>
> Steps to reproduce:
> Version of Spark SQL: 3.2.1 (latest version in the Maven repository)
> Run SQL:
> spark.sql("""SELECT ADD_MONTHS(last_day('2020-06-30'), -1)""").toPandas()
> expected: 2020-05-31
> actual: 2020-05-30 (x)
>  
> Version of Spark SQL: 2.4.3
> spark.sql("""SELECT ADD_MONTHS(last_day('2020-06-30'), -1)""").toPandas()
> expected: 2020-05-31
> actual: 2020-05-31 (/)
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45463) Allow ShuffleDriverComponent to support reliable store with specified executorId

2023-10-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-45463:
--

Assignee: Apache Spark

> Allow ShuffleDriverComponent to support reliable store with specified 
> executorId
> 
>
> Key: SPARK-45463
> URL: https://issues.apache.org/jira/browse/SPARK-45463
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0, 3.5.1
>Reporter: zhoubin
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>
> After SPARK-42689, ShuffleDriverComponent.supportsReliableStorage is 
> determined globally.
> Downstream projects may have different shuffle policies for different stages 
> (caused by cluster load or columnar support), for example Apache Uniffle with 
> Gluten, or Apache Celeborn.
> In this situation, ShuffleDriverComponent should use the MapOutputTrackerMaster 
> to decide whether reliable storage is supported for the specified executorId.
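
For illustration only, a hedged Scala sketch of the per-executor check this ticket
suggests; the method name and signature below are hypothetical, not the actual Spark API:

{code:java}
trait PerExecutorReliableStorage {
  // existing global capability flag (mirrors ShuffleDriverComponents.supportsReliableStorage)
  def supportsReliableStorage(): Boolean

  // hypothetical per-executor variant: consult driver-side state (e.g. the map output
  // tracker) to decide whether shuffle data written by this executor is reliably stored
  def supportsReliableStorage(executorId: String): Boolean =
    supportsReliableStorage() // default falls back to the global behavior
}
{code}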



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45463) Allow ShuffleDriverComponent to support reliable store with specified executorId

2023-10-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-45463:
--

Assignee: (was: Apache Spark)

> Allow ShuffleDriverComponent to support reliable store with specified 
> executorId
> 
>
> Key: SPARK-45463
> URL: https://issues.apache.org/jira/browse/SPARK-45463
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0, 3.5.1
>Reporter: zhoubin
>Priority: Major
>  Labels: pull-request-available
>
> After SPARK-42689, ShuffleDriverComponent.supportsReliableStorage is 
> determined globally.
> Downstream projects may have different shuffle policies for different stages 
> (caused by cluster load or columnar support), for example Apache Uniffle with 
> Gluten, or Apache Celeborn.
> In this situation, ShuffleDriverComponent should use the MapOutputTrackerMaster 
> to decide whether reliable storage is supported for the specified executorId.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45455) [SQL][JDBC] Improve the rename interface of Postgres Dialect

2023-10-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-45455:
--

Assignee: Apache Spark

> [SQL][JDBC] Improve the rename interface of Postgres Dialect
> 
>
> Key: SPARK-45455
> URL: https://issues.apache.org/jira/browse/SPARK-45455
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: 蔡灿材
>Assignee: Apache Spark
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>
> Improve the rename interface of pgdialect



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45455) [SQL][JDBC] Improve the rename interface of Postgres Dialect

2023-10-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-45455:
--

Assignee: (was: Apache Spark)

> [SQL][JDBC] Improve the rename interface of Postgres Dialect
> 
>
> Key: SPARK-45455
> URL: https://issues.apache.org/jira/browse/SPARK-45455
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: 蔡灿材
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>
> Improve the rename interface of pgdialect



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45424) Regression in CSV schema inference when timestamps do not match specified timestampFormat

2023-10-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-45424:
--

Assignee: Apache Spark

> Regression in CSV schema inference when timestamps do not match specified 
> timestampFormat
> -
>
> Key: SPARK-45424
> URL: https://issues.apache.org/jira/browse/SPARK-45424
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Andy Grove
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>
> There is a regression in Spark 3.5.0 when inferring the schema of CSV files 
> containing timestamps, where a column will be inferred as a timestamp even if 
> the contents do not match the specified timestampFormat.
> *Test Data*
> I have the following CSV file:
> {code:java}
> 2884-06-24T02:45:51.138
> 2884-06-24T02:45:51.138
> 2884-06-24T02:45:51.138
> {code}
> *Spark 3.4.0 Behavior (correct)*
> In Spark 3.4.0, if I specify the correct timestamp format, then the schema is 
> inferred as timestamp:
> {code:java}
> scala> val df = spark.read.option("timestampFormat", 
> "-MM-dd'T'HH:mm:ss.SSS").option("inferSchema", 
> true).csv("/tmp/timestamps.csv")
> df: org.apache.spark.sql.DataFrame = [_c0: timestamp]
> {code}
> If I specify an incompatible timestampFormat, then the schema is inferred as 
> string:
> {code:java}
> scala> val df = spark.read.option("timestampFormat", 
> "-MM-dd'T'HH:mm:ss").option("inferSchema", 
> true).csv("/tmp/timestamps.csv")
> df: org.apache.spark.sql.DataFrame = [_c0: string]
> {code}
> *Spark 3.5.0*
> In Spark 3.5.0, the column will be inferred as timestamp even if the data 
> does not match the specified timestampFormat.
> {code:java}
> scala> val df = spark.read.option("timestampFormat", 
> "-MM-dd'T'HH:mm:ss").option("inferSchema", 
> true).csv("/tmp/timestamps.csv")
> df: org.apache.spark.sql.DataFrame = [_c0: timestamp]
> {code}
> Reading the DataFrame then results in an error:
> {code:java}
> Caused by: java.time.format.DateTimeParseException: Text 
> '2884-06-24T02:45:51.138' could not be parsed, unparsed text found at index 19
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45424) Regression in CSV schema inference when timestamps do not match specified timestampFormat

2023-10-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-45424:
--

Assignee: (was: Apache Spark)

> Regression in CSV schema inference when timestamps do not match specified 
> timestampFormat
> -
>
> Key: SPARK-45424
> URL: https://issues.apache.org/jira/browse/SPARK-45424
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Andy Grove
>Priority: Major
>  Labels: pull-request-available
>
> There is a regression in Spark 3.5.0 when inferring the schema of CSV files 
> containing timestamps, where a column will be inferred as a timestamp even if 
> the contents do not match the specified timestampFormat.
> *Test Data*
> I have the following CSV file:
> {code:java}
> 2884-06-24T02:45:51.138
> 2884-06-24T02:45:51.138
> 2884-06-24T02:45:51.138
> {code}
> *Spark 3.4.0 Behavior (correct)*
> In Spark 3.4.0, if I specify the correct timestamp format, then the schema is 
> inferred as timestamp:
> {code:java}
> scala> val df = spark.read.option("timestampFormat", 
> "-MM-dd'T'HH:mm:ss.SSS").option("inferSchema", 
> true).csv("/tmp/timestamps.csv")
> df: org.apache.spark.sql.DataFrame = [_c0: timestamp]
> {code}
> If I specify an incompatible timestampFormat, then the schema is inferred as 
> string:
> {code:java}
> scala> val df = spark.read.option("timestampFormat", 
> "-MM-dd'T'HH:mm:ss").option("inferSchema", 
> true).csv("/tmp/timestamps.csv")
> df: org.apache.spark.sql.DataFrame = [_c0: string]
> {code}
> *Spark 3.5.0*
> In Spark 3.5.0, the column will be inferred as timestamp even if the data 
> does not match the specified timestampFormat.
> {code:java}
> scala> val df = spark.read.option("timestampFormat", 
> "-MM-dd'T'HH:mm:ss").option("inferSchema", 
> true).csv("/tmp/timestamps.csv")
> df: org.apache.spark.sql.DataFrame = [_c0: timestamp]
> {code}
> Reading the DataFrame then results in an error:
> {code:java}
> Caused by: java.time.format.DateTimeParseException: Text 
> '2884-06-24T02:45:51.138' could not be parsed, unparsed text found at index 19
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45424) Regression in CSV schema inference when timestamps do not match specified timestampFormat

2023-10-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-45424:
--

Assignee: (was: Apache Spark)

> Regression in CSV schema inference when timestamps do not match specified 
> timestampFormat
> -
>
> Key: SPARK-45424
> URL: https://issues.apache.org/jira/browse/SPARK-45424
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Andy Grove
>Priority: Major
>  Labels: pull-request-available
>
> There is a regression in Spark 3.5.0 when inferring the schema of CSV files 
> containing timestamps, where a column will be inferred as a timestamp even if 
> the contents do not match the specified timestampFormat.
> *Test Data*
> I have the following CSV file:
> {code:java}
> 2884-06-24T02:45:51.138
> 2884-06-24T02:45:51.138
> 2884-06-24T02:45:51.138
> {code}
> *Spark 3.4.0 Behavior (correct)*
> In Spark 3.4.0, if I specify the correct timestamp format, then the schema is 
> inferred as timestamp:
> {code:java}
> scala> val df = spark.read.option("timestampFormat", 
> "-MM-dd'T'HH:mm:ss.SSS").option("inferSchema", 
> true).csv("/tmp/timestamps.csv")
> df: org.apache.spark.sql.DataFrame = [_c0: timestamp]
> {code}
> If I specify an incompatible timestampFormat, then the schema is inferred as 
> string:
> {code:java}
> scala> val df = spark.read.option("timestampFormat", 
> "-MM-dd'T'HH:mm:ss").option("inferSchema", 
> true).csv("/tmp/timestamps.csv")
> df: org.apache.spark.sql.DataFrame = [_c0: string]
> {code}
> *Spark 3.5.0*
> In Spark 3.5.0, the column will be inferred as timestamp even if the data 
> does not match the specified timestampFormat.
> {code:java}
> scala> val df = spark.read.option("timestampFormat", 
> "-MM-dd'T'HH:mm:ss").option("inferSchema", 
> true).csv("/tmp/timestamps.csv")
> df: org.apache.spark.sql.DataFrame = [_c0: timestamp]
> {code}
> Reading the DataFrame then results in an error:
> {code:java}
> Caused by: java.time.format.DateTimeParseException: Text 
> '2884-06-24T02:45:51.138' could not be parsed, unparsed text found at index 19
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45424) Regression in CSV schema inference when timestamps do not match specified timestampFormat

2023-10-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-45424:
--

Assignee: Apache Spark

> Regression in CSV schema inference when timestamps do not match specified 
> timestampFormat
> -
>
> Key: SPARK-45424
> URL: https://issues.apache.org/jira/browse/SPARK-45424
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Andy Grove
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>
> There is a regression in Spark 3.5.0 when inferring the schema of CSV files 
> containing timestamps, where a column will be inferred as a timestamp even if 
> the contents do not match the specified timestampFormat.
> *Test Data*
> I have the following CSV file:
> {code:java}
> 2884-06-24T02:45:51.138
> 2884-06-24T02:45:51.138
> 2884-06-24T02:45:51.138
> {code}
> *Spark 3.4.0 Behavior (correct)*
> In Spark 3.4.0, if I specify the correct timestamp format, then the schema is 
> inferred as timestamp:
> {code:java}
> scala> val df = spark.read.option("timestampFormat", 
> "-MM-dd'T'HH:mm:ss.SSS").option("inferSchema", 
> true).csv("/tmp/timestamps.csv")
> df: org.apache.spark.sql.DataFrame = [_c0: timestamp]
> {code}
> If I specify an incompatible timestampFormat, then the schema is inferred as 
> string:
> {code:java}
> scala> val df = spark.read.option("timestampFormat", 
> "-MM-dd'T'HH:mm:ss").option("inferSchema", 
> true).csv("/tmp/timestamps.csv")
> df: org.apache.spark.sql.DataFrame = [_c0: string]
> {code}
> *Spark 3.5.0*
> In Spark 3.5.0, the column will be inferred as timestamp even if the data 
> does not match the specified timestampFormat.
> {code:java}
> scala> val df = spark.read.option("timestampFormat", 
> "-MM-dd'T'HH:mm:ss").option("inferSchema", 
> true).csv("/tmp/timestamps.csv")
> df: org.apache.spark.sql.DataFrame = [_c0: timestamp]
> {code}
> Reading the DataFrame then results in an error:
> {code:java}
> Caused by: java.time.format.DateTimeParseException: Text 
> '2884-06-24T02:45:51.138' could not be parsed, unparsed text found at index 19
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45418) Change CURRENT_SCHEMA() column alias to match function name

2023-10-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-45418:
--

Assignee: Apache Spark

> Change CURRENT_SCHEMA() column alias to match function name
> ---
>
> Key: SPARK-45418
> URL: https://issues.apache.org/jira/browse/SPARK-45418
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SQL
>Affects Versions: 3.5.0
>Reporter: Michael Zhang
>Assignee: Apache Spark
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45418) Change CURRENT_SCHEMA() column alias to match function name

2023-10-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-45418:
--

Assignee: (was: Apache Spark)

> Change CURRENT_SCHEMA() column alias to match function name
> ---
>
> Key: SPARK-45418
> URL: https://issues.apache.org/jira/browse/SPARK-45418
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SQL
>Affects Versions: 3.5.0
>Reporter: Michael Zhang
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45459) Remove the last 2 extra spaces in the automatically generated `sql-error-conditions.md` file

2023-10-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-45459:
--

Assignee: (was: Apache Spark)

> Remove the last 2 extra spaces in the automatically generated 
> `sql-error-conditions.md` file
> 
>
> Key: SPARK-45459
> URL: https://issues.apache.org/jira/browse/SPARK-45459
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL, Tests
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45459) Remove the last 2 extra spaces in the automatically generated `sql-error-conditions.md` file

2023-10-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-45459:
--

Assignee: Apache Spark

> Remove the last 2 extra spaces in the automatically generated 
> `sql-error-conditions.md` file
> 
>
> Key: SPARK-45459
> URL: https://issues.apache.org/jira/browse/SPARK-45459
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL, Tests
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: Apache Spark
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45464) [CORE] Fix yarn distribution build

2023-10-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-45464:
--

Assignee: (was: Apache Spark)

> [CORE] Fix yarn distribution build
> --
>
> Key: SPARK-45464
> URL: https://issues.apache.org/jira/browse/SPARK-45464
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Priority: Major
>  Labels: pull-request-available
>
> [https://github.com/apache/spark/pull/43164] introduced a regression in:
>  
> ```
> ./dev/make-distribution.sh --tgz -Phive -Phive-thriftserver -Pyarn
> ```
>  
> this needs to be fixed



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45464) [CORE] Fix yarn distribution build

2023-10-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-45464:
--

Assignee: Apache Spark

> [CORE] Fix yarn distribution build
> --
>
> Key: SPARK-45464
> URL: https://issues.apache.org/jira/browse/SPARK-45464
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>
> [https://github.com/apache/spark/pull/43164] introduced a regression in:
>  
> ```
> ./dev/make-distribution.sh --tgz -Phive -Phive-thriftserver -Pyarn
> ```
>  
> this needs to be fixed



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45464) [CORE] Fix yarn distribution build

2023-10-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-45464:
--

Assignee: Apache Spark

> [CORE] Fix yarn distribution build
> --
>
> Key: SPARK-45464
> URL: https://issues.apache.org/jira/browse/SPARK-45464
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>
> [https://github.com/apache/spark/pull/43164] introduced a regression in:
>  
> ```
> ./dev/make-distribution.sh --tgz -Phive -Phive-thriftserver -Pyarn
> ```
>  
> this needs to be fixed



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45464) [CORE] Fix yarn distribution build

2023-10-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-45464:
--

Assignee: (was: Apache Spark)

> [CORE] Fix yarn distribution build
> --
>
> Key: SPARK-45464
> URL: https://issues.apache.org/jira/browse/SPARK-45464
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 4.0.0
>Reporter: Hasnain Lakhani
>Priority: Major
>  Labels: pull-request-available
>
> [https://github.com/apache/spark/pull/43164] introduced a regression in:
>  
> ```
> ./dev/make-distribution.sh --tgz -Phive -Phive-thriftserver -Pyarn
> ```
>  
> this needs to be fixed



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45463) Allow ShuffleDriverComponent to support reliable store with specified executorId

2023-10-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-45463:
--

Assignee: Apache Spark

> Allow ShuffleDriverComponent to support reliable store with specified 
> executorId
> 
>
> Key: SPARK-45463
> URL: https://issues.apache.org/jira/browse/SPARK-45463
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0, 3.5.1
>Reporter: zhoubin
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>
> After SPARK-42689, ShuffleDriverComponent.supportsReliableStorage is 
> determined globally.
> Downstream projects may have different shuffle policies for different stages 
> (caused by cluster load or columnar support), for example Apache Uniffle with 
> Gluten, or Apache Celeborn.
> In this situation, ShuffleDriverComponent should use the MapOutputTrackerMaster 
> to decide whether reliable storage is supported for the specified executorId.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45463) Allow ShuffleDriverComponent to support reliable store with specified executorId

2023-10-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-45463:
--

Assignee: (was: Apache Spark)

> Allow ShuffleDriverComponent to support reliable store with specified 
> executorId
> 
>
> Key: SPARK-45463
> URL: https://issues.apache.org/jira/browse/SPARK-45463
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 4.0.0, 3.5.1
>Reporter: zhoubin
>Priority: Major
>  Labels: pull-request-available
>
> After SPARK-42689, ShuffleDriverComponent.supportsReliableStorage is 
> determined globally.
> Downstream projects may have different shuffle policies for different stages 
> (caused by cluster load or columnar support), for example Apache Uniffle with 
> Gluten, or Apache Celeborn.
> In this situation, ShuffleDriverComponent should use the MapOutputTrackerMaster 
> to decide whether reliable storage is supported for the specified executorId.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45424) Regression in CSV schema inference when timestamps do not match specified timestampFormat

2023-10-09 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-45424:


Assignee: Jia Fan

> Regression in CSV schema inference when timestamps do not match specified 
> timestampFormat
> -
>
> Key: SPARK-45424
> URL: https://issues.apache.org/jira/browse/SPARK-45424
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Andy Grove
>Assignee: Jia Fan
>Priority: Major
>  Labels: pull-request-available
>
> There is a regression in Spark 3.5.0 when inferring the schema of CSV files 
> containing timestamps, where a column will be inferred as a timestamp even if 
> the contents do not match the specified timestampFormat.
> *Test Data*
> I have the following CSV file:
> {code:java}
> 2884-06-24T02:45:51.138
> 2884-06-24T02:45:51.138
> 2884-06-24T02:45:51.138
> {code}
> *Spark 3.4.0 Behavior (correct)*
> In Spark 3.4.0, if I specify the correct timestamp format, then the schema is 
> inferred as timestamp:
> {code:java}
> scala> val df = spark.read.option("timestampFormat", 
> "-MM-dd'T'HH:mm:ss.SSS").option("inferSchema", 
> true).csv("/tmp/timestamps.csv")
> df: org.apache.spark.sql.DataFrame = [_c0: timestamp]
> {code}
> If I specify an incompatible timestampFormat, then the schema is inferred as 
> string:
> {code:java}
> scala> val df = spark.read.option("timestampFormat", 
> "-MM-dd'T'HH:mm:ss").option("inferSchema", 
> true).csv("/tmp/timestamps.csv")
> df: org.apache.spark.sql.DataFrame = [_c0: string]
> {code}
> *Spark 3.5.0*
> In Spark 3.5.0, the column will be inferred as timestamp even if the data 
> does not match the specified timestampFormat.
> {code:java}
> scala> val df = spark.read.option("timestampFormat", 
> "-MM-dd'T'HH:mm:ss").option("inferSchema", 
> true).csv("/tmp/timestamps.csv")
> df: org.apache.spark.sql.DataFrame = [_c0: timestamp]
> {code}
> Reading the DataFrame then results in an error:
> {code:java}
> Caused by: java.time.format.DateTimeParseException: Text 
> '2884-06-24T02:45:51.138' could not be parsed, unparsed text found at index 19
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45424) Regression in CSV schema inference when timestamps do not match specified timestampFormat

2023-10-09 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-45424.
--
Fix Version/s: 3.5.1
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 43245
[https://github.com/apache/spark/pull/43245]

> Regression in CSV schema inference when timestamps do not match specified 
> timestampFormat
> -
>
> Key: SPARK-45424
> URL: https://issues.apache.org/jira/browse/SPARK-45424
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Andy Grove
>Assignee: Jia Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.1, 4.0.0
>
>
> There is a regression in Spark 3.5.0 when inferring the schema of CSV files 
> containing timestamps, where a column will be inferred as a timestamp even if 
> the contents do not match the specified timestampFormat.
> *Test Data*
> I have the following CSV file:
> {code:java}
> 2884-06-24T02:45:51.138
> 2884-06-24T02:45:51.138
> 2884-06-24T02:45:51.138
> {code}
> *Spark 3.4.0 Behavior (correct)*
> In Spark 3.4.0, if I specify the correct timestamp format, then the schema is 
> inferred as timestamp:
> {code:java}
> scala> val df = spark.read.option("timestampFormat", 
> "-MM-dd'T'HH:mm:ss.SSS").option("inferSchema", 
> true).csv("/tmp/timestamps.csv")
> df: org.apache.spark.sql.DataFrame = [_c0: timestamp]
> {code}
> If I specify an incompatible timestampFormat, then the schema is inferred as 
> string:
> {code:java}
> scala> val df = spark.read.option("timestampFormat", 
> "-MM-dd'T'HH:mm:ss").option("inferSchema", 
> true).csv("/tmp/timestamps.csv")
> df: org.apache.spark.sql.DataFrame = [_c0: string]
> {code}
> *Spark 3.5.0*
> In Spark 3.5.0, the column will be inferred as timestamp even if the data 
> does not match the specified timestampFormat.
> {code:java}
> scala> val df = spark.read.option("timestampFormat", 
> "-MM-dd'T'HH:mm:ss").option("inferSchema", 
> true).csv("/tmp/timestamps.csv")
> df: org.apache.spark.sql.DataFrame = [_c0: timestamp]
> {code}
> Reading the DataFrame then results in an error:
> {code:java}
> Caused by: java.time.format.DateTimeParseException: Text 
> '2884-06-24T02:45:51.138' could not be parsed, unparsed text found at index 19
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45459) Remove the last 2 extra spaces in the automatically generated `sql-error-conditions.md` file

2023-10-09 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-45459:


Assignee: BingKun Pan

> Remove the last 2 extra spaces in the automatically generated 
> `sql-error-conditions.md` file
> 
>
> Key: SPARK-45459
> URL: https://issues.apache.org/jira/browse/SPARK-45459
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL, Tests
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45459) Remove the last 2 extra spaces in the automatically generated `sql-error-conditions.md` file

2023-10-09 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-45459.
--
Fix Version/s: 3.5.1
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 43274
[https://github.com/apache/spark/pull/43274]

> Remove the last 2 extra spaces in the automatically generated 
> `sql-error-conditions.md` file
> 
>
> Key: SPARK-45459
> URL: https://issues.apache.org/jira/browse/SPARK-45459
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL, Tests
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 3.5.1, 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45200) Spark 3.4.0 always using default log4j profile

2023-10-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45200:
---
Labels: pull-request-available  (was: )

> Spark 3.4.0 always using default log4j profile
> --
>
> Key: SPARK-45200
> URL: https://issues.apache.org/jira/browse/SPARK-45200
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Jitin Dominic
>Priority: Major
>  Labels: pull-request-available
>
> I've been using Spark Core 3.2.2 and was upgrading to 3.4.0.
> When executing my Java code with 3.4.0, it generates an extra set of logs; I 
> don't face this issue with 3.2.2.
>  
> I noticed that the log says _Using Spark's default log4j profile: 
> org/apache/spark/log4j2-defaults.properties._
>  
> Is this a bug, or is there a configuration to disable the use of the default 
> log4j profile?
> I didn't see anything in the documentation.
>  
> +*Output:*+
> {code:java}
> Using Spark's default log4j profile: 
> org/apache/spark/log4j2-defaults.properties
> 23/09/18 20:05:08 INFO SparkContext: Running Spark version 3.4.0
> 23/09/18 20:05:08 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 23/09/18 20:05:08 INFO ResourceUtils: 
> ==
> 23/09/18 20:05:08 INFO ResourceUtils: No custom resources configured for 
> spark.driver.
> 23/09/18 20:05:08 INFO ResourceUtils: 
> ==
> 23/09/18 20:05:08 INFO SparkContext: Submitted application: XYZ
> 23/09/18 20:05:08 INFO ResourceProfile: Default ResourceProfile created, 
> executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , 
> memory -> name: memory, amount: 1024, script: , vendor: , offHeap -> name: 
> offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: 
> cpus, amount: 1.0)
> 23/09/18 20:05:08 INFO ResourceProfile: Limiting resource is cpu
> 23/09/18 20:05:08 INFO ResourceProfileManager: Added ResourceProfile id: 0
> 23/09/18 20:05:08 INFO SecurityManager: Changing view acls to: jd
> 23/09/18 20:05:08 INFO SecurityManager: Changing modify acls to: jd
> 23/09/18 20:05:08 INFO SecurityManager: Changing view acls groups to: 
> 23/09/18 20:05:08 INFO SecurityManager: Changing modify acls groups to: 
> 23/09/18 20:05:08 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: jd; groups with view 
> permissions: EMPTY; users with modify permissions: jd; groups with modify 
> permissions: EMPTY
> 23/09/18 20:05:08 INFO Utils: Successfully started service 'sparkDriver' on 
> port 39155.
> 23/09/18 20:05:08 INFO SparkEnv: Registering MapOutputTracker
> 23/09/18 20:05:08 INFO SparkEnv: Registering BlockManagerMaster
> 23/09/18 20:05:08 INFO BlockManagerMasterEndpoint: Using 
> org.apache.spark.storage.DefaultTopologyMapper for getting topology 
> information
> 23/09/18 20:05:08 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint 
> up
> 23/09/18 20:05:08 INFO SparkEnv: Registering BlockManagerMasterHeartbeat
> 23/09/18 20:05:08 INFO DiskBlockManager: Created local directory at 
> /tmp/blockmgr-226d599c-1511-4fae-b0e7-aae81b684012
> 23/09/18 20:05:08 INFO MemoryStore: MemoryStore started with capacity 2004.6 
> MiB
> 23/09/18 20:05:08 INFO SparkEnv: Registering OutputCommitCoordinator
> 23/09/18 20:05:08 INFO JettyUtils: Start Jetty 0.0.0.0:4040 for SparkUI
> 23/09/18 20:05:09 INFO Utils: Successfully started service 'SparkUI' on port 
> 4040.
> 23/09/18 20:05:09 INFO Executor: Starting executor ID driver on host jd
> 23/09/18 20:05:09 INFO Executor: Starting executor with user classpath 
> (userClassPathFirst = false): ''
> 23/09/18 20:05:09 INFO Utils: Successfully started service 
> 'org.apache.spark.network.netty.NettyBlockTransferService' on port 32819.
> 23/09/18 20:05:09 INFO NettyBlockTransferService: Server created on jd:32819
> 23/09/18 20:05:09 INFO BlockManager: Using 
> org.apache.spark.storage.RandomBlockReplicationPolicy for block replication 
> policy
> 23/09/18 20:05:09 INFO BlockManagerMaster: Registering BlockManager 
> BlockManagerId(driver, jd, 32819, None)
> 23/09/18 20:05:09 INFO BlockManagerMasterEndpoint: Registering block manager 
> jd:32819 with 2004.6 MiB RAM, BlockManagerId(driver, jd, 32819, None)
> 23/09/18 20:05:09 INFO BlockManagerMaster: Registered BlockManager 
> BlockManagerId(driver, jd, 32819, None)
> 23/09/18 20:05:09 INFO BlockManager: Initialized BlockManager: 
> BlockManagerId(driver, jd, 32819, None)
> {code}
>  
>  
>  
> I'm using Spark core dependency in one of my Jars, the jar code contains the 
> following:
>  
> *+Code:+*
> {code:java}
> S

[jira] [Commented] (SPARK-44988) Parquet INT64 (TIMESTAMP(NANOS,false)) throwing Illegal Parquet type

2023-10-09 Thread Miles Granger (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17773215#comment-17773215
 ] 

Miles Granger commented on SPARK-44988:
---

[~fanjia] that "worked" for me, but then of course I need to cast the resulting 
bigint to a timestamp, which I feel is error-prone. It would be nice if Spark 
supported timestamp[ns], though.
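
A rough sketch of the cast being described, assuming a DataFrame {{df}} read from the
Parquet file and a bigint column of nanoseconds since the epoch (column names here
are illustrative):

{code:java}
import org.apache.spark.sql.functions.col

// bigint nanoseconds-since-epoch -> timestamp: casting a numeric to timestamp
// interprets it as (fractional) seconds, so divide by 1e9 first; note that
// sub-microsecond precision is lost.
val withTs = df.withColumn("event_ts", (col("event_ns") / 1e9).cast("timestamp"))
{code}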

> Parquet INT64 (TIMESTAMP(NANOS,false)) throwing Illegal Parquet type
> 
>
> Key: SPARK-44988
> URL: https://issues.apache.org/jira/browse/SPARK-44988
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0, 3.4.1
>Reporter: Flavio Odas
>Priority: Critical
>
> This bug seems similar to https://issues.apache.org/jira/browse/SPARK-40819, 
> except that it's a problem with INT64 (TIMESTAMP(NANOS,false)), instead of 
> INT64 (TIMESTAMP(NANOS,true)).
> The error happens whenever I'm trying to read:
> {code:java}
> org.apache.spark.sql.AnalysisException: Illegal Parquet type: INT64 
> (TIMESTAMP(NANOS,false)).
>   at 
> org.apache.spark.sql.errors.QueryCompilationErrors$.illegalParquetTypeError(QueryCompilationErrors.scala:1762)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.illegalType$1(ParquetSchemaConverter.scala:206)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.$anonfun$convertPrimitiveField$2(ParquetSchemaConverter.scala:283)
>   at scala.Option.getOrElse(Option.scala:189)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convertPrimitiveField(ParquetSchemaConverter.scala:224)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convertField(ParquetSchemaConverter.scala:187)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.$anonfun$convertInternal$3(ParquetSchemaConverter.scala:147)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.$anonfun$convertInternal$3$adapted(ParquetSchemaConverter.scala:117)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
>   at scala.collection.immutable.Range.foreach(Range.scala:158)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:286)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convertInternal(ParquetSchemaConverter.scala:117)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convert(ParquetSchemaConverter.scala:87)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$readSchemaFromFooter$2(ParquetFileFormat.scala:493)
>   at scala.Option.getOrElse(Option.scala:189)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.readSchemaFromFooter(ParquetFileFormat.scala:493)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$mergeSchemasInParallel$2(ParquetFileFormat.scala:473)
>   at scala.collection.immutable.Stream.map(Stream.scala:418)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$mergeSchemasInParallel$1(ParquetFileFormat.scala:473)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$mergeSchemasInParallel$1$adapted(ParquetFileFormat.scala:464)
>   at 
> org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.$anonfun$mergeSchemasInParallel$2(SchemaMergeUtils.scala:79)
>   at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:853)
>   at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2$adapted(RDD.scala:853)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:364)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:328)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:92)
>   at 
> org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
>   at org.apache.spark.scheduler.Task.run(Task.scala:139)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:554)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1529)
>   at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:557) {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-45470) Avoid paste string value of hive orc compression kind

2023-10-09 Thread Jiaan Geng (Jira)
Jiaan Geng created SPARK-45470:
--

 Summary: Avoid paste string value of hive orc compression kind
 Key: SPARK-45470
 URL: https://issues.apache.org/jira/browse/SPARK-45470
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Jiaan Geng


Currently, Hive supports the ORC format with several compression codecs (please 
refer to org.apache.hadoop.hive.ql.io.orc.CompressionKind).
Spark hard-codes the string names of these compression codecs in many places.
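
A minimal sketch of the idea, illustrative only (the exact call sites in Spark may
differ): derive the codec names from the Hive enum instead of pasting string literals.

{code:java}
import org.apache.hadoop.hive.ql.io.orc.CompressionKind

// names such as "NONE", "ZLIB", "SNAPPY", "LZO" come from the enum itself
val orcCodecNames: Seq[String] = CompressionKind.values().map(_.name()).toSeq
{code}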



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45470) Avoid paste string value of hive orc compression kind

2023-10-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45470:
---
Labels: pull-request-available  (was: )

> Avoid paste string value of hive orc compression kind
> -
>
> Key: SPARK-45470
> URL: https://issues.apache.org/jira/browse/SPARK-45470
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Priority: Major
>  Labels: pull-request-available
>
> Currently, Hive supports the ORC format with several compression codecs (please 
> refer to org.apache.hadoop.hive.ql.io.orc.CompressionKind).
> Spark hard-codes the string names of these compression codecs in many places.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42034) QueryExecutionListener and Observation API, df.observe do not work with `foreach` action.

2023-10-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-42034:
---
Labels: pull-request-available sql-api  (was: sql-api)

> QueryExecutionListener and Observation API, df.observe do not work with 
> `foreach` action.
> -
>
> Key: SPARK-42034
> URL: https://issues.apache.org/jira/browse/SPARK-42034
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.3, 3.2.2, 3.3.1
> Environment: I test it locally and on YARN in cluster mode.
> Spark 3.3.1 and 3.2.2 and 3.1.1.
> Yarn 2.9.2 and 3.2.1.
>Reporter: Nick Hryhoriev
>Assignee: ming95
>Priority: Major
>  Labels: pull-request-available, sql-api
> Fix For: 3.5.0
>
>
> The Observation API, the {{observe}} DataFrame transformation, and custom 
> QueryExecutionListeners
> do not work with {{foreach}} or {{foreachPartition}} actions.
> This is because QueryExecutionListener functions do not trigger on queries 
> whose action is {{foreach}} or {{foreachPartition}}.
> However, the Spark UI SQL tab sees such a query as a SQL query and shows its 
> query plans, etc.
> Here is the code to reproduce it:
> https://gist.github.com/GrigorievNick/e7cf9ec5584b417d9719e2812722e6d3
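
A minimal reconstruction of the reported pattern (simplified, not the linked gist;
assumes the Observation API available since Spark 3.3):

{code:java}
import org.apache.spark.sql.Observation
import org.apache.spark.sql.functions._

val obs = Observation("row_count")
val observed = spark.range(100).toDF("id").observe(obs, count(lit(1)).as("rows"))

observed.foreach(_ => ())  // per this report, the observation/listener is not triggered
// observed.collect()      // a collect-style action does populate obs.get
{code}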



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45471) Allow specifying ResourceProfile in Dataset API

2023-10-09 Thread Keio Kraaner (Jira)
Keio Kraaner created SPARK-45471:


 Summary: Allow specifying ResourceProfile in Dataset API
 Key: SPARK-45471
 URL: https://issues.apache.org/jira/browse/SPARK-45471
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.4.2
Reporter: Keio Kraaner






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42132) DeduplicateRelations rule breaks plan when co-grouping the same DataFrame

2023-10-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-42132:
---
Labels: correctness pull-request-available  (was: correctness)

> DeduplicateRelations rule breaks plan when co-grouping the same DataFrame
> -
>
> Key: SPARK-42132
> URL: https://issues.apache.org/jira/browse/SPARK-42132
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.3, 3.3.0, 3.3.1, 3.2.3, 3.4.0, 3.5.0
>Reporter: Enrico Minack
>Priority: Major
>  Labels: correctness, pull-request-available
> Fix For: 3.5.0
>
>
> Co-grouping two DataFrames that share references breaks on the 
> DeduplicateRelations rule:
> {code:java}
> val df = spark.range(3)
> val left_grouped_df = df.groupBy("id").as[Long, Long]
> val right_grouped_df = df.groupBy("id").as[Long, Long]
> val cogroup_df = left_grouped_df.cogroup(right_grouped_df) {
>   case (key, left, right) => left
> }
> cogroup_df.explain()
> {code}
> {code:java}
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- SerializeFromObject [input[0, bigint, false] AS value#12L]
>+- CoGroup, id#0: bigint, id#0: bigint, id#0: bigint, [id#13L], [id#13L], 
> [id#13L], [id#13L], obj#11: bigint
>   :- !Sort [id#13L ASC NULLS FIRST], false, 0
>   :  +- !Exchange hashpartitioning(id#13L, 200), ENSURE_REQUIREMENTS, 
> [plan_id=16]
>   : +- Range (0, 3, step=1, splits=16)
>   +- Sort [id#13L ASC NULLS FIRST], false, 0
>  +- Exchange hashpartitioning(id#13L, 200), ENSURE_REQUIREMENTS, 
> [plan_id=17]
> +- Range (0, 3, step=1, splits=16)
> {code}
> The DataFrame cannot be computed:
> {code:java}
> cogroup_df.show()
> {code}
> {code:java}
> java.lang.IllegalStateException: Couldn't find id#13L in [id#0L]
> {code}
> The rule replaces `id#0L` on the right side with `id#13L` while replacing all 
> occurrences in `CoGroup`. Some occurrences of `id#0L` in `CoGroup` refer to 
> the left side and should not be replaced. Further, `id#0L` of the right 
> deserializer is not replaced.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45468) More transparent proxy handling for HTTP redirects

2023-10-09 Thread Nobuaki Sukegawa (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nobuaki Sukegawa updated SPARK-45468:
-
Summary: More transparent proxy handling for HTTP redirects  (was: Add 
option to use path without hostname for redirects)

> More transparent proxy handling for HTTP redirects
> --
>
> Key: SPARK-45468
> URL: https://issues.apache.org/jira/browse/SPARK-45468
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.5.0
>Reporter: Nobuaki Sukegawa
>Priority: Major
>
> Currently, proxies can be made transparent for hyperlinks in Spark web UIs 
> with spark.ui.proxyRoot or X-Forwarded-Context header. However, HTTP 
> redirects (such as job/stage kill) currently requires explicit 
> spark.ui.proxyRedirectUri for handling proxy. This is not ideal as proxy 
> hostname may not be known at the time configuring Spark apps.
> This can be mitigated by using path without hostname (/jobs/, not 
> https://example.com/jobs/). Then redirects behavior would be basically the 
> same way as other hyperlinks.
> While hostname was originally required in RFC 2616 in 1999, since RFC 7231 in 
> 2014 hostname can be formally omitted as most browsers already supported it 
> (it is rather hard to find any browser that doesn't support it).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45468) More transparent proxy handling for HTTP redirects

2023-10-09 Thread Nobuaki Sukegawa (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nobuaki Sukegawa updated SPARK-45468:
-
Description: 
Currently, proxies can be made transparent for hyperlinks in Spark web UIs with 
spark.ui.proxyRoot or the X-Forwarded-Context header alone. However, HTTP 
redirects (such as job/stage kill) currently require an explicit 
spark.ui.proxyRedirectUri as well for proxy handling. This is not ideal, as the 
proxy hostname may not be known at the time of configuring Spark apps.

This can be mitigated by 1) always prepending spark.ui.proxyRoot to the redirect 
path for proxies that intelligently rewrite Location headers, and 2) using a path 
without a hostname (/jobs/, not https://example.com/jobs/) for proxies without 
Location header rewrites. Then redirect behavior would be basically the same as 
for other hyperlinks.

Regarding 2), while the hostname was originally required by RFC 2616 in 1999, 
since RFC 7231 in 2014 the hostname can formally be omitted, as most browsers 
already supported it (it is rather hard to find any browser that doesn't).

  was:
Currently, proxies can be made transparent for hyperlinks in Spark web UIs with 
spark.ui.proxyRoot or X-Forwarded-Context header. However, HTTP redirects (such 
as job/stage kill) currently requires explicit spark.ui.proxyRedirectUri for 
handling proxy. This is not ideal as proxy hostname may not be known at the 
time configuring Spark apps.

This can be mitigated by using path without hostname (/jobs/, not 
https://example.com/jobs/). Then redirects behavior would be basically the same 
way as other hyperlinks.

While hostname was originally required in RFC 2616 in 1999, since RFC 7231 in 
2014 hostname can be formally omitted as most browsers already supported it (it 
is rather hard to find any browser that doesn't support it).


> More transparent proxy handling for HTTP redirects
> --
>
> Key: SPARK-45468
> URL: https://issues.apache.org/jira/browse/SPARK-45468
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.5.0
>Reporter: Nobuaki Sukegawa
>Priority: Major
>
> Currently, proxies can be made transparent for hyperlinks in Spark web UIs 
> with spark.ui.proxyRoot or the X-Forwarded-Context header alone. However, HTTP 
> redirects (such as job/stage kill) currently require an explicit 
> spark.ui.proxyRedirectUri as well for proxy handling. This is not ideal, as 
> the proxy hostname may not be known at the time of configuring Spark apps.
> This can be mitigated by 1) always prepending spark.ui.proxyRoot to the 
> redirect path for proxies that intelligently rewrite Location headers, and 2) 
> using a path without a hostname (/jobs/, not https://example.com/jobs/) for 
> proxies without Location header rewrites. Then redirect behavior would be 
> basically the same as for other hyperlinks.
> Regarding 2), while the hostname was originally required by RFC 2616 in 1999, 
> since RFC 7231 in 2014 the hostname can formally be omitted, as most browsers 
> already supported it (it is rather hard to find any browser that doesn't).
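
A small Scala sketch of the redirect-target construction described above (illustrative
only, not the actual Spark UI code):

{code:java}
// Build a host-less, proxy-root-prefixed redirect target, e.g. "/proxy/app-123/jobs/".
// 1) prepending the proxy root keeps header-rewriting proxies working;
// 2) omitting the scheme/host (allowed since RFC 7231) keeps non-rewriting proxies working.
def redirectLocation(proxyRoot: String, path: String): String = {
  val root = proxyRoot.stripSuffix("/")
  s"$root/${path.stripPrefix("/")}"
}

redirectLocation("/proxy/app-123", "/jobs/")  // "/proxy/app-123/jobs/"
{code}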



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45468) More transparent proxy handling for HTTP redirects

2023-10-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45468:
---
Labels: pull-request-available  (was: )

> More transparent proxy handling for HTTP redirects
> --
>
> Key: SPARK-45468
> URL: https://issues.apache.org/jira/browse/SPARK-45468
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.5.0
>Reporter: Nobuaki Sukegawa
>Priority: Major
>  Labels: pull-request-available
>
> Currently, proxies can be made transparent for hyperlinks in Spark web UIs 
> with spark.ui.proxyRoot or X-Forwarded-Context header alone. However, HTTP 
> redirects (such as job/stage kill) currently requires explicit 
> spark.ui.proxyRedirectUri as well for handling proxy. This is not ideal as 
> proxy hostname may not be known at the time configuring Spark apps.
> This can be mitigated by 1) always prepending spark.ui.proxyRoot to redirect 
> path for those proxies that intelligently rewrite Location headers and 2) by 
> using path without hostname (/jobs/, not https://example.com/jobs/) for those 
> proxies without Location header rewrites. Then redirects behavior would be 
> basically the same way as other hyperlinks.
> Regarding 2), while hostname was originally required in RFC 2616 in 1999, 
> since RFC 7231 in 2014 hostname can be formally omitted as most browsers 
> already supported it (it is rather hard to find any browser that doesn't 
> support it).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-45201) NoClassDefFoundError: InternalFutureFailureAccess when compiling Spark 3.5.0

2023-10-09 Thread Nobuaki Sukegawa (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17773360#comment-17773360
 ] 

Nobuaki Sukegawa commented on SPARK-45201:
--

We've experienced the same issue and resolved it in the same way (by removing 
the connect common JAR and applying the patch).

For some reason the issue did not always reproduce. Using the same container 
image on the same Kubernetes cluster, it seems to happen only on certain nodes. 
I suspect this is because the Spark classpath uses a wildcard, which the JVM 
probably expands to the actual file paths via a system call that gives no 
guaranteed ordering of the results (just a guess from the behavior).

> NoClassDefFoundError: InternalFutureFailureAccess when compiling Spark 3.5.0
> 
>
> Key: SPARK-45201
> URL: https://issues.apache.org/jira/browse/SPARK-45201
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Sebastian Daberdaku
>Priority: Major
> Attachments: Dockerfile, spark-3.5.0.patch
>
>
> I am trying to compile Spark 3.5.0 and make a distribution that supports 
> Spark Connect and Kubernetes. The compilation seems to complete correctly, 
> but when I try to run the Spark Connect server on kubernetes I get a 
> "NoClassDefFoundError" as follows:
> {code:java}
> Exception in thread "main" java.lang.NoClassDefFoundError: 
> org/sparkproject/guava/util/concurrent/internal/InternalFutureFailureAccess
>     at java.base/java.lang.ClassLoader.defineClass1(Native Method)
>     at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1017)
>     at 
> java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:862)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:760)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:681)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:639)
>     at 
> java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)
>     at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525)
>     at java.base/java.lang.ClassLoader.defineClass1(Native Method)
>     at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1017)
>     at 
> java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:862)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:760)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:681)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:639)
>     at 
> java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)
>     at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525)
>     at java.base/java.lang.ClassLoader.defineClass1(Native Method)
>     at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1017)
>     at 
> java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:862)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:760)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:681)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:639)
>     at 
> java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)
>     at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525)
>     at 
> org.sparkproject.guava.cache.LocalCache$LoadingValueReference.(LocalCache.java:3511)
>     at 
> org.sparkproject.guava.cache.LocalCache$LoadingValueReference.(LocalCache.java:3515)
>     at 
> org.sparkproject.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2168)
>     at 
> org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2079)
>     at org.sparkproject.guava.cache.LocalCache.get(LocalCache.java:4011)
>     at org.sparkproject.guava.cache.LocalCache.getOrLoad(LocalCache.java:4034)
>     at 
> org.sparkproject.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:5010)
>     at 
> org.apache.spark.storage.BlockManagerId$.getCachedBlockManagerId(BlockManagerId.scala:146)
>     at 
> org.apache.spark.storage.B

[jira] [Updated] (SPARK-45383) Missing case for RelationTimeTravel in CheckAnalysis

2023-10-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45383:
---
Labels: pull-request-available  (was: )

> Missing case for RelationTimeTravel in CheckAnalysis
> 
>
> Key: SPARK-45383
> URL: https://issues.apache.org/jira/browse/SPARK-45383
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Ryan Johnson
>Priority: Major
>  Labels: pull-request-available
>
> {{CheckAnalysis.checkAnalysis0}} lacks a case for {{{}RelationTimeTravel{}}}, 
> and since the latter is (intentionally) an {{UnresolvedLeafNode}} rather than 
> a {{{}UnaryNode{}}}, the existing checks do not traverse it.
> Result: Attempting time travel over a non-existing table produces a spark 
> internal error from the [default 
> case|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala#L818],
>  rather than the expected {{{}AnalysisException{}}}:
> {code:java}
> [info]   Cause: org.apache.spark.SparkException: [INTERNAL_ERROR] Found the 
> unresolved operator: 'RelationTimeTravel 'UnresolvedRelation [not_exists], 
> [], false, 0
> [info]   at 
> org.apache.spark.SparkException$.internalError(SparkException.scala:77)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$54(CheckAnalysis.scala:753)
>  {code}
> Fix should be simple enough:
> {code:java}
> case tt: RelationTimeTravel =>
>   checkAnalysis0(tt.table) {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-45201) NoClassDefFoundError: InternalFutureFailureAccess when compiling Spark 3.5.0

2023-10-09 Thread Sebastian Daberdaku (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17773397#comment-17773397
 ] 

Sebastian Daberdaku commented on SPARK-45201:
-

Hello [~nsuke], 
I am happy my patch was useful to you!
The JVM class loader inconsistency seems to be a very plausible cause; I 
experienced something similar with my Docker image working locally (with Docker) 
but not on my EKS cluster (to be clear, it was the same image in both cases).

> NoClassDefFoundError: InternalFutureFailureAccess when compiling Spark 3.5.0
> 
>
> Key: SPARK-45201
> URL: https://issues.apache.org/jira/browse/SPARK-45201
> Project: Spark
>  Issue Type: Bug
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Sebastian Daberdaku
>Priority: Major
> Attachments: Dockerfile, spark-3.5.0.patch
>
>
> I am trying to compile Spark 3.5.0 and make a distribution that supports 
> Spark Connect and Kubernetes. The compilation seems to complete correctly, 
> but when I try to run the Spark Connect server on kubernetes I get a 
> "NoClassDefFoundError" as follows:
> {code:java}
> Exception in thread "main" java.lang.NoClassDefFoundError: 
> org/sparkproject/guava/util/concurrent/internal/InternalFutureFailureAccess
>     at java.base/java.lang.ClassLoader.defineClass1(Native Method)
>     at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1017)
>     at 
> java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:862)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:760)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:681)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:639)
>     at 
> java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)
>     at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525)
>     at java.base/java.lang.ClassLoader.defineClass1(Native Method)
>     at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1017)
>     at 
> java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:862)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:760)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:681)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:639)
>     at 
> java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)
>     at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525)
>     at java.base/java.lang.ClassLoader.defineClass1(Native Method)
>     at java.base/java.lang.ClassLoader.defineClass(ClassLoader.java:1017)
>     at 
> java.base/java.security.SecureClassLoader.defineClass(SecureClassLoader.java:150)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.defineClass(BuiltinClassLoader.java:862)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.findClassOnClassPathOrNull(BuiltinClassLoader.java:760)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.loadClassOrNull(BuiltinClassLoader.java:681)
>     at 
> java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:639)
>     at 
> java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)
>     at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525)
>     at 
> org.sparkproject.guava.cache.LocalCache$LoadingValueReference.(LocalCache.java:3511)
>     at 
> org.sparkproject.guava.cache.LocalCache$LoadingValueReference.(LocalCache.java:3515)
>     at 
> org.sparkproject.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2168)
>     at 
> org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2079)
>     at org.sparkproject.guava.cache.LocalCache.get(LocalCache.java:4011)
>     at org.sparkproject.guava.cache.LocalCache.getOrLoad(LocalCache.java:4034)
>     at 
> org.sparkproject.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:5010)
>     at 
> org.apache.spark.storage.BlockManagerId$.getCachedBlockManagerId(BlockManagerId.scala:146)
>     at 
> org.apache.spark.storage.BlockManagerId$.apply(BlockManagerId.scala:127)
>     at 
> org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:536)
>     at org.apache.spark.SparkContext.(SparkContext.scala:625)
>     at org.apache.spark.SparkContext$

[jira] [Created] (SPARK-45472) RocksDB State Store Doesn't Need to Recheck checkpoint path existence

2023-10-09 Thread Siying Dong (Jira)
Siying Dong created SPARK-45472:
---

 Summary: RocksDB State Store Doesn't Need to Recheck checkpoint 
path existence
 Key: SPARK-45472
 URL: https://issues.apache.org/jira/browse/SPARK-45472
 Project: Spark
  Issue Type: Task
  Components: Structured Streaming
Affects Versions: 3.5.1
Reporter: Siying Dong


Right now, every time RocksDB.load() is called, we check whether the checkpoint 
directory exists and create it if it does not. This is relatively expensive and 
shows up in performance profiling. We don't need to repeat it after the first load.
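
A minimal sketch of the idea, assuming a stand-in `checkpointDir` path (this is 
not the actual state store provider code): cache the outcome of the first check 
so later load() calls skip the filesystem round trip.

{code:java}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Illustrative only: remember that the checkpoint directory was already ensured,
// so only the first load pays for the exists()/mkdirs() calls.
class CheckpointDirEnsurer(checkpointDir: Path, hadoopConf: Configuration) {
  @volatile private var ensured = false

  def ensureExists(): Unit = {
    if (!ensured) {
      val fs: FileSystem = checkpointDir.getFileSystem(hadoopConf)
      if (!fs.exists(checkpointDir)) fs.mkdirs(checkpointDir)
      ensured = true
    }
  }
}
{code}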



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45472) RocksDB State Store Doesn't Need to Recheck checkpoint path existence

2023-10-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45472:
---
Labels: pull-request-available  (was: )

> RocksDB State Store Doesn't Need to Recheck checkpoint path existence
> -
>
> Key: SPARK-45472
> URL: https://issues.apache.org/jira/browse/SPARK-45472
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 3.5.1
>Reporter: Siying Dong
>Priority: Minor
>  Labels: pull-request-available
>
> Right now, every time RocksDB.load() is called, we check checkpoint directory 
> existence and create it if not. This is relatively expensive and show up in 
> performance profiling. We don't need to do it the second time.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44919) Avro connector: convert a union of a single primitive type to a StructType

2023-10-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-44919:
---
Labels: pull-request-available  (was: )

> Avro connector: convert a union of a single primitive type to a StructType
> --
>
> Key: SPARK-44919
> URL: https://issues.apache.org/jira/browse/SPARK-44919
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.1
>Reporter: Tianhan Hu
>Priority: Major
>  Labels: pull-request-available
>
> The Spark Avro data source schema converter currently converts a union with a 
> single primitive type to a Spark primitive type instead of a StructType, 
> while for more complex union types consisting of multiple primitive types, 
> the schema converter translates them into StructTypes.
> For example, 
> import scala.collection.JavaConverters._
> import org.apache.avro._
> import org.apache.spark.sql.avro._
> // ["string", "null"]
> SchemaConverters.toSqlType(
>   Schema.createUnion(Seq(Schema.create(Schema.Type.STRING), 
> Schema.create(Schema.Type.NULL)).asJava)
> ).dataType
> // ["string", "int", "null"]
> SchemaConverters.toSqlType(
>   Schema.createUnion(Seq(Schema.create(Schema.Type.STRING), 
> Schema.create(Schema.Type.INT), Schema.create(Schema.Type.NULL)).asJava)
> ).dataType
> The first one would return StringType, the second would return 
> StructType(StringType, IntegerType).
>  
> We hope to add a new configuration to control the conversion behavior. The 
> default behavior would stay the same; when the config is set, a union 
> with a single primitive type would be translated into a StructType.
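
For illustration only, the proposed conversion for the single-type union above 
would look roughly like this; the field name and the idea of gating it behind an 
option are assumptions about the proposal, not existing Spark behavior:

{code:java}
import org.apache.spark.sql.types._

// Current behavior: ["string", "null"] maps to a bare (nullable) StringType.
val currentResult: DataType = StringType

// Proposed behavior with the new (hypothetical) config enabled: wrap the single
// member in a struct, mirroring how multi-type unions are converted. The field
// name "member0" is an assumption, not a documented guarantee.
val proposedResult: DataType =
  StructType(Seq(StructField("member0", StringType, nullable = true)))
{code}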



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45383) Missing case for RelationTimeTravel in CheckAnalysis

2023-10-09 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-45383.
--
Fix Version/s: 3.5.1
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 43298
[https://github.com/apache/spark/pull/43298]

> Missing case for RelationTimeTravel in CheckAnalysis
> 
>
> Key: SPARK-45383
> URL: https://issues.apache.org/jira/browse/SPARK-45383
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Ryan Johnson
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.1, 4.0.0
>
>
> {{CheckAnalysis.checkAnalysis0}} lacks a case for {{{}RelationTimeTravel{}}}, 
> and since the latter is (intentionally) an {{UnresolvedLeafNode}} rather than 
> a {{{}UnaryNode{}}}, the existing checks do not traverse it.
> Result: Attempting time travel over a non-existing table produces a spark 
> internal error from the [default 
> case|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala#L818],
>  rather than the expected {{{}AnalysisException{}}}:
> {code:java}
> [info]   Cause: org.apache.spark.SparkException: [INTERNAL_ERROR] Found the 
> unresolved operator: 'RelationTimeTravel 'UnresolvedRelation [not_exists], 
> [], false, 0
> [info]   at 
> org.apache.spark.SparkException$.internalError(SparkException.scala:77)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$54(CheckAnalysis.scala:753)
>  {code}
> Fix should be simple enough:
> {code:java}
> case tt: RelationTimeTravel =>
>   checkAnalysis0(tt.table) {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45383) Missing case for RelationTimeTravel in CheckAnalysis

2023-10-09 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-45383:


Assignee: Wenchen Fan

> Missing case for RelationTimeTravel in CheckAnalysis
> 
>
> Key: SPARK-45383
> URL: https://issues.apache.org/jira/browse/SPARK-45383
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Ryan Johnson
>Assignee: Wenchen Fan
>Priority: Major
>  Labels: pull-request-available
>
> {{CheckAnalysis.checkAnalysis0}} lacks a case for {{{}RelationTimeTravel{}}}, 
> and since the latter is (intentionally) an {{UnresolvedLeafNode}} rather than 
> a {{{}UnaryNode{}}}, the existing checks do not traverse it.
> Result: Attempting time travel over a non-existing table produces a spark 
> internal error from the [default 
> case|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala#L818],
>  rather than the expected {{{}AnalysisException{}}}:
> {code:java}
> [info]   Cause: org.apache.spark.SparkException: [INTERNAL_ERROR] Found the 
> unresolved operator: 'RelationTimeTravel 'UnresolvedRelation [not_exists], 
> [], false, 0
> [info]   at 
> org.apache.spark.SparkException$.internalError(SparkException.scala:77)
> [info]   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$54(CheckAnalysis.scala:753)
>  {code}
> Fix should be simple enough:
> {code:java}
> case tt: RelationTimeTravel =>
>   checkAnalysis0(tt.table) {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44729) Add canonical links to the PySpark docs page

2023-10-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-44729:
--
Fix Version/s: 3.4.2

> Add canonical links to the PySpark docs page
> 
>
> Key: SPARK-44729
> URL: https://issues.apache.org/jira/browse/SPARK-44729
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Assignee: BingKun Pan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.2, 3.5.0, 4.0.0
>
>
> We should add the canonical link to the PySpark docs page 
> [https://spark.apache.org/docs/latest/api/python/index.html] so that the 
> search engine can return the latest PySpark docs.
> Then, we need to update all released documentation pages to add the canonical 
> link pointing to the latest spark documentation of the API (such as group 
> by). Currently, if you Google pyspark groupby, Google will return the docs 
> page from 3.1.1, which is not ideal.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44729) Add canonical links to the PySpark docs page

2023-10-09 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17773473#comment-17773473
 ] 

Dongjoon Hyun commented on SPARK-44729:
---

This landed at branch-3.4 via https://github.com/apache/spark/pull/43285

> Add canonical links to the PySpark docs page
> 
>
> Key: SPARK-44729
> URL: https://issues.apache.org/jira/browse/SPARK-44729
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Assignee: BingKun Pan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.2, 3.5.0, 4.0.0
>
>
> We should add the canonical link to the PySpark docs page 
> [https://spark.apache.org/docs/latest/api/python/index.html] so that the 
> search engine can return the latest PySpark docs.
> Then, we need to update all released documentation pages to add the canonical 
> link pointing to the latest spark documentation of the API (such as group 
> by). Currently, if you Google pyspark groupby, Google will return the docs 
> page from 3.1.1, which is not ideal.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45456) Upgrade maven to 3.9.5

2023-10-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-45456.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43267
[https://github.com/apache/spark/pull/43267]

> Upgrade maven to 3.9.5
> --
>
> Key: SPARK-45456
> URL: https://issues.apache.org/jira/browse/SPARK-45456
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44729) Add canonical links to the PySpark docs page

2023-10-09 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17773474#comment-17773474
 ] 

Dongjoon Hyun commented on SPARK-44729:
---

This landed at branch-3.3 via https://github.com/apache/spark/pull/43286

> Add canonical links to the PySpark docs page
> 
>
> Key: SPARK-44729
> URL: https://issues.apache.org/jira/browse/SPARK-44729
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Assignee: BingKun Pan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.2, 3.5.0, 4.0.0
>
>
> We should add the canonical link to the PySpark docs page 
> [https://spark.apache.org/docs/latest/api/python/index.html] so that the 
> search engine can return the latest PySpark docs.
> Then, we need to update all released documentation pages to add the canonical 
> link pointing to the latest spark documentation of the API (such as group 
> by). Currently, if you Google pyspark groupby, Google will return the docs 
> page from 3.1.1, which is not ideal.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-45443) Revisit TableCacheQueryStage to avoid replicated InMemoryRelation materialization

2023-10-09 Thread Eren Avsarogullari (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17773480#comment-17773480
 ] 

Eren Avsarogullari commented on SPARK-45443:


> But this idea only work for one query
Could you please provide more info on this? For example, the same IMR instance 
can be used by multiple queries. Let's say there are two queries, Q0 and Q1, and 
both of them use the same IMR instance. Q0 will materialize the IMR instance via 
TableCacheQueryStage, and the IMR materialization has to finish before Q0 
completes. Q1 can still introduce a TableCacheQueryStage instance into its 
physical plan for the same IMR instance; however, that TableCacheQueryStage will 
not submit an IMR materialization job because the IMR is already materialized at 
that level, right?
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/QueryStageExec.scala#L281
Are there any other use-cases not covered?

Also, thinking of the following use-cases: 
*1- Queries using AQE under the IMR feature (a.k.a. 
spark.sql.optimizer.canChangeCachedPlanOutputPartitioning):* 
TableCacheQueryStage materializes the IMR by submitting a Spark job per 
TableCacheQueryStage/InMemoryTableScanExec instance. 

*2- Queries not using AQE under the IMR feature:* 
The IMR will be materialized by 
InMemoryTableScanExec.doExecute/doExecuteColumnar(). Can an InMemoryTableScanExec 
based solution (to avoid replicated InMemoryRelation materialization) be more 
inclusive by covering all use-cases?
spark.sql.optimizer.canChangeCachedPlanOutputPartitioning is enabled by default 
in Spark 3.5, but some queries run on < Spark 3.5, or the feature may need to 
be disabled in >= Spark 3.5.

> Revisit TableCacheQueryStage to avoid replicated InMemoryRelation 
> materialization
> -
>
> Key: SPARK-45443
> URL: https://issues.apache.org/jira/browse/SPARK-45443
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Eren Avsarogullari
>Priority: Major
> Attachments: IMR Materialization - Stage 2.png, IMR Materialization - 
> Stage 3.png
>
>
> TableCacheQueryStage is created per InMemoryTableScanExec by 
> AdaptiveSparkPlanExec and it materializes InMemoryTableScanExec output 
> (cached RDD) to provide runtime stats in order to apply AQE  optimizations 
> into remaining physical plan stages. TableCacheQueryStage materializes 
> InMemoryTableScanExec eagerly by submitting job per TableCacheQueryStage 
> instance. For example, if there are 2 TableCacheQueryStage instances 
> referencing same IMR instance (cached RDD) and first InMemoryTableScanExec' s 
> materialization takes longer, following logic will return false 
> (inMemoryTableScan.isMaterialized => false) and this may cause replicated IMR 
> materialization. This behavior can be more visible when cached RDD size is 
> high.
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/QueryStageExec.scala#L281]
> Would like to get community feedback. Thanks in advance.
> cc [~ulysses] [~cloud_fan]
> *Sample Query to simulate the problem:*
> // Both join legs uses same IMR instance
> {code:java}
> import spark.implicits._
> val arr = (1 to 12).map { i => {
> val index = i % 5
> (index, s"Employee_$index", s"Department_$index")
>   }
> }
> val df = arr.toDF("id", "name", "department")
>   .filter('id >= 0)
>   .sort("id")
>   .groupBy('id, 'name, 'department)
>   .count().as("count")
> df.persist()
> val df2 = df.sort("count").filter('count <= 2)
> val df3 = df.sort("count").filter('count >= 3)
> val df4 = df2.join(df3, Seq("id", "name", "department"), "fullouter")
> df4.show() {code}
> *Physical Plan:*
> {code:java}
> == Physical Plan ==
> AdaptiveSparkPlan (31)
> +- == Final Plan ==
>    CollectLimit (21)
>    +- * Project (20)
>       +- * SortMergeJoin FullOuter (19)
>          :- * Sort (10)
>          :  +- * Filter (9)
>          :     +- TableCacheQueryStage (8), Statistics(sizeInBytes=210.0 B, 
> rowCount=5)
>          :        +- InMemoryTableScan (1)
>          :              +- InMemoryRelation (2)
>          :                    +- AdaptiveSparkPlan (7)
>          :                       +- HashAggregate (6)
>          :                          +- Exchange (5)
>          :                             +- HashAggregate (4)
>          :                                +- LocalTableScan (3)
>          +- * Sort (18)
>             +- * Filter (17)
>                +- TableCacheQueryStage (16), Statistics(sizeInBytes=210.0 B, 
> rowCount=5)
>                   +- InMemoryTableScan (11)
>                         +- InMemoryRelation (12)
>                               +- AdaptiveSparkPlan (15)
>                                  +

[jira] [Comment Edited] (SPARK-45443) Revisit TableCacheQueryStage to avoid replicated InMemoryRelation materialization

2023-10-09 Thread Eren Avsarogullari (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17773480#comment-17773480
 ] 

Eren Avsarogullari edited comment on SPARK-45443 at 10/9/23 9:16 PM:
-

> But this idea only work for one query
Could you please provide more info on this? For example, same IMR instance can 
be used by multiple queries. Lets say, there are 2 queries as Q0 & Q1 and both 
of them use same IMR instance. Q0 will be materializing IMR instance by 
TableCacheQueryStage and IMR materialization has to be done before Q0 is 
completed. Q1 can still introduce TableCacheQueryStage instance to physical 
plan for same IMR instance, however, this TableCacheQueryStage instance will 
not submit IMR materialization job due to IMR already materialized at that 
level, right? Can we have any other use-cases not covered?
[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/QueryStageExec.scala#L281]



Also, thinking on following use-cases such as 
*1- Queries using AQE under IMR feature (a.k.a 
spark.sql.optimizer.canChangeCachedPlanOutputPartitioning):* 
TableCacheQueryStage materializes IMR by submitting Spark job per 
TableCacheQueryStage/InMemoryTableScanExec instance. 

*2- Queries not using AQE under IMR feature:* 
IMR will be materialized by 
InMemoryTableScanExec.doExecute/doExecuteColumnar(). Can InMemoryTableScanExec 
based solution (to avoid replicated InMemoryRelation materialization) be more 
inclusive by covering all use-cases?
spark.sql.optimizer.canChangeCachedPlanOutputPartitioning is enabled for Spark 
3.5 as default but for queries using < Spark 3.5 or if the feature may need to 
be disabled in >= Spark 3.5.


was (Author: erenavsarogullari):
> But this idea only work for one query
Could you please provide more info on this? For example, same IMR instance can 
be used by multiple queries. Lets say, there are 2 queries as Q0 & Q1 and both 
of them use same IMR instance. Q0 will be materializing IMR instance by 
TableCacheQueryStage and IMR materialization has to be done before Q0 is 
completed. Q1 can still introduce TableCacheQueryStage instance to physical 
plan for same IMR instance, however, this TableCacheQueryStage instance will 
not submit IMR materialization job due to IMR already materialized at that 
level, right?
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/QueryStageExec.scala#L281
Can we have any other use-cases not covered?

Also, thinking on following use-cases such as 
*1- Queries using AQE under IMR feature (a.k.a 
spark.sql.optimizer.canChangeCachedPlanOutputPartitioning):* 
TableCacheQueryStage materializes IMR by submitting Spark job per 
TableCacheQueryStage/InMemoryTableScanExec instance. 

*2- Queries not using AQE under IMR feature:* 
IMR will be materialized by 
InMemoryTableScanExec.doExecute/doExecuteColumnar(). Can InMemoryTableScanExec 
based solution (to avoid replicated InMemoryRelation materialization) be more 
inclusive by covering all use-cases?
spark.sql.optimizer.canChangeCachedPlanOutputPartitioning is enabled for Spark 
3.5 as default but for queries using < Spark 3.5 or if the feature may need to 
be disabled in >= Spark 3.5.

> Revisit TableCacheQueryStage to avoid replicated InMemoryRelation 
> materialization
> -
>
> Key: SPARK-45443
> URL: https://issues.apache.org/jira/browse/SPARK-45443
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Eren Avsarogullari
>Priority: Major
> Attachments: IMR Materialization - Stage 2.png, IMR Materialization - 
> Stage 3.png
>
>
> TableCacheQueryStage is created per InMemoryTableScanExec by 
> AdaptiveSparkPlanExec and it materializes InMemoryTableScanExec output 
> (cached RDD) to provide runtime stats in order to apply AQE  optimizations 
> into remaining physical plan stages. TableCacheQueryStage materializes 
> InMemoryTableScanExec eagerly by submitting job per TableCacheQueryStage 
> instance. For example, if there are 2 TableCacheQueryStage instances 
> referencing same IMR instance (cached RDD) and first InMemoryTableScanExec' s 
> materialization takes longer, following logic will return false 
> (inMemoryTableScan.isMaterialized => false) and this may cause replicated IMR 
> materialization. This behavior can be more visible when cached RDD size is 
> high.
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/QueryStageExec.scala#L281]
> Would like to get community feedback. Thanks in advance.
> cc [~ulysses] [~cloud_fan]
> *Sample Query to simulate the problem:*
> // Both join legs uses same IMR 

[jira] [Comment Edited] (SPARK-45443) Revisit TableCacheQueryStage to avoid replicated InMemoryRelation materialization

2023-10-09 Thread Eren Avsarogullari (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17773480#comment-17773480
 ] 

Eren Avsarogullari edited comment on SPARK-45443 at 10/9/23 9:17 PM:
-

> But this idea only work for one query
Could you please provide more info on this? For example, same IMR instance can 
be used by multiple queries. Lets say, there are 2 queries as Q0 & Q1 and both 
of them use same IMR instance. Q0 will be materializing IMR instance by 
TableCacheQueryStage and IMR materialization has to be done before Q0 is 
completed. Q1 can still introduce TableCacheQueryStage instance to physical 
plan for same IMR instance, however, this TableCacheQueryStage instance will 
not submit IMR materialization job due to IMR already materialized at that 
level, right? Can we have any other use-cases not covered?
[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/QueryStageExec.scala#L281]

Also, thinking on following use-cases such as 
*1- Queries using AQE under IMR feature (a.k.a 
spark.sql.optimizer.canChangeCachedPlanOutputPartitioning):* 
TableCacheQueryStage materializes IMR by submitting Spark job per 
TableCacheQueryStage/InMemoryTableScanExec instance. 

*2- Queries not using AQE under IMR feature:* 
IMR will be materialized by InMemoryTableScanExec.doExecute/doExecuteColumnar() 
flow. Can InMemoryTableScanExec based solution (to avoid replicated 
InMemoryRelation materialization) be more inclusive by covering all use-cases?
spark.sql.optimizer.canChangeCachedPlanOutputPartitioning is enabled for Spark 
3.5 as default but for queries using < Spark 3.5 or if the feature may need to 
be disabled in >= Spark 3.5 for some reason.


was (Author: erenavsarogullari):
> But this idea only work for one query
Could you please provide more info on this? For example, same IMR instance can 
be used by multiple queries. Lets say, there are 2 queries as Q0 & Q1 and both 
of them use same IMR instance. Q0 will be materializing IMR instance by 
TableCacheQueryStage and IMR materialization has to be done before Q0 is 
completed. Q1 can still introduce TableCacheQueryStage instance to physical 
plan for same IMR instance, however, this TableCacheQueryStage instance will 
not submit IMR materialization job due to IMR already materialized at that 
level, right? Can we have any other use-cases not covered?
[https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/QueryStageExec.scala#L281]



Also, thinking on following use-cases such as 
*1- Queries using AQE under IMR feature (a.k.a 
spark.sql.optimizer.canChangeCachedPlanOutputPartitioning):* 
TableCacheQueryStage materializes IMR by submitting Spark job per 
TableCacheQueryStage/InMemoryTableScanExec instance. 

*2- Queries not using AQE under IMR feature:* 
IMR will be materialized by 
InMemoryTableScanExec.doExecute/doExecuteColumnar(). Can InMemoryTableScanExec 
based solution (to avoid replicated InMemoryRelation materialization) be more 
inclusive by covering all use-cases?
spark.sql.optimizer.canChangeCachedPlanOutputPartitioning is enabled for Spark 
3.5 as default but for queries using < Spark 3.5 or if the feature may need to 
be disabled in >= Spark 3.5.

> Revisit TableCacheQueryStage to avoid replicated InMemoryRelation 
> materialization
> -
>
> Key: SPARK-45443
> URL: https://issues.apache.org/jira/browse/SPARK-45443
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Eren Avsarogullari
>Priority: Major
> Attachments: IMR Materialization - Stage 2.png, IMR Materialization - 
> Stage 3.png
>
>
> TableCacheQueryStage is created per InMemoryTableScanExec by 
> AdaptiveSparkPlanExec and it materializes InMemoryTableScanExec output 
> (cached RDD) to provide runtime stats in order to apply AQE  optimizations 
> into remaining physical plan stages. TableCacheQueryStage materializes 
> InMemoryTableScanExec eagerly by submitting job per TableCacheQueryStage 
> instance. For example, if there are 2 TableCacheQueryStage instances 
> referencing same IMR instance (cached RDD) and first InMemoryTableScanExec' s 
> materialization takes longer, following logic will return false 
> (inMemoryTableScan.isMaterialized => false) and this may cause replicated IMR 
> materialization. This behavior can be more visible when cached RDD size is 
> high.
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/QueryStageExec.scala#L281]
> Would like to get community feedback. Thanks in advance.
> cc [~ulysses] [~cloud_fan]
> *Sample Query to simulate the problem:*
> // Both jo

[jira] [Updated] (SPARK-45221) Refine docstring of `DataFrameReader.parquet`

2023-10-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45221:
---
Labels: pull-request-available  (was: )

> Refine docstring of `DataFrameReader.parquet`
> -
>
> Key: SPARK-45221
> URL: https://issues.apache.org/jira/browse/SPARK-45221
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
>  Labels: pull-request-available
>
> Refine the docstring of read parquet



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45473) Incorrect error message for RoundBase

2023-10-09 Thread L. C. Hsieh (Jira)
L. C. Hsieh created SPARK-45473:
---

 Summary: Incorrect error message for RoundBase
 Key: SPARK-45473
 URL: https://issues.apache.org/jira/browse/SPARK-45473
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.5.0, 3.4.1
Reporter: L. C. Hsieh






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-43299) JVM Client throw StreamingQueryException when error handling is implemented

2023-10-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-43299:


Assignee: Yihong He

> JVM Client throw StreamingQueryException when error handling is implemented
> ---
>
> Key: SPARK-43299
> URL: https://issues.apache.org/jira/browse/SPARK-43299
> Project: Spark
>  Issue Type: Task
>  Components: Connect, Structured Streaming
>Affects Versions: 3.5.0
>Reporter: Wei Liu
>Assignee: Yihong He
>Priority: Major
>  Labels: pull-request-available
>
> Currently the awaitTermination() method of Connect's JVM client 
> StreamingQuery won't throw an error when the query fails with an exception. 
>  
> In Python Connect this is handled directly by the Python client's 
> error-handling framework, but no such framework exists in the JVM client 
> right now.
>  
> We should verify this works once the JVM client adds that framework.
>  
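
For reference, the expected caller-side behavior once this is wired up would be 
along the lines of the usage sketch below (public StreamingQuery API; the error 
reporting itself is illustrative):

{code:java}
import org.apache.spark.sql.streaming.{StreamingQuery, StreamingQueryException}

// Usage sketch: awaitTermination() should surface a query failure as a
// StreamingQueryException that the caller can catch, matching the non-Connect API.
def awaitAndReport(query: StreamingQuery): Unit = {
  try {
    query.awaitTermination()
  } catch {
    case e: StreamingQueryException =>
      println(s"query ${query.id} failed: ${e.getMessage}")
  }
}
{code}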



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-43299) JVM Client throw StreamingQueryException when error handling is implemented

2023-10-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-43299.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 42859
[https://github.com/apache/spark/pull/42859]

> JVM Client throw StreamingQueryException when error handling is implemented
> ---
>
> Key: SPARK-43299
> URL: https://issues.apache.org/jira/browse/SPARK-43299
> Project: Spark
>  Issue Type: Task
>  Components: Connect, Structured Streaming
>Affects Versions: 3.5.0
>Reporter: Wei Liu
>Assignee: Yihong He
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Currently the awaitTermination() method of Connect's JVM client 
> StreamingQuery won't throw an error when the query fails with an exception. 
>  
> In Python Connect this is handled directly by the Python client's 
> error-handling framework, but no such framework exists in the JVM client 
> right now.
>  
> We should verify this works once the JVM client adds that framework.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45473) Incorrect error message for RoundBase

2023-10-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45473:
---
Labels: pull-request-available  (was: )

> Incorrect error message for RoundBase
> -
>
> Key: SPARK-45473
> URL: https://issues.apache.org/jira/browse/SPARK-45473
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.1, 3.5.0
>Reporter: L. C. Hsieh
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44434) Add more tests for Scala foreachBatch and streaming listeners

2023-10-09 Thread Brian Carlson (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17773501#comment-17773501
 ] 

Brian Carlson commented on SPARK-44434:
---

I'd like to work on this.

> Add more tests for Scala foreachBatch and streaming listeners 
> --
>
> Key: SPARK-44434
> URL: https://issues.apache.org/jira/browse/SPARK-44434
> Project: Spark
>  Issue Type: Task
>  Components: Connect, Structured Streaming
>Affects Versions: 3.4.1
>Reporter: Raghu Angadi
>Priority: Major
>
> Currently there are very few tests for Scala foreachBatch. Consider adding 
> more tests and covering more test scenarios (multiple queries etc). 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45419) Avoid reusing rocksdb sst files in a different rocksdb instance by removing file version map entry of larger versions

2023-10-09 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-45419.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43174
[https://github.com/apache/spark/pull/43174]

> Avoid reusing rocksdb sst files in a different rocksdb instance by removing 
> file version map entry of larger versions
> 
>
> Key: SPARK-45419
> URL: https://issues.apache.org/jira/browse/SPARK-45419
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.5.0
>Reporter: Chaoqin Li
>Assignee: Chaoqin Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> When loading a RocksDB instance, remove the file version map entries of larger 
> versions to avoid a RocksDB SST file unique-id mismatch exception. The SST 
> files in larger versions can't be reused even if they have the same size and 
> name, because they belong to another RocksDB instance.
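
A rough sketch of the pruning described above; the map shape and names are 
assumptions for illustration, not the actual RocksDBFileManager internals:

{code:java}
import scala.collection.mutable

// When loading version `loadedVersion`, drop mappings recorded for any larger
// version so SST files written by a different RocksDB instance are never reused
// just because their name and size happen to match.
def pruneNewerVersions(versionToSstFiles: mutable.Map[Long, Seq[String]],
                       loadedVersion: Long): Unit = {
  versionToSstFiles.keys.filter(_ > loadedVersion).toSeq
    .foreach(versionToSstFiles.remove)
}
{code}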



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-45443) Revisit TableCacheQueryStage to avoid replicated InMemoryRelation materialization

2023-10-09 Thread XiDuo You (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17773507#comment-17773507
 ] 

XiDuo You commented on SPARK-45443:
---

> But this idea only work for one query

Please see the following part: `say, if there are multi-queries which reference 
the same cached RDD (e.g., in thriftserver)`. There is a race condition if 
multiple queries reference and materialize the same cached RDD: they run in 
different query executions and on different threads.

> spark.sql.optimizer.canChangeCachedPlanOutputPartitioning

It is completely unrelated to TableCacheQueryStage. This config is used to make 
AQE work for the cached plan.
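
As a generic illustration of the race being discussed (not Spark internals), 
deduplicating concurrent materialization of the same cached relation could look 
like keying an in-flight future by a stable cache id; all names below are 
assumptions:

{code:java}
import scala.collection.mutable
import scala.concurrent.{ExecutionContext, Future}

// Generic sketch: the first caller for a given cacheId starts the materialization
// job; racing callers from other query executions get the same Future instead of
// submitting a duplicate job.
object CacheMaterialization {
  private val inFlight = mutable.Map.empty[Long, Future[Unit]]

  def materializeOnce(cacheId: Long)(materialize: () => Unit)
                     (implicit ec: ExecutionContext): Future[Unit] = synchronized {
    inFlight.getOrElseUpdate(cacheId, Future(materialize()))
  }
}
{code}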

> Revisit TableCacheQueryStage to avoid replicated InMemoryRelation 
> materialization
> -
>
> Key: SPARK-45443
> URL: https://issues.apache.org/jira/browse/SPARK-45443
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Eren Avsarogullari
>Priority: Major
> Attachments: IMR Materialization - Stage 2.png, IMR Materialization - 
> Stage 3.png
>
>
> TableCacheQueryStage is created per InMemoryTableScanExec by 
> AdaptiveSparkPlanExec and it materializes InMemoryTableScanExec output 
> (cached RDD) to provide runtime stats in order to apply AQE  optimizations 
> into remaining physical plan stages. TableCacheQueryStage materializes 
> InMemoryTableScanExec eagerly by submitting job per TableCacheQueryStage 
> instance. For example, if there are 2 TableCacheQueryStage instances 
> referencing same IMR instance (cached RDD) and first InMemoryTableScanExec' s 
> materialization takes longer, following logic will return false 
> (inMemoryTableScan.isMaterialized => false) and this may cause replicated IMR 
> materialization. This behavior can be more visible when cached RDD size is 
> high.
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/QueryStageExec.scala#L281]
> Would like to get community feedback. Thanks in advance.
> cc [~ulysses] [~cloud_fan]
> *Sample Query to simulate the problem:*
> // Both join legs uses same IMR instance
> {code:java}
> import spark.implicits._
> val arr = (1 to 12).map { i => {
> val index = i % 5
> (index, s"Employee_$index", s"Department_$index")
>   }
> }
> val df = arr.toDF("id", "name", "department")
>   .filter('id >= 0)
>   .sort("id")
>   .groupBy('id, 'name, 'department)
>   .count().as("count")
> df.persist()
> val df2 = df.sort("count").filter('count <= 2)
> val df3 = df.sort("count").filter('count >= 3)
> val df4 = df2.join(df3, Seq("id", "name", "department"), "fullouter")
> df4.show() {code}
> *Physical Plan:*
> {code:java}
> == Physical Plan ==
> AdaptiveSparkPlan (31)
> +- == Final Plan ==
>    CollectLimit (21)
>    +- * Project (20)
>       +- * SortMergeJoin FullOuter (19)
>          :- * Sort (10)
>          :  +- * Filter (9)
>          :     +- TableCacheQueryStage (8), Statistics(sizeInBytes=210.0 B, 
> rowCount=5)
>          :        +- InMemoryTableScan (1)
>          :              +- InMemoryRelation (2)
>          :                    +- AdaptiveSparkPlan (7)
>          :                       +- HashAggregate (6)
>          :                          +- Exchange (5)
>          :                             +- HashAggregate (4)
>          :                                +- LocalTableScan (3)
>          +- * Sort (18)
>             +- * Filter (17)
>                +- TableCacheQueryStage (16), Statistics(sizeInBytes=210.0 B, 
> rowCount=5)
>                   +- InMemoryTableScan (11)
>                         +- InMemoryRelation (12)
>                               +- AdaptiveSparkPlan (15)
>                                  +- HashAggregate (14)
>                                     +- Exchange (13)
>                                        +- HashAggregate (4)
>                                           +- LocalTableScan (3) {code}
> *Stages DAGs materializing the same IMR instance:*
> !IMR Materialization - Stage 2.png|width=303,height=134!
> !IMR Materialization - Stage 3.png|width=303,height=134!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45472) RocksDB State Store Doesn't Need to Recheck checkpoint path existence

2023-10-09 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim reassigned SPARK-45472:


Assignee: Siying Dong

> RocksDB State Store Doesn't Need to Recheck checkpoint path existence
> -
>
> Key: SPARK-45472
> URL: https://issues.apache.org/jira/browse/SPARK-45472
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 3.5.1
>Reporter: Siying Dong
>Assignee: Siying Dong
>Priority: Minor
>  Labels: pull-request-available
>
> Right now, every time RocksDB.load() is called, we check checkpoint directory 
> existence and create it if not. This is relatively expensive and show up in 
> performance profiling. We don't need to do it the second time.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45472) RocksDB State Store Doesn't Need to Recheck checkpoint path existence

2023-10-09 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-45472.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43299
[https://github.com/apache/spark/pull/43299]

> RocksDB State Store Doesn't Need to Recheck checkpoint path existence
> -
>
> Key: SPARK-45472
> URL: https://issues.apache.org/jira/browse/SPARK-45472
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 3.5.1
>Reporter: Siying Dong
>Assignee: Siying Dong
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Right now, every time RocksDB.load() is called, we check checkpoint directory 
> existence and create it if not. This is relatively expensive and show up in 
> performance profiling. We don't need to do it the second time.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45474) Support top-level filtering in MasterPage JSON API

2023-10-09 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-45474:
-

 Summary: Support top-level filtering in MasterPage JSON API
 Key: SPARK-45474
 URL: https://issues.apache.org/jira/browse/SPARK-45474
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, Web UI
Affects Versions: 4.0.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45474) Support top-level filtering in MasterPage JSON API

2023-10-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45474:
---
Labels: pull-request-available  (was: )

> Support top-level filtering in MasterPage JSON API
> --
>
> Key: SPARK-45474
> URL: https://issues.apache.org/jira/browse/SPARK-45474
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Web UI
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45475) Should use DataFrame.foreachBatch instead of RDD.foreachBatch in JdbcUtils

2023-10-09 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-45475:


 Summary: Should use DataFrame.foreachBatch instead of 
RDD.foreachBatch in JdbcUtils
 Key: SPARK-45475
 URL: https://issues.apache.org/jira/browse/SPARK-45475
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Hyukjin Kwon


See https://github.com/apache/spark/pull/39976#issuecomment-1752930380



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45475) Should use DataFrame.foreachPartition instead of RDD.foreachPartition in JdbcUtils

2023-10-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-45475:
-
Summary: Should use DataFrame.foreachPartition instead of 
RDD.foreachPartition in JdbcUtils  (was: Should use DataFrame.foreachBatch 
instead of RDD.foreachBatch in JdbcUtils)

> Should use DataFrame.foreachPartition instead of RDD.foreachPartition in 
> JdbcUtils
> --
>
> Key: SPARK-45475
> URL: https://issues.apache.org/jira/browse/SPARK-45475
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> See https://github.com/apache/spark/pull/39976#issuecomment-1752930380
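For context, a rough sketch of the pattern the ticket refers to (illustrative only; writePartitions is a hypothetical helper, not the actual JdbcUtils code): write each partition through Dataset.foreachPartition instead of dropping down to the underlying RDD first.

{code:scala}
import java.sql.DriverManager
import org.apache.spark.sql.{DataFrame, Row}

object JdbcWriteSketch {
  // Hypothetical helper, not the actual JdbcUtils implementation.
  def writePartitions(df: DataFrame, url: String): Unit = {
    // Preferred: stay on the Dataset API rather than calling df.rdd.foreachPartition,
    // which first converts the DataFrame into an RDD[Row].
    df.foreachPartition { (rows: Iterator[Row]) =>
      val conn = DriverManager.getConnection(url) // one connection per partition
      try {
        rows.foreach { row =>
          () // write `row` through `conn` here
        }
      } finally {
        conn.close()
      }
    }
  }
}
{code}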



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45475) Should use DataFrame.foreachPartition instead of RDD.foreachPartition in JdbcUtils

2023-10-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45475:
---
Labels: pull-request-available  (was: )

> Should use DataFrame.foreachPartition instead of RDD.foreachPartition in 
> JdbcUtils
> --
>
> Key: SPARK-45475
> URL: https://issues.apache.org/jira/browse/SPARK-45475
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>  Labels: pull-request-available
>
> See https://github.com/apache/spark/pull/39976#issuecomment-1752930380



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45450) Fix imports according to PEP8: pyspark.pandas and pyspark (core)

2023-10-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-45450.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43257
[https://github.com/apache/spark/pull/43257]

> Fix imports according to PEP8: pyspark.pandas and pyspark (core)
> 
>
> Key: SPARK-45450
> URL: https://issues.apache.org/jira/browse/SPARK-45450
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> https://peps.python.org/pep-0008/#imports



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45476) Raise exception directly instead of calling `resolveColumnsByPosition`

2023-10-09 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-45476:
---

 Summary: Raise exception directly instead of calling 
`resolveColumnsByPosition`
 Key: SPARK-45476
 URL: https://issues.apache.org/jira/browse/SPARK-45476
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 4.0.0
Reporter: Haejoon Lee


We can throw the error directly when resolving output columns fails, instead of 
calling {{resolveColumnsByPosition}} again just to raise it.
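As a rough illustration of the idea (hypothetical names throughout; this is not the actual analyzer code), the error can be raised where the mismatch is detected rather than by invoking the positional resolver a second time solely so that it throws:

{code:scala}
object ResolveSketch {
  // Hypothetical sketch: resolveOutputColumns and the message are illustrative.
  def resolveOutputColumns(expected: Seq[String], actual: Seq[String]): Seq[String] = {
    if (expected.length != actual.length) {
      // Throw right here instead of calling the positional resolver again
      // just to reproduce the same failure.
      throw new IllegalArgumentException(
        s"Cannot resolve columns by position: expected ${expected.length} " +
          s"columns but got ${actual.length}")
    }
    actual
  }
}
{code}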



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45476) Raise exception directly instead of calling `resolveColumnsByPosition`

2023-10-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45476:
---
Labels: pull-request-available  (was: )

> Raise exception directly instead of calling `resolveColumnsByPosition`
> --
>
> Key: SPARK-45476
> URL: https://issues.apache.org/jira/browse/SPARK-45476
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Haejoon Lee
>Priority: Major
>  Labels: pull-request-available
>
> We can throw the error directly when resolving output columns fails, instead of 
> calling {{resolveColumnsByPosition}} again just to raise it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45470) Avoid paste string value of hive orc compression kind

2023-10-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-45470.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43296
[https://github.com/apache/spark/pull/43296]

> Avoid paste string value of hive orc compression kind
> -
>
> Key: SPARK-45470
> URL: https://issues.apache.org/jira/browse/SPARK-45470
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Jiaan Geng
>Assignee: Jiaan Geng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Currently, Hive supports the ORC format with several compression codecs (please 
> refer to org.apache.hadoop.hive.ql.io.orc.CompressionKind).
> Spark hard-codes the string names of these compression codecs in many places.
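A small sketch of the improvement direction (an assumption about the intent, not the actual patch): derive the supported codec names from the Hive enum instead of repeating string literals such as "NONE", "ZLIB" or "SNAPPY".

{code:scala}
import org.apache.hadoop.hive.ql.io.orc.CompressionKind

object OrcCodecSketch {
  // Build the set of codec names from the enum instead of hard-coded strings.
  val supportedCodecs: Set[String] = CompressionKind.values().map(_.name).toSet

  // Hypothetical validation helper using the derived set.
  def checkOrcCodec(codec: String): Unit = {
    require(supportedCodecs.contains(codec.toUpperCase),
      s"Unsupported ORC compression codec: $codec")
  }
}
{code}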



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45477) Use `matrix.java/inputs.java` to replace the hardcoded Java version in `test results/unit tests log` naming

2023-10-09 Thread Yang Jie (Jira)
Yang Jie created SPARK-45477:


 Summary: Use `matrix.java/inputs.java` to replace the hardcoded 
Java version in `test results/unit tests log` naming
 Key: SPARK-45477
 URL: https://issues.apache.org/jira/browse/SPARK-45477
 Project: Spark
  Issue Type: Improvement
  Components: Project Infra
Affects Versions: 4.0.0
Reporter: Yang Jie






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-45477) Use `matrix.java/inputs.java` to replace the hardcoded Java version in `test results/unit tests log` naming

2023-10-09 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45477:
---
Labels: pull-request-available  (was: )

> Use `matrix.java/inputs.java` to replace the hardcoded Java version in `test 
> results/unit tests log` naming
> ---
>
> Key: SPARK-45477
> URL: https://issues.apache.org/jira/browse/SPARK-45477
> Project: Spark
>  Issue Type: Improvement
>  Components: Project Infra
>Affects Versions: 4.0.0
>Reporter: Yang Jie
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45458) Convert IllegalArgumentException to SparkIllegalArgumentException in bitwiseExpressions and add some UT

2023-10-09 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-45458.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43271
[https://github.com/apache/spark/pull/43271]

> Convert IllegalArgumentException to SparkIllegalArgumentException in 
> bitwiseExpressions and add some UT
> ---
>
> Key: SPARK-45458
> URL: https://issues.apache.org/jira/browse/SPARK-45458
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45458) Convert IllegalArgumentException to SparkIllegalArgumentException in bitwiseExpressions and add some UT

2023-10-09 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-45458:


Assignee: BingKun Pan

> Convert IllegalArgumentException to SparkIllegalArgumentException in 
> bitwiseExpressions and add some UT
> ---
>
> Key: SPARK-45458
> URL: https://issues.apache.org/jira/browse/SPARK-45458
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45205) Since version 3.2.0, Spark SQL has taken longer to execute "show partitions", probably because of changes introduced by SPARK-35278

2023-10-09 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-45205.
-
Fix Version/s: 3.5.1
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 43270
[https://github.com/apache/spark/pull/43270]

> Since version 3.2.0, Spark SQL has taken longer to execute "show 
> partitions", probably because of changes introduced by SPARK-35278
> --
>
> Key: SPARK-45205
> URL: https://issues.apache.org/jira/browse/SPARK-45205
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Qiang Yang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.1, 4.0.0
>
>
> After SPARK-35378 was changed, I noticed that the execution of statements 
> such as 'show partitions test' became slower.
> The cause is that the execution path changed from 
> ExecutedCommandExec to CommandResultExec. ExecutedCommandExec originally 
> implemented the following method:
> override def executeToIterator(): Iterator[InternalRow] = 
> sideEffectResult.iterator
> CommandResultExec does not override it, so when hasNext is called on the 
> returned iterator, a Spark job is launched, which increases the execution time.
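A hedged sketch of the fix direction described above (not the actual Spark patch; CommandResultExecSketch and unsafeRows are stand-ins): give the command-result node an executeToIterator that serves the already-materialized rows directly, mirroring what ExecutedCommandExec did, so consuming the iterator does not launch a job.

{code:scala}
import org.apache.spark.sql.catalyst.InternalRow

// Stand-in for the real CommandResultExec; `unsafeRows` represents the command
// output that has already been computed on the driver.
class CommandResultExecSketch(unsafeRows: Array[InternalRow]) {

  // Return the pre-computed rows directly instead of scheduling a Spark job
  // the first time hasNext is called.
  def executeToIterator(): Iterator[InternalRow] = unsafeRows.iterator
}
{code}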



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45205) Since version 3.2.0, Spark SQL has taken longer to execute "show partitions", probably because of changes introduced by SPARK-35278

2023-10-09 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-45205:
---

Assignee: Qiang Yang

> Since version 3.2.0, Spark SQL has taken longer to execute "show 
> partitions", probably because of changes introduced by SPARK-35278
> --
>
> Key: SPARK-45205
> URL: https://issues.apache.org/jira/browse/SPARK-45205
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Qiang Yang
>Assignee: Qiang Yang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.5.1
>
>
> After SPARK-35378 was changed, I noticed that the execution of statements 
> such as 'show partitions test' became slower.
> The cause is that the execution path changed from 
> ExecutedCommandExec to CommandResultExec. ExecutedCommandExec originally 
> implemented the following method:
> override def executeToIterator(): Iterator[InternalRow] = 
> sideEffectResult.iterator
> CommandResultExec does not override it, so when hasNext is called on the 
> returned iterator, a Spark job is launched, which increases the execution time.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45449) Cache Invalidation Issue with JDBC Table

2023-10-09 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-45449.
-
Fix Version/s: 3.5.1
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 43258
[https://github.com/apache/spark/pull/43258]

> Cache Invalidation Issue with JDBC Table
> 
>
> Key: SPARK-45449
> URL: https://issues.apache.org/jira/browse/SPARK-45449
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: liangyongyuan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.1, 4.0.0
>
>
> We have identified a cache invalidation issue when caching JDBC tables in 
> Spark SQL. The cached table is unexpectedly invalidated when queried, leading 
> to a re-read from the JDBC table instead of retrieving data from the cache.
> Example SQL:
> {code:java}
> CACHE TABLE cache_t SELECT * FROM mysql.test.test1;
> SELECT * FROM cache_t;
> {code}
> Expected Behavior:
> The expectation is that querying the cached table (cache_t) should retrieve 
> the result from the cache without re-evaluating the execution plan.
> Actual Behavior:
> However, the cache is invalidated, and the content is re-read from the JDBC 
> table.
> Root Cause:
> The issue lies in the 'CacheData' class, where the comparison involves 
> 'JDBCTable.' The 'JDBCTable' is a case class:
> {code:java}
> case class JDBCTable(ident: Identifier, schema: StructType, jdbcOptions: 
> JDBCOptions)
> {code}
> The comparison of non-case class components, such as 'jdbcOptions,' involves 
> pointer comparison. This leads to unnecessary cache invalidation.
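A minimal sketch of the root cause described above, using simplified stand-in classes (the real JDBCOptions and JDBCTable carry more state): because the options class does not define equals, case-class equality falls back to reference comparison for that field, so two logically identical tables do not compare equal and the cache lookup misses. Comparing the underlying parameter map is shown as one possible remedy, stated as an assumption rather than the actual fix.

{code:scala}
object CacheKeySketch extends App {
  // Plain class without equals/hashCode, standing in for JDBCOptions.
  class Options(val parameters: Map[String, String])

  // Case class standing in for JDBCTable.
  case class Table(name: String, options: Options)

  val a = Table("t", new Options(Map("url" -> "jdbc:mysql://host/test")))
  val b = Table("t", new Options(Map("url" -> "jdbc:mysql://host/test")))

  // Case-class equality compares fields, but `options` is compared by reference,
  // so the two logically identical tables are not equal: the cache misses.
  assert(a != b)

  // Comparing the underlying parameters instead treats the tables as equal.
  assert(a.name == b.name && a.options.parameters == b.options.parameters)
}
{code}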



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45449) Cache Invalidation Issue with JDBC Table

2023-10-09 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-45449:
---

Assignee: liangyongyuan

> Cache Invalidation Issue with JDBC Table
> 
>
> Key: SPARK-45449
> URL: https://issues.apache.org/jira/browse/SPARK-45449
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: liangyongyuan
>Assignee: liangyongyuan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0, 3.5.1
>
>
> We have identified a cache invalidation issue when caching JDBC tables in 
> Spark SQL. The cached table is unexpectedly invalidated when queried, leading 
> to a re-read from the JDBC table instead of retrieving data from the cache.
> Example SQL:
> {code:java}
> CACHE TABLE cache_t SELECT * FROM mysql.test.test1;
> SELECT * FROM cache_t;
> {code}
> Expected Behavior:
> The expectation is that querying the cached table (cache_t) should retrieve 
> the result from the cache without re-evaluating the execution plan.
> Actual Behavior:
> However, the cache is invalidated, and the content is re-read from the JDBC 
> table.
> Root Cause:
> The issue lies in the 'CacheData' class, where the comparison involves 
> 'JDBCTable.' The 'JDBCTable' is a case class:
> {code:java}
> case class JDBCTable(ident: Identifier, schema: StructType, jdbcOptions: 
> JDBCOptions)
> {code}
> The comparison of non-case class components, such as 'jdbcOptions,' involves 
> pointer comparison. This leads to unnecessary cache invalidation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-45474) Support top-level filtering in MasterPage JSON API

2023-10-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-45474.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 43303
[https://github.com/apache/spark/pull/43303]

> Support top-level filtering in MasterPage JSON API
> --
>
> Key: SPARK-45474
> URL: https://issues.apache.org/jira/browse/SPARK-45474
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Web UI
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-45474) Support top-level filtering in MasterPage JSON API

2023-10-09 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-45474:
-

Assignee: Dongjoon Hyun

> Support top-level filtering in MasterPage JSON API
> --
>
> Key: SPARK-45474
> URL: https://issues.apache.org/jira/browse/SPARK-45474
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Web UI
>Affects Versions: 4.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>  Labels: pull-request-available
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org