[jira] [Created] (SPARK-42508) Extract the common .ml classes to `mllib-common`
Ruifeng Zheng created SPARK-42508: - Summary: Extract the common .ml classes to `mllib-common` Key: SPARK-42508 URL: https://issues.apache.org/jira/browse/SPARK-42508 Project: Spark Issue Type: Sub-task Components: Connect, ML Affects Versions: 3.4.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42507) Simplify ORC schema merging conflict error check
[ https://issues.apache.org/jira/browse/SPARK-42507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691420#comment-17691420 ] Apache Spark commented on SPARK-42507: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/40101 > Simplify ORC schema merging conflict error check > > > Key: SPARK-42507 > URL: https://issues.apache.org/jira/browse/SPARK-42507 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42507) Simplify ORC schema merging conflict error check
[ https://issues.apache.org/jira/browse/SPARK-42507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42507: Assignee: (was: Apache Spark) > Simplify ORC schema merging conflict error check > > > Key: SPARK-42507 > URL: https://issues.apache.org/jira/browse/SPARK-42507 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42507) Simplify ORC schema merging conflict error check
[ https://issues.apache.org/jira/browse/SPARK-42507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691419#comment-17691419 ] Apache Spark commented on SPARK-42507: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/40101 > Simplify ORC schema merging conflict error check > > > Key: SPARK-42507 > URL: https://issues.apache.org/jira/browse/SPARK-42507 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42507) Simplify ORC schema merging conflict error check
[ https://issues.apache.org/jira/browse/SPARK-42507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42507: Assignee: Apache Spark > Simplify ORC schema merging conflict error check > > > Key: SPARK-42507 > URL: https://issues.apache.org/jira/browse/SPARK-42507 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42507) Simplify ORC schema merging conflict error check
[ https://issues.apache.org/jira/browse/SPARK-42507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-42507: -- Summary: Simplify ORC schema merging conflict error check (was: Simplify schema merging conflict error check) > Simplify ORC schema merging conflict error check > > > Key: SPARK-42507 > URL: https://issues.apache.org/jira/browse/SPARK-42507 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42507) Simplify schema merging conflict error check
Dongjoon Hyun created SPARK-42507: - Summary: Simplify schema merging conflict error check Key: SPARK-42507 URL: https://issues.apache.org/jira/browse/SPARK-42507 Project: Spark Issue Type: Test Components: SQL, Tests Affects Versions: 3.4.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
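For reference, the ORC schema-merging conflict this check exercises can be reproduced with a minimal sketch like the following (hypothetical path; assumes a running `spark` session and the ORC reader's `mergeSchema` option available since Spark 3.0):

{code:scala}
// Sketch of a schema-merging conflict: two ORC files in one directory whose
// shared column `c` has incompatible types (LONG vs STRING).
spark.range(3).selectExpr("id AS c").write.mode("append").orc("/tmp/orc_merge_demo")
spark.range(3).selectExpr("CAST(id AS string) AS c").write.mode("append").orc("/tmp/orc_merge_demo")

// With schema merging enabled, inferring the read schema should fail with a
// merge-conflict error, which is the error path this test issue simplifies.
spark.read.option("mergeSchema", "true").orc("/tmp/orc_merge_demo").printSchema()
{code}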
[jira] [Assigned] (SPARK-37099) Introduce a rank-based filter to optimize top-k computation
[ https://issues.apache.org/jira/browse/SPARK-37099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-37099: --- Assignee: jiaan.geng > Introduce a rank-based filter to optimize top-k computation > --- > > Key: SPARK-37099 > URL: https://issues.apache.org/jira/browse/SPARK-37099 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: jiaan.geng >Priority: Major > Fix For: 3.5.0 > > Attachments: q67.png, q67_optimized.png, skewed_window.png > > > At JD, we found that more than 90% of window function usage follows this > pattern: > {code:java} > select (... (row_number|rank|dense_rank) () over( [partition by ...] order > by ... ) as rn) > where rn (==|<|<=) k and other conditions{code} > > However, the existing physical plan is not optimal: > > 1. We should select the local top-k records within each partition and then > compute the global top-k; this helps reduce the shuffle amount. > > For these three rank functions (row_number|rank|dense_rank), the rank of a > key computed on a partial dataset is always <= its final rank computed on > the whole dataset, so we can safely discard rows whose partial rank > k > anywhere. > > > 2. Skewed window: some partitions are skewed and take a long time to finish > computation. > > A real-world skewed-window case in our system is attached. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37099) Introduce a rank-based filter to optimize top-k computation
[ https://issues.apache.org/jira/browse/SPARK-37099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-37099. - Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 38799 [https://github.com/apache/spark/pull/38799] > Introduce a rank-based filter to optimize top-k computation > --- > > Key: SPARK-37099 > URL: https://issues.apache.org/jira/browse/SPARK-37099 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > Fix For: 3.5.0 > > Attachments: q67.png, q67_optimized.png, skewed_window.png > > > At JD, we found that more than 90% of window function usage follows this > pattern: > {code:java} > select (... (row_number|rank|dense_rank) () over( [partition by ...] order > by ... ) as rn) > where rn (==|<|<=) k and other conditions{code} > > However, the existing physical plan is not optimal: > > 1. We should select the local top-k records within each partition and then > compute the global top-k; this helps reduce the shuffle amount. > > For these three rank functions (row_number|rank|dense_rank), the rank of a > key computed on a partial dataset is always <= its final rank computed on > the whole dataset, so we can safely discard rows whose partial rank > k > anywhere. > > > 2. Skewed window: some partitions are skewed and take a long time to finish > computation. > > A real-world skewed-window case in our system is attached. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
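To make the targeted pattern concrete, here is a minimal sketch of the kind of top-k-per-group query the rank-based filter optimizes. The table and column names (`orders`, `store_id`, `amount`) are hypothetical, and a running `spark` session is assumed:

{code:scala}
// Top 10 orders per store by amount. With the rank-based filter from
// SPARK-37099, rows whose partial rank already exceeds 10 can be discarded
// before the shuffle, instead of ranking every row globally first.
val topK = spark.sql("""
  SELECT * FROM (
    SELECT store_id, order_id, amount,
           row_number() OVER (PARTITION BY store_id ORDER BY amount DESC) AS rn
    FROM orders
  ) t
  WHERE rn <= 10
""")
topK.show()
{code}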
[jira] [Assigned] (SPARK-42506) Fix Sort's maxRowsPerPartition if maxRows does not exist
[ https://issues.apache.org/jira/browse/SPARK-42506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42506: Assignee: (was: Apache Spark) > Fix Sort's maxRowsPerPartition if maxRows does not exist > > > Key: SPARK-42506 > URL: https://issues.apache.org/jira/browse/SPARK-42506 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42506) Fix Sort's maxRowsPerPartition if maxRows does not exist
[ https://issues.apache.org/jira/browse/SPARK-42506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42506: Assignee: Apache Spark > Fix Sort's maxRowsPerPartition if maxRows does not exist > > > Key: SPARK-42506 > URL: https://issues.apache.org/jira/browse/SPARK-42506 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42506) Fix Sort's maxRowsPerPartition if maxRows does not exist
[ https://issues.apache.org/jira/browse/SPARK-42506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691414#comment-17691414 ] Apache Spark commented on SPARK-42506: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/40100 > Fix Sort's maxRowsPerPartition if maxRows does not exist > > > Key: SPARK-42506 > URL: https://issues.apache.org/jira/browse/SPARK-42506 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42506) Fix Sort's maxRowsPerPartition if maxRows does not exist
Yuming Wang created SPARK-42506: --- Summary: Fix Sort's maxRowsPerPartition if maxRows does not exist Key: SPARK-42506 URL: https://issues.apache.org/jira/browse/SPARK-42506 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: Yuming Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42505) Apply entrypoint template change to 3.3.0/3.3.1
Yikun Jiang created SPARK-42505: --- Summary: Apply entrypoint template change to 3.3.0/3.3.1 Key: SPARK-42505 URL: https://issues.apache.org/jira/browse/SPARK-42505 Project: Spark Issue Type: Sub-task Components: Spark Docker Affects Versions: 3.5.0 Reporter: Yikun Jiang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42494) Add official image Dockerfile for Spark v3.3.2
[ https://issues.apache.org/jira/browse/SPARK-42494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yikun Jiang resolved SPARK-42494. - Resolution: Fixed Resolved by https://github.com/apache/spark-docker/pull/30 > Add official image Dockerfile for Spark v3.3.2 > -- > > Key: SPARK-42494 > URL: https://issues.apache.org/jira/browse/SPARK-42494 > Project: Spark > Issue Type: Sub-task > Components: Spark Docker >Affects Versions: 3.3.2 >Reporter: Yikun Jiang >Assignee: Yikun Jiang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42494) Add official image Dockerfile for Spark v3.3.2
[ https://issues.apache.org/jira/browse/SPARK-42494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yikun Jiang reassigned SPARK-42494: --- Assignee: Yikun Jiang > Add official image Dockerfile for Spark v3.3.2 > -- > > Key: SPARK-42494 > URL: https://issues.apache.org/jira/browse/SPARK-42494 > Project: Spark > Issue Type: Sub-task > Components: Spark Docker >Affects Versions: 3.3.2 >Reporter: Yikun Jiang >Assignee: Yikun Jiang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40278) Used databricks spark-sql-perf with Spark 3.3 to run 3TB TPCDS test failed
[ https://issues.apache.org/jira/browse/SPARK-40278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691395#comment-17691395 ] Yang Jie commented on SPARK-40278: -- The SQL did not fail; the UI may break. [~ulysses] explained this at [https://github.com/apache/spark/pull/35149#issuecomment-1231712806] and I believe he tried to fix the issue, but I'm not sure whether it has been fixed > Used databricks spark-sql-perf with Spark 3.3 to run 3TB TPCDS test failed > -- > > Key: SPARK-40278 > URL: https://issues.apache.org/jira/browse/SPARK-40278 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yang Jie >Priority: Major > > I used databricks spark-sql-perf + Spark 3.3 to run 3TB TPCDS q24a or q24b; > the test code is as follows: > {code:java} > val rootDir = "hdfs://${clusterName}/tpcds-data/POCGenData3T" > val databaseName = "tpcds_database" > val scaleFactor = "3072" > val format = "parquet" > import com.databricks.spark.sql.perf.tpcds.TPCDSTables > val tables = new TPCDSTables( > spark.sqlContext, dsdgenDir = "./tpcds-kit/tools", > scaleFactor = scaleFactor, > useDoubleForDecimal = false, useStringForDate = false) > spark.sql(s"create database $databaseName") > tables.createTemporaryTables(rootDir, format) > spark.sql(s"use $databaseName") // TPCDS 24a or 24b > val result = spark.sql(""" with ssales as > (select c_last_name, c_first_name, s_store_name, ca_state, s_state, i_color, > i_current_price, i_manager_id, i_units, i_size, sum(ss_net_paid) > netpaid > from store_sales, store_returns, store, item, customer, customer_address > where ss_ticket_number = sr_ticket_number > and ss_item_sk = sr_item_sk > and ss_customer_sk = c_customer_sk > and ss_item_sk = i_item_sk > and ss_store_sk = s_store_sk > and c_birth_country = upper(ca_country) > and s_zip = ca_zip > and s_market_id = 8 > group by c_last_name, c_first_name, s_store_name, ca_state, s_state, i_color, > i_current_price, i_manager_id, i_units, i_size) > select c_last_name, c_first_name, s_store_name, sum(netpaid) paid > from ssales > where i_color = 'pale' > group by c_last_name, c_first_name, s_store_name > having sum(netpaid) > (select 0.05*avg(netpaid) from ssales)""").collect() > sc.stop() {code} > The above test may fail due to `Stage cancelled because SparkContext was > shut down` on stage 31 and stage 36 when AQE is enabled, as follows: > > !image-2022-08-30-21-09-48-763.png! > !image-2022-08-30-21-10-24-862.png! > !image-2022-08-30-21-10-57-128.png! > > The DAG corresponding to the SQL is as follows: > !image-2022-08-30-21-11-50-895.png!
> The details are as follows: > > > {code:java} > == Physical Plan == > AdaptiveSparkPlan (42) > +- == Final Plan == >LocalTableScan (1) > +- == Initial Plan == >Filter (41) >+- HashAggregate (40) > +- Exchange (39) > +- HashAggregate (38) > +- HashAggregate (37) >+- Exchange (36) > +- HashAggregate (35) > +- Project (34) > +- BroadcastHashJoin Inner BuildRight (33) >:- Project (29) >: +- BroadcastHashJoin Inner BuildRight (28) >: :- Project (24) >: : +- BroadcastHashJoin Inner BuildRight (23) >: : :- Project (19) >: : : +- BroadcastHashJoin Inner > BuildRight (18) >: : : :- Project (13) >: : : : +- SortMergeJoin Inner (12) >: : : : :- Sort (6) >: : : : : +- Exchange (5) >: : : : : +- Project (4) >: : : : :+- Filter (3) >: : : : : +- Scan > parquet (2) >: : : : +- Sort (11) >: : : :+- Exchange (10) >: : : : +- Project (9) >: : : : +- Filter (8) >: : : : +- Scan > parquet (7) >: : : +- BroadcastExchange (17) >: : :+- Project (16) >: : : +- Filter (15) >: : : +- Scan parquet (14) >: :
[jira] [Comment Edited] (SPARK-40278) Used databricks spark-sql-perf with Spark 3.3 to run 3TB TPCDS test failed
[ https://issues.apache.org/jira/browse/SPARK-40278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691395#comment-17691395 ] Yang Jie edited comment on SPARK-40278 at 2/21/23 5:49 AM: --- The SQL did not fail; the UI may break. [~ulysses] explained this at [https://github.com/apache/spark/pull/35149#issuecomment-1231712806] and I believe he tried to fix the issue, but I'm not sure whether it has been fixed was (Author: luciferyang): The SQL did not fail; the UI may break. [~ulysses] explained this at [https://github.com/apache/spark/pull/35149#issuecomment-1231712806] and I believe he tried to fix the issue, but I'm not sure whether it has been fixed > Used databricks spark-sql-perf with Spark 3.3 to run 3TB TPCDS test failed > -- > > Key: SPARK-40278 > URL: https://issues.apache.org/jira/browse/SPARK-40278 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yang Jie >Priority: Major > > I used databricks spark-sql-perf + Spark 3.3 to run 3TB TPCDS q24a or q24b; > the test code is as follows: > {code:java} > val rootDir = "hdfs://${clusterName}/tpcds-data/POCGenData3T" > val databaseName = "tpcds_database" > val scaleFactor = "3072" > val format = "parquet" > import com.databricks.spark.sql.perf.tpcds.TPCDSTables > val tables = new TPCDSTables( > spark.sqlContext, dsdgenDir = "./tpcds-kit/tools", > scaleFactor = scaleFactor, > useDoubleForDecimal = false, useStringForDate = false) > spark.sql(s"create database $databaseName") > tables.createTemporaryTables(rootDir, format) > spark.sql(s"use $databaseName") // TPCDS 24a or 24b > val result = spark.sql(""" with ssales as > (select c_last_name, c_first_name, s_store_name, ca_state, s_state, i_color, > i_current_price, i_manager_id, i_units, i_size, sum(ss_net_paid) > netpaid > from store_sales, store_returns, store, item, customer, customer_address > where ss_ticket_number = sr_ticket_number > and ss_item_sk = sr_item_sk > and ss_customer_sk = c_customer_sk > and ss_item_sk = i_item_sk > and ss_store_sk = s_store_sk > and c_birth_country = upper(ca_country) > and s_zip = ca_zip > and s_market_id = 8 > group by c_last_name, c_first_name, s_store_name, ca_state, s_state, i_color, > i_current_price, i_manager_id, i_units, i_size) > select c_last_name, c_first_name, s_store_name, sum(netpaid) paid > from ssales > where i_color = 'pale' > group by c_last_name, c_first_name, s_store_name > having sum(netpaid) > (select 0.05*avg(netpaid) from ssales)""").collect() > sc.stop() {code} > The above test may fail due to `Stage cancelled because SparkContext was > shut down` on stage 31 and stage 36 when AQE is enabled, as follows: > > !image-2022-08-30-21-09-48-763.png! > !image-2022-08-30-21-10-24-862.png! > !image-2022-08-30-21-10-57-128.png! > > The DAG corresponding to the SQL is as follows: > !image-2022-08-30-21-11-50-895.png!
> The details are as follows: > > > {code:java} > == Physical Plan == > AdaptiveSparkPlan (42) > +- == Final Plan == >LocalTableScan (1) > +- == Initial Plan == >Filter (41) >+- HashAggregate (40) > +- Exchange (39) > +- HashAggregate (38) > +- HashAggregate (37) >+- Exchange (36) > +- HashAggregate (35) > +- Project (34) > +- BroadcastHashJoin Inner BuildRight (33) >:- Project (29) >: +- BroadcastHashJoin Inner BuildRight (28) >: :- Project (24) >: : +- BroadcastHashJoin Inner BuildRight (23) >: : :- Project (19) >: : : +- BroadcastHashJoin Inner > BuildRight (18) >: : : :- Project (13) >: : : : +- SortMergeJoin Inner (12) >: : : : :- Sort (6) >: : : : : +- Exchange (5) >: : : : : +- Project (4) >: : : : :+- Filter (3) >: : : : : +- Scan > parquet (2) >: : : : +- Sort (11) >: : : :+- Exchange (10) >: : : : +- Project (9) >: : : : +- Filter (8) >: : : : +- Scan > parquet (7) >
[jira] [Commented] (SPARK-42503) Spark SQL should do further validation on join condition fields
[ https://issues.apache.org/jira/browse/SPARK-42503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691391#comment-17691391 ] ming95 commented on SPARK-42503: -- I tested Hive and MySQL; they also do not have this validation. But I still think this restriction should be added, because a join condition that references neither the left nor the right table is meaningless. > Spark SQL should do further validation on join condition fields > --- > > Key: SPARK-42503 > URL: https://issues.apache.org/jira/browse/SPARK-42503 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.2 >Reporter: ming95 >Priority: Major > > In Spark SQL, a join condition is allowed to use fields that come from > neither the left table nor the right table. In this case, the join > degenerates into a cross join. > Suppose you have two tables, test1 and test2, which have the same table > schema: > {code:java} > CREATE TABLE `default`.`test1` ( > `id` INT, > `name` STRING, > `age` INT, > `dt` STRING) > USING parquet > PARTITIONED BY (dt) {code} > The following SQL joins three tables, but in the last left join the > condition is `t1.name=t2.name` and t3.name is not used, so the last left > join becomes a cross join. > {code:java} > select * > from > (select * from test1 where dt="20230215" and age=1 ) t1 > left join > (select * from test1 where dt=="20230215" and age=2) t2 > on t1.name=t2.name > left join > (select * from test2 where dt="20230215") t3 > on > t1.name=t2.name; {code} > So I think Spark SQL should do further validation on the join condition: the > fields of the join condition must come from the left table or the right > table; otherwise an `AnalysisException` should be thrown. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
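To make the degenerate case easy to see, a minimal sketch (reusing the tables from the description above, which are assumed to already exist) is to run the reported query and inspect the physical plan:

{code:scala}
// The last join's condition references only t1 and t2, so it does not
// constrain t3 at all; the plan is expected to show a
// BroadcastNestedLoopJoin (or CartesianProduct) for the t3 join.
spark.sql("""
  SELECT *
  FROM (SELECT * FROM test1 WHERE dt = '20230215' AND age = 1) t1
  LEFT JOIN (SELECT * FROM test1 WHERE dt = '20230215' AND age = 2) t2
    ON t1.name = t2.name
  LEFT JOIN (SELECT * FROM test2 WHERE dt = '20230215') t3
    ON t1.name = t2.name
""").explain()
{code}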
[jira] [Commented] (SPARK-40278) Used databricks spark-sql-perf with Spark 3.3 to run 3TB TPCDS test failed
[ https://issues.apache.org/jira/browse/SPARK-40278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691388#comment-17691388 ] Yuming Wang commented on SPARK-40278: - [~LuciferYang] Does this issue still exist? > Used databricks spark-sql-perf with Spark 3.3 to run 3TB TPCDS test failed > -- > > Key: SPARK-40278 > URL: https://issues.apache.org/jira/browse/SPARK-40278 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yang Jie >Priority: Major > > I used databricks spark-sql-perf + Spark 3.3 to run 3TB TPCDS q24a or q24b; > the test code is as follows: > {code:java} > val rootDir = "hdfs://${clusterName}/tpcds-data/POCGenData3T" > val databaseName = "tpcds_database" > val scaleFactor = "3072" > val format = "parquet" > import com.databricks.spark.sql.perf.tpcds.TPCDSTables > val tables = new TPCDSTables( > spark.sqlContext, dsdgenDir = "./tpcds-kit/tools", > scaleFactor = scaleFactor, > useDoubleForDecimal = false, useStringForDate = false) > spark.sql(s"create database $databaseName") > tables.createTemporaryTables(rootDir, format) > spark.sql(s"use $databaseName") // TPCDS 24a or 24b > val result = spark.sql(""" with ssales as > (select c_last_name, c_first_name, s_store_name, ca_state, s_state, i_color, > i_current_price, i_manager_id, i_units, i_size, sum(ss_net_paid) > netpaid > from store_sales, store_returns, store, item, customer, customer_address > where ss_ticket_number = sr_ticket_number > and ss_item_sk = sr_item_sk > and ss_customer_sk = c_customer_sk > and ss_item_sk = i_item_sk > and ss_store_sk = s_store_sk > and c_birth_country = upper(ca_country) > and s_zip = ca_zip > and s_market_id = 8 > group by c_last_name, c_first_name, s_store_name, ca_state, s_state, i_color, > i_current_price, i_manager_id, i_units, i_size) > select c_last_name, c_first_name, s_store_name, sum(netpaid) paid > from ssales > where i_color = 'pale' > group by c_last_name, c_first_name, s_store_name > having sum(netpaid) > (select 0.05*avg(netpaid) from ssales)""").collect() > sc.stop() {code} > The above test may fail due to `Stage cancelled because SparkContext was > shut down` on stage 31 and stage 36 when AQE is enabled, as follows: > > !image-2022-08-30-21-09-48-763.png! > !image-2022-08-30-21-10-24-862.png! > !image-2022-08-30-21-10-57-128.png! > > The DAG corresponding to the SQL is as follows: > !image-2022-08-30-21-11-50-895.png! > The details are as follows: > > > {code:java} > == Physical Plan == > AdaptiveSparkPlan (42) > +- == Final Plan == >LocalTableScan (1) > +- == Initial Plan == >Filter (41) >+- HashAggregate (40) > +- Exchange (39) > +- HashAggregate (38) > +- HashAggregate (37) >+- Exchange (36) > +- HashAggregate (35) > +- Project (34) > +- BroadcastHashJoin Inner BuildRight (33) >:- Project (29) >: +- BroadcastHashJoin Inner BuildRight (28) >: :- Project (24) >: : +- BroadcastHashJoin Inner BuildRight (23) >: : :- Project (19) >: : : +- BroadcastHashJoin Inner > BuildRight (18) >: : : :- Project (13) >: : : : +- SortMergeJoin Inner (12) >: : : : :- Sort (6) >: : : : : +- Exchange (5) >: : : : : +- Project (4) >: : : : :+- Filter (3) >: : : : : +- Scan > parquet (2) >: : : : +- Sort (11) >: : : :+- Exchange (10) >: : : : +- Project (9) >: : : : +- Filter (8) >: : : : +- Scan > parquet (7) >: : : +- BroadcastExchange (17) >: : :+- Project (16) >: : : +- Filter (15) >: : : +- Scan parquet (14) >: : +- BroadcastExchange (22) >: :+- Filter (21) >: : +- Scan parquet (20) >
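Since the report ties the failure to AQE being enabled, one quick diagnostic (a sketch; the config name is standard, the rest assumes the setup from the description) is to rerun the query with AQE toggled off and compare:

{code:scala}
// Re-running q24a/q24b with adaptive execution disabled helps confirm
// whether the cancelled stages come from AQE re-planning rather than the
// query itself (per the comment above, the SQL succeeds and only the UI
// rendering of the adaptive plan may break).
spark.conf.set("spark.sql.adaptive.enabled", "false")
// ... rerun the query from the description ...
spark.conf.set("spark.sql.adaptive.enabled", "true")
{code}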
[jira] [Commented] (SPARK-40610) Spark falls back to getPartitions instead of getPartitionsByFilter when the date_add function is used in the where clause
[ https://issues.apache.org/jira/browse/SPARK-40610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691387#comment-17691387 ] Yuming Wang commented on SPARK-40610: - [~icyjhl] What's your dt data type? date, string or timestamp? > Spark falls back to getPartitions instead of getPartitionsByFilter when > the date_add function is used in the where clause > --- > > Key: SPARK-40610 > URL: https://issues.apache.org/jira/browse/SPARK-40610 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1 > Environment: edw.tmp_test_metastore_usage_source is a big table with > 1000 partitions and hundreds of columns >Reporter: icyjhl >Priority: Major > Attachments: spark_error.log, spark_sql.sql, sql_in_mysql.sql > > When I run an insert overwrite statement, I get an error saying: > {code:java} > MetaStoreClient lost connection. Attempting to reconnect (1 of 1) after 1s. > listPartitions {code} > It's weird, as I only selected about 3 partitions, so I reran the SQL and > checked the metastore, and found it was fetching all columns in all > partitions: > {code:java} > select "CD_ID", "COMMENT", "COLUMN_NAME", "TYPE_NAME" from "COLUMNS_V2" where > "CD_ID" > in > (675384,675393,675385,675394,675396,675397,675395,675398,675399,675401,675402,675400,675406……){code} > > After testing, I found the problem is with the date_add function in the where > clause: if I remove it, the SQL works fine; otherwise the metastore fetches > all columns in all partitions. > > {code:java} > insert overwrite table test.tmp_test_metastore_usage > SELECT userid > ,SUBSTR(sendtime,1,10) AS creation_date > ,cast(json_bh_esdate_deltadays_max as DECIMAL(38,2)) AS > bh_esdate_deltadays_max > ,json_bh_qiye_industryphyname AS bh_qiye_industryphyname > ,cast(json_bh_esdate_deltadays_min as DECIMAL(38,2)) AS > bh_esdate_deltadays_min > ,cast(json_bh_subconam_min as DECIMAL(38,2)) AS bh_subconam_min > ,cast(json_bh_qiye_regcap_min as DECIMAL(38,2)) AS bh_qiye_regcap_min > ,json_bh_industryphyname AS bh_industryphyname > ,cast(json_bh_subconam_mean as DECIMAL(38,2)) AS bh_subconam_mean > ,cast(json_bh_industryphyname_nunique as DECIMAL(38,2)) AS > bh_industryphyname_nunique > ,cast(current_timestamp() as string) as dw_cre_date > ,cast(current_timestamp() as string) as dw_upd_date > FROM ( > SELECT userid > ,sendtime > ,json_bh_esdate_deltadays_max > ,json_bh_qiye_industryphyname > ,json_bh_esdate_deltadays_min > ,json_bh_subconam_min > ,json_bh_qiye_regcap_min > ,json_bh_industryphyname > ,json_bh_subconam_mean > ,json_bh_industryphyname_nunique > ,row_number() OVER ( > PARTITION BY userid,dt ORDER BY sendtime DESC > ) rn > FROM edw.tmp_test_metastore_usage_source > WHERE dt >= date_add('2022-09-22',-3 ) > AND json_bizid IN ('6101') > AND json_dingid IN ('611') > ) t > WHERE rn = 1 {code} > By the way, 2.4.7 works fine. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
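A common workaround sketch for this class of partition-pruning fallback (assuming `dt` is a string partition column holding 'yyyy-MM-dd' values, which is exactly what the comment above asks about) is to fold the date arithmetic on the driver, so the pushed-down metastore filter only sees a plain literal comparison:

{code:scala}
import java.time.LocalDate

// Compute the boundary on the driver instead of calling date_add() in the
// WHERE clause; a simple `dt >= '<literal>'` comparison is the shape that
// Hive's getPartitionsByFilter can typically push down.
val cutoff = LocalDate.parse("2022-09-22").minusDays(3).toString  // "2022-09-19"

val df = spark.sql(
  s"""
     |SELECT userid, sendtime
     |FROM edw.tmp_test_metastore_usage_source
     |WHERE dt >= '$cutoff'
     |  AND json_bizid IN ('6101')
     |  AND json_dingid IN ('611')
     |""".stripMargin)
{code}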
[jira] [Assigned] (SPARK-42504) NestedColumnAliasing support pruning adjacent projects
[ https://issues.apache.org/jira/browse/SPARK-42504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42504: Assignee: Apache Spark > NestedColumnAliasing support pruning adjacent projects > -- > > Key: SPARK-42504 > URL: https://issues.apache.org/jira/browse/SPARK-42504 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: XiDuo You >Assignee: Apache Spark >Priority: Major > > CollapseProject won't combine adjacent Projects into one, e.g. when a > non-cheap expression from the lower Project is accessed more than once by > the Project above. Then adjacent Project nodes can appear that > NestedColumnAliasing does not support pruning. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42504) NestedColumnAliasing support pruning adjacent projects
[ https://issues.apache.org/jira/browse/SPARK-42504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691354#comment-17691354 ] Apache Spark commented on SPARK-42504: -- User 'ulysses-you' has created a pull request for this issue: https://github.com/apache/spark/pull/40098 > NestedColumnAliasing support pruning adjacent projects > -- > > Key: SPARK-42504 > URL: https://issues.apache.org/jira/browse/SPARK-42504 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: XiDuo You >Priority: Major > > CollapseProject won't combine adjacent Projects into one, e.g. when a > non-cheap expression from the lower Project is accessed more than once by > the Project above. Then adjacent Project nodes can appear that > NestedColumnAliasing does not support pruning. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42504) NestedColumnAliasing support pruning adjacent projects
[ https://issues.apache.org/jira/browse/SPARK-42504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42504: Assignee: (was: Apache Spark) > NestedColumnAliasing support pruning adjacent projects > -- > > Key: SPARK-42504 > URL: https://issues.apache.org/jira/browse/SPARK-42504 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: XiDuo You >Priority: Major > > CollapseProject won't combine adjacent Projects into one, e.g. when a > non-cheap expression from the lower Project is accessed more than once by > the Project above. Then adjacent Project nodes can appear that > NestedColumnAliasing does not support pruning. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42504) NestedColumnAliasing support pruning adjacent projects
[ https://issues.apache.org/jira/browse/SPARK-42504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You updated SPARK-42504: -- Description: CollapseProject won't combine adjacent Projects into one, e.g. when a non-cheap expression from the lower Project is accessed more than once by the Project above. Then adjacent Project nodes can appear that NestedColumnAliasing does not support pruning. was: CollapseProject won't combine adjacent Projects into one, e.g. when a non-cheap expression from the lower Project is accessed more than once by the Project above. > NestedColumnAliasing support pruning adjacent projects > -- > > Key: SPARK-42504 > URL: https://issues.apache.org/jira/browse/SPARK-42504 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: XiDuo You >Priority: Major > > CollapseProject won't combine adjacent Projects into one, e.g. when a > non-cheap expression from the lower Project is accessed more than once by > the Project above. Then adjacent Project nodes can appear that > NestedColumnAliasing does not support pruning. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42504) NestedColumnAliasing support pruning adjacent projects
XiDuo You created SPARK-42504: - Summary: NestedColumnAliasing support pruning adjacent projects Key: SPARK-42504 URL: https://issues.apache.org/jira/browse/SPARK-42504 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.5.0 Reporter: XiDuo You CollapseProject won't combine adjacent Projects into one, e.g. when a non-cheap expression from the lower Project is accessed more than once by the Project above. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
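As an illustration of the plan shape being described, the following sketch (hypothetical data; assumes a running `spark` session) builds a struct column plus a non-cheap expression that is referenced twice, which can leave adjacent Project nodes behind:

{code:scala}
import org.apache.spark.sql.functions._

// A struct column `s` and a non-cheap expression (sha2) referenced twice by
// the upper select: CollapseProject keeps the two Projects separate to avoid
// duplicating the expensive computation.
val df = spark.range(10)
  .select(struct(col("id").as("a"), (col("id") * 2).as("b")).as("s"))

val heavy  = df.select(col("s"), sha2(col("s.a").cast("string"), 256).as("h"))
val result = heavy.select(col("s.b"), col("h"), length(col("h")).as("h_len"))

// Inspect whether adjacent Projects remain and whether only s.a / s.b are
// extracted from the struct (the pruning this issue extends).
result.explain(true)
{code}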
[jira] [Updated] (SPARK-42501) High level design doc for Spark ML
[ https://issues.apache.org/jira/browse/SPARK-42501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-42501: - Parent: SPARK-42471 Issue Type: Sub-task (was: New Feature) > High level design doc for Spark ML > -- > > Key: SPARK-42501 > URL: https://issues.apache.org/jira/browse/SPARK-42501 > Project: Spark > Issue Type: Sub-task > Components: Connect, Documentation, ML >Affects Versions: 3.4.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42412) Initial prototype implementation for PySparkML
[ https://issues.apache.org/jira/browse/SPARK-42412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-42412: - Epic Link: (was: SPARK-39375) > Initial prototype implementation for PySparkML > -- > > Key: SPARK-42412 > URL: https://issues.apache.org/jira/browse/SPARK-42412 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.5.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42412) Initial prototype implementation for PySparkML
[ https://issues.apache.org/jira/browse/SPARK-42412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-42412: - Parent: SPARK-42471 Issue Type: Sub-task (was: New Feature) > Initial prototype implementation for PySparkML > -- > > Key: SPARK-42412 > URL: https://issues.apache.org/jira/browse/SPARK-42412 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.5.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42412) Initial prototype implementation for PySparkML
[ https://issues.apache.org/jira/browse/SPARK-42412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-42412: - Epic Link: SPARK-39375 > Initial prototype implementation for PySparkML > -- > > Key: SPARK-42412 > URL: https://issues.apache.org/jira/browse/SPARK-42412 > Project: Spark > Issue Type: New Feature > Components: Connect, PySpark >Affects Versions: 3.5.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42501) High level design doc for Spark ML
[ https://issues.apache.org/jira/browse/SPARK-42501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-42501: - Parent: (was: SPARK-39375) Issue Type: New Feature (was: Sub-task) > High level design doc for Spark ML > -- > > Key: SPARK-42501 > URL: https://issues.apache.org/jira/browse/SPARK-42501 > Project: Spark > Issue Type: New Feature > Components: Connect, Documentation, ML >Affects Versions: 3.4.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42412) Initial prototype implementation for PySparkML
[ https://issues.apache.org/jira/browse/SPARK-42412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-42412: - Parent: (was: SPARK-39375) Issue Type: New Feature (was: Sub-task) > Initial prototype implementation for PySparkML > -- > > Key: SPARK-42412 > URL: https://issues.apache.org/jira/browse/SPARK-42412 > Project: Spark > Issue Type: New Feature > Components: Connect, PySpark >Affects Versions: 3.5.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42502) scala: accept user_agent in spark connect's connection string
[ https://issues.apache.org/jira/browse/SPARK-42502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-42502: - Fix Version/s: (was: 3.4.0) > scala: accept user_agent in spark connect's connection string > - > > Key: SPARK-42502 > URL: https://issues.apache.org/jira/browse/SPARK-42502 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.3.2 >Reporter: Niranjan Jayakar >Assignee: Niranjan Jayakar >Priority: Major > > Currently, the Spark Connect service's {{client_type}} attribute (which is > really the user agent) is set to {{_SPARK_CONNECT_PYTHON}} to signify PySpark. > Accept an optional {{user_agent}} parameter in the connection string and > plumb this down to the Spark Connect service. > This enables partners using Spark Connect to set their application as the > user agent, which then allows visibility into and measurement of integrations > and usage of Spark Connect. > This is already done for the Python client: > https://github.com/apache/spark/commit/b887d3de954ae5b2482087fe08affcc4ac60c669 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42502) scala: accept user_agent in spark connect's connection string
[ https://issues.apache.org/jira/browse/SPARK-42502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-42502: Assignee: (was: Niranjan Jayakar) > scala: accept user_agent in spark connect's connection string > - > > Key: SPARK-42502 > URL: https://issues.apache.org/jira/browse/SPARK-42502 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.3.2 >Reporter: Niranjan Jayakar >Priority: Major > > Currently, the Spark Connect service's {{client_type}} attribute (which is > really the user agent) is set to {{_SPARK_CONNECT_PYTHON}} to signify PySpark. > Accept an optional {{user_agent}} parameter in the connection string and > plumb this down to the Spark Connect service. > This enables partners using Spark Connect to set their application as the > user agent, which then allows visibility into and measurement of integrations > and usage of Spark Connect. > This is already done for the Python client: > https://github.com/apache/spark/commit/b887d3de954ae5b2482087fe08affcc4ac60c669 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
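For illustration, the connection string could carry the parameter the way the Python client does. The builder call and the exact parameter syntax below are assumptions based on the commit linked above, not a confirmed Scala API:

{code:scala}
import org.apache.spark.sql.SparkSession

// Assumed syntax: connection-string parameters are appended as ;key=value
// pairs, mirroring the Python client; the Scala builder shape may differ.
val spark = SparkSession.builder()
  .remote("sc://localhost:15002/;user_agent=my-partner-app")
  .getOrCreate()
{code}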
[jira] [Commented] (SPARK-42503) Spark SQL should do further validation on join condition fields
[ https://issues.apache.org/jira/browse/SPARK-42503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691321#comment-17691321 ] Yuming Wang commented on SPARK-42503: - Do other databases also have this validation? > Spark SQL should do further validation on join condition fields > --- > > Key: SPARK-42503 > URL: https://issues.apache.org/jira/browse/SPARK-42503 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.2 >Reporter: ming95 >Priority: Major > > In Spark SQL, a join condition is allowed to use fields that come from > neither the left table nor the right table. In this case, the join > degenerates into a cross join. > Suppose you have two tables, test1 and test2, which have the same table > schema: > {code:java} > CREATE TABLE `default`.`test1` ( > `id` INT, > `name` STRING, > `age` INT, > `dt` STRING) > USING parquet > PARTITIONED BY (dt) {code} > The following SQL joins three tables, but in the last left join the > condition is `t1.name=t2.name` and t3.name is not used, so the last left > join becomes a cross join. > {code:java} > select * > from > (select * from test1 where dt="20230215" and age=1 ) t1 > left join > (select * from test1 where dt=="20230215" and age=2) t2 > on t1.name=t2.name > left join > (select * from test2 where dt="20230215") t3 > on > t1.name=t2.name; {code} > So I think Spark SQL should do further validation on the join condition: the > fields of the join condition must come from the left table or the right > table; otherwise an `AnalysisException` should be thrown. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42503) Spark SQL should do further validation on join condition fields
[ https://issues.apache.org/jira/browse/SPARK-42503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-42503: Fix Version/s: (was: 3.4.0) > Spark SQL should do further validation on join condition fields > --- > > Key: SPARK-42503 > URL: https://issues.apache.org/jira/browse/SPARK-42503 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.2 >Reporter: ming95 >Priority: Major > > In Spark SQL, a join condition is allowed to use fields that come from > neither the left table nor the right table. In this case, the join > degenerates into a cross join. > Suppose you have two tables, test1 and test2, which have the same table > schema: > {code:java} > CREATE TABLE `default`.`test1` ( > `id` INT, > `name` STRING, > `age` INT, > `dt` STRING) > USING parquet > PARTITIONED BY (dt) {code} > The following SQL joins three tables, but in the last left join the > condition is `t1.name=t2.name` and t3.name is not used, so the last left > join becomes a cross join. > {code:java} > select * > from > (select * from test1 where dt="20230215" and age=1 ) t1 > left join > (select * from test1 where dt=="20230215" and age=2) t2 > on t1.name=t2.name > left join > (select * from test2 where dt="20230215") t3 > on > t1.name=t2.name; {code} > So I think Spark SQL should do further validation on the join condition: the > fields of the join condition must come from the left table or the right > table; otherwise an `AnalysisException` should be thrown. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42503) Spark SQL should do further validation on join condition fields
[ https://issues.apache.org/jira/browse/SPARK-42503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-42503: Target Version/s: (was: 3.4.0) > Spark SQL should do further validation on join condition fields > --- > > Key: SPARK-42503 > URL: https://issues.apache.org/jira/browse/SPARK-42503 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.2 >Reporter: ming95 >Priority: Major > Fix For: 3.4.0 > > > In Spark SQL, a join condition is allowed to use fields that come from > neither the left table nor the right table. In this case, the join > degenerates into a cross join. > Suppose you have two tables, test1 and test2, which have the same table > schema: > {code:java} > CREATE TABLE `default`.`test1` ( > `id` INT, > `name` STRING, > `age` INT, > `dt` STRING) > USING parquet > PARTITIONED BY (dt) {code} > The following SQL joins three tables, but in the last left join the > condition is `t1.name=t2.name` and t3.name is not used, so the last left > join becomes a cross join. > {code:java} > select * > from > (select * from test1 where dt="20230215" and age=1 ) t1 > left join > (select * from test1 where dt=="20230215" and age=2) t2 > on t1.name=t2.name > left join > (select * from test2 where dt="20230215") t3 > on > t1.name=t2.name; {code} > So I think Spark SQL should do further validation on the join condition: the > fields of the join condition must come from the left table or the right > table; otherwise an `AnalysisException` should be thrown. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41823) DataFrame.join creating ambiguous column names
[ https://issues.apache.org/jira/browse/SPARK-41823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691290#comment-17691290 ] Apache Spark commented on SPARK-41823: -- User 'grundprinzip' has created a pull request for this issue: https://github.com/apache/spark/pull/40094 > DataFrame.join creating ambiguous column names > -- > > Key: SPARK-41823 > URL: https://issues.apache.org/jira/browse/SPARK-41823 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Assignee: Ruifeng Zheng >Priority: Major > > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 254, in pyspark.sql.connect.dataframe.DataFrame.drop > Failed example: > df.join(df2, df.name == df2.name, 'inner').drop('name').show() > Exception raised: > Traceback (most recent call last): > File > "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", > line 1350, in __run > exec(compile(example.source, filename, "single", > File "", line > 1, in > df.join(df2, df.name == df2.name, 'inner').drop('name').show() > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 534, in show > print(self._show_string(n, truncate, vertical)) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 423, in _show_string > ).toPandas() > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 1031, in toPandas > return self._session.client.to_pandas(query) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 413, in to_pandas > return self._execute_and_fetch(req) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 573, in _execute_and_fetch > self._handle_error(rpc_error) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 619, in _handle_error > raise SparkConnectAnalysisException( > pyspark.sql.connect.client.SparkConnectAnalysisException: > [AMBIGUOUS_REFERENCE] Reference `name` is ambiguous, could be: [`name`, > `name`]. > Plan: {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41823) DataFrame.join creating ambiguous column names
[ https://issues.apache.org/jira/browse/SPARK-41823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691288#comment-17691288 ] Apache Spark commented on SPARK-41823: -- User 'grundprinzip' has created a pull request for this issue: https://github.com/apache/spark/pull/40094 > DataFrame.join creating ambiguous column names > -- > > Key: SPARK-41823 > URL: https://issues.apache.org/jira/browse/SPARK-41823 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Assignee: Ruifeng Zheng >Priority: Major > > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 254, in pyspark.sql.connect.dataframe.DataFrame.drop > Failed example: > df.join(df2, df.name == df2.name, 'inner').drop('name').show() > Exception raised: > Traceback (most recent call last): > File > "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", > line 1350, in __run > exec(compile(example.source, filename, "single", > File "", line > 1, in > df.join(df2, df.name == df2.name, 'inner').drop('name').show() > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 534, in show > print(self._show_string(n, truncate, vertical)) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 423, in _show_string > ).toPandas() > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 1031, in toPandas > return self._session.client.to_pandas(query) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 413, in to_pandas > return self._execute_and_fetch(req) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 573, in _execute_and_fetch > self._handle_error(rpc_error) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 619, in _handle_error > raise SparkConnectAnalysisException( > pyspark.sql.connect.client.SparkConnectAnalysisException: > [AMBIGUOUS_REFERENCE] Reference `name` is ambiguous, could be: [`name`, > `name`]. > Plan: {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41823) DataFrame.join creating ambiguous column names
[ https://issues.apache.org/jira/browse/SPARK-41823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691289#comment-17691289 ] Apache Spark commented on SPARK-41823: -- User 'grundprinzip' has created a pull request for this issue: https://github.com/apache/spark/pull/40094 > DataFrame.join creating ambiguous column names > -- > > Key: SPARK-41823 > URL: https://issues.apache.org/jira/browse/SPARK-41823 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Assignee: Ruifeng Zheng >Priority: Major > > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 254, in pyspark.sql.connect.dataframe.DataFrame.drop > Failed example: > df.join(df2, df.name == df2.name, 'inner').drop('name').show() > Exception raised: > Traceback (most recent call last): > File > "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", > line 1350, in __run > exec(compile(example.source, filename, "single", > File "", line > 1, in > df.join(df2, df.name == df2.name, 'inner').drop('name').show() > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 534, in show > print(self._show_string(n, truncate, vertical)) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 423, in _show_string > ).toPandas() > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 1031, in toPandas > return self._session.client.to_pandas(query) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 413, in to_pandas > return self._execute_and_fetch(req) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 573, in _execute_and_fetch > self._handle_error(rpc_error) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 619, in _handle_error > raise SparkConnectAnalysisException( > pyspark.sql.connect.client.SparkConnectAnalysisException: > [AMBIGUOUS_REFERENCE] Reference `name` is ambiguous, could be: [`name`, > `name`]. > Plan: {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41812) DataFrame.join: ambiguous column
[ https://issues.apache.org/jira/browse/SPARK-41812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691287#comment-17691287 ] Apache Spark commented on SPARK-41812: -- User 'grundprinzip' has created a pull request for this issue: https://github.com/apache/spark/pull/40094 > DataFrame.join: ambiguous column > > > Key: SPARK-41812 > URL: https://issues.apache.org/jira/browse/SPARK-41812 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Ruifeng Zheng >Priority: Major > Fix For: 3.4.0 > > > {code} > File "/.../spark/python/pyspark/sql/connect/column.py", line 106, in > pyspark.sql.connect.column.Column.eqNullSafe > Failed example: > df1.join(df2, df1["value"] == df2["value"]).count() > Exception raised: > Traceback (most recent call last): > File "/.../miniconda3/envs/python3.9/lib/python3.9/doctest.py", line > 1336, in __run > exec(compile(example.source, filename, "single", > File "", line > 1, in > df1.join(df2, df1["value"] == df2["value"]).count() > File "/.../spark/python/pyspark/sql/connect/dataframe.py", line 151, in > count > pdd = self.agg(_invoke_function("count", lit(1))).toPandas() > File "/.../spark/python/pyspark/sql/connect/dataframe.py", line 1031, > in toPandas > return self._session.client.to_pandas(query) > File "/.../spark/python/pyspark/sql/connect/client.py", line 413, in > to_pandas > return self._execute_and_fetch(req) > File "/.../spark/python/pyspark/sql/connect/client.py", line 573, in > _execute_and_fetch > self._handle_error(rpc_error) > File "/.../spark/python/pyspark/sql/connect/client.py", line 619, in > _handle_error > raise SparkConnectAnalysisException( > pyspark.sql.connect.client.SparkConnectAnalysisException: > [AMBIGUOUS_REFERENCE] Reference `value` is ambiguous, could be: [`value`, > `value`]. > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
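A minimal sketch of the usual ways to sidestep the AMBIGUOUS_REFERENCE error reported above (hypothetical data; these are standard DataFrame operations, shown here in Scala):

{code:scala}
import org.apache.spark.sql.functions._

val df1 = spark.range(5).withColumnRenamed("id", "value")
val df2 = spark.range(5).withColumnRenamed("id", "value")

// Option 1: a USING-style join on the column name keeps a single,
// unambiguous `value` column in the output.
val joined = df1.join(df2, Seq("value"))

// Option 2: alias each side and qualify references explicitly.
val a = df1.as("a")
val b = df2.as("b")
val joined2 = a.join(b, col("a.value") === col("b.value"))
  .select(col("a.value"))
{code}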
[jira] [Resolved] (SPARK-41952) Upgrade Parquet to fix off-heap memory leaks in Zstd codec
[ https://issues.apache.org/jira/browse/SPARK-41952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-41952. -- Fix Version/s: 3.2.4 3.4.0 3.3.3 Resolution: Fixed > Upgrade Parquet to fix off-heap memory leaks in Zstd codec > -- > > Key: SPARK-41952 > URL: https://issues.apache.org/jira/browse/SPARK-41952 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 3.1.3, 3.3.1, 3.2.3 >Reporter: Alexey Kudinkin >Assignee: Cheng Pan >Priority: Critical > Fix For: 3.2.4, 3.4.0, 3.3.3 > > > Recently, a native memory leak has been discovered in Parquet in conjunction > with its use of the Zstd decompressor from the luben/zstd-jni library (PARQUET-2160). > This is very problematic, to the point where we can't use Parquet with Zstd due to > pervasive OOMs taking down our executors and disrupting our jobs. > Luckily, a fix addressing this has already landed in Parquet: > [https://github.com/apache/parquet-mr/pull/982] > > Now, we just need to ensure that > # an updated version of Parquet is released in a timely manner > # Spark is upgraded onto this new version in the upcoming release > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
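[Editor's note] Until a Spark release with the upgraded Parquet is available, one possible stopgap (an assumption on our part, not a recommendation from the ticket) is to write Parquet with a codec other than Zstd, so the leaky zstd-jni decompression path is not exercised when the data is read back; existing Zstd-compressed files are still affected on read.

{code:python}
# Stopgap sketch: avoid the Zstd codec for Parquet output. 'snappy' is
# Spark's default Parquet codec; it is set explicitly here for clarity.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.parquet.compression.codec", "snappy")

spark.range(1000).write.mode("overwrite").parquet("/tmp/no_zstd_example")
{code}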
[jira] [Assigned] (SPARK-41952) Upgrade Parquet to fix off-heap memory leaks in Zstd codec
[ https://issues.apache.org/jira/browse/SPARK-41952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-41952: Assignee: Cheng Pan > Upgrade Parquet to fix off-heap memory leaks in Zstd codec > -- > > Key: SPARK-41952 > URL: https://issues.apache.org/jira/browse/SPARK-41952 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 3.1.3, 3.3.1, 3.2.3 >Reporter: Alexey Kudinkin >Assignee: Cheng Pan >Priority: Critical > > Recently, a native memory leak has been discovered in Parquet in conjunction > with its use of the Zstd decompressor from the luben/zstd-jni library (PARQUET-2160). > This is very problematic, to the point where we can't use Parquet with Zstd due to > pervasive OOMs taking down our executors and disrupting our jobs. > Luckily, a fix addressing this has already landed in Parquet: > [https://github.com/apache/parquet-mr/pull/982] > > Now, we just need to ensure that > # an updated version of Parquet is released in a timely manner > # Spark is upgraded onto this new version in the upcoming release > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42467) Spark Connect Scala Client: GroupBy and Aggregation
[ https://issues.apache.org/jira/browse/SPARK-42467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691264#comment-17691264 ] Rui Wang commented on SPARK-42467: -- Yes, we are going to need to support cube/rollup/grouping sets along with the other necessary bits in Aggregation. > Spark Connect Scala Client: GroupBy and Aggregation > --- > > Key: SPARK-42467 > URL: https://issues.apache.org/jira/browse/SPARK-42467 > Project: Spark > Issue Type: Task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
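[Editor's note] For reference, this is the API surface being ported; a short sketch of the equivalent calls on the existing PySpark API, which the Scala client is expected to mirror (the data and column names are made up).

{code:python}
# groupBy/rollup/cube aggregations on the classic PySpark API.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("US", "web", 10), ("US", "mobile", 20), ("DE", "web", 5)],
    ["country", "channel", "sales"],
)

df.groupBy("country").agg(F.sum("sales")).show()
# rollup aggregates (country, channel), (country), and the grand total
df.rollup("country", "channel").agg(F.sum("sales")).show()
# cube aggregates every combination of the grouping columns
df.cube("country", "channel").agg(F.sum("sales")).show()
{code}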
[jira] [Updated] (SPARK-42503) Spark SQL should do further validation on join condition fields
[ https://issues.apache.org/jira/browse/SPARK-42503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ming95 updated SPARK-42503: --- Description: In Spark SQL, a join condition is allowed to use fields that come from neither the left nor the right table of the join. In this case, the join will degenerate into a cross join. Suppose you have two tables, test1 and test2, which have the same table schema: {code:java} CREATE TABLE `default`.`test1` ( `id` INT, `name` STRING, `age` INT, `dt` STRING) USING parquet PARTITIONED BY (dt) {code} The following SQL has three joins, but in the last left join the condition is `t1.name=t2.name`, and t3.name is not used. So the last left join will be a cross join. {code:java} select * from (select * from test1 where dt="20230215" and age=1 ) t1 left join (select * from test1 where dt="20230215" and age=2) t2 on t1.name=t2.name left join (select * from test2 where dt="20230215") t3 on t1.name=t2.name; {code} So I think Spark SQL should do further validation on the join condition: the fields used in a join condition must come from the join's left or right table, otherwise an `AnalysisException` should be thrown. was: In Spark SQL, a join condition is allowed to use fields that come from neither the left nor the right table of the join. In this case, the join will degenerate into a cross join. Suppose you have two tables, test1 and test2, which have the same table schema: ``` CREATE TABLE `default`.`test1` ( `id` INT, `name` STRING, `age` INT, `dt` STRING) USING parquet PARTITIONED BY (dt) ``` The following SQL has three joins, but in the last left join the condition is `t1.name=t2.name`, and t3.name is not used. So the last left join will be a cross join. ``` select * from (select * from test1 where dt="20230215" and age=1 ) t1 left join (select * from test1 where dt="20230215" and age=2) t2 on t1.name=t2.name left join (select * from test2 where dt="20230215") t3 on t1.name=t2.name; ``` So I think Spark SQL should do further validation on the join condition: the fields used in a join condition must come from the join's left or right table, otherwise an `AnalysisException` should be thrown. > Spark SQL should do further validation on join condition fields > --- > > Key: SPARK-42503 > URL: https://issues.apache.org/jira/browse/SPARK-42503 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.2 >Reporter: ming95 >Priority: Major > Fix For: 3.4.0 > > > In Spark SQL, a join condition is allowed to use fields that come from > neither the left nor the right table of the join. In this case, the join > will degenerate into a cross join. > Suppose you have two tables, test1 and test2, which have the same table > schema: > {code:java} > CREATE TABLE `default`.`test1` ( > `id` INT, > `name` STRING, > `age` INT, > `dt` STRING) > USING parquet > PARTITIONED BY (dt) > {code} > The following SQL has three joins, but in the last left join the condition > is `t1.name=t2.name`, and t3.name is not used. So the last left join will be > a cross join. > {code:java} > select * > from > (select * from test1 where dt="20230215" and age=1 ) t1 > left join > (select * from test1 where dt="20230215" and age=2) t2 > on t1.name=t2.name > left join > (select * from test2 where dt="20230215") t3 > on > t1.name=t2.name; > {code} > So I think Spark SQL should do further validation on the join condition: the > fields used in a join condition must come from the join's left or right > table, otherwise an `AnalysisException` should be thrown.
> -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
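[Editor's note] Assuming the last join was meant to relate t1 and t3 (our reading of the example; the ticket does not spell out the intended query), the non-degenerate form would reference t3 in the final condition.

{code:python}
# Presumably intended query: the final condition references t3, so the last
# left join no longer degenerates into a cross join. Assumes the ticket's
# test1/test2 tables exist and a SparkSession named `spark` is available.
fixed = spark.sql("""
    select *
    from (select * from test1 where dt = '20230215' and age = 1) t1
    left join (select * from test1 where dt = '20230215' and age = 2) t2
      on t1.name = t2.name
    left join (select * from test2 where dt = '20230215') t3
      on t1.name = t3.name
""")
fixed.explain()  # the plan should no longer contain a cartesian product
{code}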
[jira] [Commented] (SPARK-42503) Spark SQL should do further validation on join condition fields
[ https://issues.apache.org/jira/browse/SPARK-42503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691235#comment-17691235 ] ming95 commented on SPARK-42503: [~yumwang] [~gurwls223] cc > Spark SQL should do further validation on join condition fields > --- > > Key: SPARK-42503 > URL: https://issues.apache.org/jira/browse/SPARK-42503 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.2 >Reporter: ming95 >Priority: Major > Fix For: 3.4.0 > > > In Spark SQL, a join condition is allowed to use fields that come from > neither the left nor the right table of the join. In this case, the join > will degenerate into a cross join. > Suppose you have two tables, test1 and test2, which have the same table > schema: > ``` > CREATE TABLE `default`.`test1` ( > `id` INT, > `name` STRING, > `age` INT, > `dt` STRING) > USING parquet > PARTITIONED BY (dt) > ``` > The following SQL has three joins, but in the last left join the condition > is `t1.name=t2.name`, and t3.name is not used. So the last left join will be > a cross join. > ``` > select * > from > (select * from test1 where dt="20230215" and age=1 ) t1 > left join > (select * from test1 where dt="20230215" and age=2) t2 > on t1.name=t2.name > left join > (select * from test2 where dt="20230215") t3 > on > t1.name=t2.name; > ``` > So I think Spark SQL should do further validation on the join condition: the > fields used in a join condition must come from the join's left or right > table, otherwise an `AnalysisException` should be thrown. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42503) Spark SQL should do further validation on join condition fields
ming95 created SPARK-42503: -- Summary: Spark SQL should do further validation on join condition fields Key: SPARK-42503 URL: https://issues.apache.org/jira/browse/SPARK-42503 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.2 Reporter: ming95 Fix For: 3.4.0 In Spark SQL, a join condition is allowed to use fields that come from neither the left nor the right table of the join. In this case, the join will degenerate into a cross join. Suppose you have two tables, test1 and test2, which have the same table schema: ``` CREATE TABLE `default`.`test1` ( `id` INT, `name` STRING, `age` INT, `dt` STRING) USING parquet PARTITIONED BY (dt) ``` The following SQL has three joins, but in the last left join the condition is `t1.name=t2.name`, and t3.name is not used. So the last left join will be a cross join. ``` select * from (select * from test1 where dt="20230215" and age=1 ) t1 left join (select * from test1 where dt="20230215" and age=2) t2 on t1.name=t2.name left join (select * from test2 where dt="20230215") t3 on t1.name=t2.name; ``` So I think Spark SQL should do further validation on the join condition: the fields used in a join condition must come from the join's left or right table, otherwise an `AnalysisException` should be thrown. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42502) scala: accept user_agent in spark connect's connection string
[ https://issues.apache.org/jira/browse/SPARK-42502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niranjan Jayakar updated SPARK-42502: - Description: Currently, the Spark Connect service's {{client_type}} attribute (which is really the user agent) is set to {{_SPARK_CONNECT_PYTHON}} to signify PySpark. Accept an optional {{user_agent}} parameter in the connection string and plumb this down to the Spark Connect service. This enables partners using Spark Connect to set their application as the user agent, which then allows visibility into and measurement of integrations and usage of Spark Connect. This is already done for the Python client: https://github.com/apache/spark/commit/b887d3de954ae5b2482087fe08affcc4ac60c669 was: Currently, the Spark Connect service's {{client_type}} attribute (which is really the user agent) is set to {{_SPARK_CONNECT_PYTHON}} to signify PySpark. Accept an optional {{user_agent}} parameter in the connection string and plumb this down to the Spark Connect service. This enables partners using Spark Connect to set their application as the user agent, which then allows visibility into and measurement of integrations and usage of Spark Connect. > scala: accept user_agent in spark connect's connection string > - > > Key: SPARK-42502 > URL: https://issues.apache.org/jira/browse/SPARK-42502 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.3.2 >Reporter: Niranjan Jayakar >Assignee: Niranjan Jayakar >Priority: Major > Fix For: 3.4.0 > > > Currently, the Spark Connect service's {{client_type}} attribute (which is > really the user agent) is set to {{_SPARK_CONNECT_PYTHON}} to signify PySpark. > Accept an optional {{user_agent}} parameter in the connection string and > plumb this down to the Spark Connect service. > This enables partners using Spark Connect to set their application as the > user agent, which then allows visibility into and measurement of integrations > and usage of Spark Connect. > This is already done for the Python client: > https://github.com/apache/spark/commit/b887d3de954ae5b2482087fe08affcc4ac60c669 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42502) scala: accept user_agent in spark connect's connection string
Niranjan Jayakar created SPARK-42502: Summary: scala: accept user_agent in spark connect's connection string Key: SPARK-42502 URL: https://issues.apache.org/jira/browse/SPARK-42502 Project: Spark Issue Type: New Feature Components: Connect Affects Versions: 3.3.2 Reporter: Niranjan Jayakar Assignee: Niranjan Jayakar Fix For: 3.4.0 Currently, the Spark Connect service's {{client_type}} attribute (which is really the user agent) is set to {{_SPARK_CONNECT_PYTHON}} to signify PySpark. Accept an optional {{user_agent}} parameter in the connection string and plumb this down to the Spark Connect service. This enables partners using Spark Connect to set their application as the user agent, which then allows visibility into and measurement of integrations and usage of Spark Connect. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42477) python: accept user_agent in spark connect's connection string
[ https://issues.apache.org/jira/browse/SPARK-42477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niranjan Jayakar updated SPARK-42477: - Summary: python: accept user_agent in spark connect's connection string (was: accept user_agent in spark connect's connection string) > python: accept user_agent in spark connect's connection string > --- > > Key: SPARK-42477 > URL: https://issues.apache.org/jira/browse/SPARK-42477 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.3.2 >Reporter: Niranjan Jayakar >Assignee: Niranjan Jayakar >Priority: Major > Fix For: 3.4.0 > > > Currently, the Spark Connect service's {{client_type}} attribute (which is > really the user agent) is set to {{_SPARK_CONNECT_PYTHON}} to signify PySpark. > Accept an optional {{user_agent}} parameter in the connection string and > plumb this down to the Spark Connect service. > This enables partners using Spark Connect to set their application as the > user agent, which then allows visibility into and measurement of integrations > and usage of Spark Connect. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42498) reduce spark connect service retry time
[ https://issues.apache.org/jira/browse/SPARK-42498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niranjan Jayakar resolved SPARK-42498. -- Resolution: Abandoned > reduce spark connect service retry time > --- > > Key: SPARK-42498 > URL: https://issues.apache.org/jira/browse/SPARK-42498 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.3.2 >Reporter: Niranjan Jayakar >Priority: Major > > https://github.com/apache/spark/blob/5fc44dabe5084fb784f064afe691951a3c270793/python/pyspark/sql/connect/client.py#L411 > > Currently, 15 retries with the current backoff strategy result in the client > sitting in the retry loop for ~400 seconds in the worst case. This means > applications and users using the Spark Connect client will hang for >6 > minutes with no response. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
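[Editor's note] To make the ~400-second figure concrete, here is a tiny sketch of how capped exponential backoff accumulates over 15 retries; the constants are illustrative stand-ins, not the client's actual configuration.

{code:python}
# Worst-case cumulative wait for capped exponential backoff (illustrative
# parameters only; the real client defines its own policy).
def worst_case_wait(retries: int, first_backoff_s: float,
                    multiplier: float, max_backoff_s: float) -> float:
    total, backoff = 0.0, first_backoff_s
    for _ in range(retries):
        total += min(backoff, max_backoff_s)
        backoff *= multiplier
    return total

# 15 retries, 50 ms initial backoff, 4x growth, 60 s cap per attempt:
print(worst_case_wait(15, 0.05, 4.0, 60.0))  # hundreds of seconds
{code}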
[jira] [Updated] (SPARK-42498) reduce spark connect service retry time
[ https://issues.apache.org/jira/browse/SPARK-42498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niranjan Jayakar updated SPARK-42498: - Summary: reduce spark connect service retry time (was: make spark connect retries configurable) > reduce spark connect service retry time > --- > > Key: SPARK-42498 > URL: https://issues.apache.org/jira/browse/SPARK-42498 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.3.2 >Reporter: Niranjan Jayakar >Priority: Major > > https://github.com/apache/spark/blob/5fc44dabe5084fb784f064afe691951a3c270793/python/pyspark/sql/connect/client.py#L411 > > Currently, 15 retries with the current backoff strategy result in the client > sitting in the retry loop for ~400 seconds in the worst case. This means > applications and users using the Spark Connect client will hang for >6 > minutes with no response. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42498) make spark connect retries configurable
[ https://issues.apache.org/jira/browse/SPARK-42498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niranjan Jayakar updated SPARK-42498: - Summary: make spark connect retries configurable (was: reduce spark connect service retry time) > make spark connect retries configurable > - > > Key: SPARK-42498 > URL: https://issues.apache.org/jira/browse/SPARK-42498 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.3.2 >Reporter: Niranjan Jayakar >Priority: Major > > https://github.com/apache/spark/blob/5fc44dabe5084fb784f064afe691951a3c270793/python/pyspark/sql/connect/client.py#L411 > > Currently, 15 retries with the current backoff strategy result in the client > sitting in the retry loop for ~400 seconds in the worst case. This means > applications and users using the Spark Connect client will hang for >6 > minutes with no response. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42423) Add metadata column file block start and length
[ https://issues.apache.org/jira/browse/SPARK-42423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-42423: --- Assignee: XiDuo You > Add metadata column file block start and length > --- > > Key: SPARK-42423 > URL: https://issues.apache.org/jira/browse/SPARK-42423 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42423) Add metadata column file block start and length
[ https://issues.apache.org/jira/browse/SPARK-42423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-42423. - Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39996 [https://github.com/apache/spark/pull/39996] > Add metadata column file block start and length > --- > > Key: SPARK-42423 > URL: https://issues.apache.org/jira/browse/SPARK-42423 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: XiDuo You >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
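[Editor's note] For context, these fields extend the hidden {{_metadata}} struct that file-based sources already expose; a usage sketch (field names inferred from the issue title, path is a placeholder).

{code:python}
# Selecting per-file metadata, including the new block start/length fields.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/tmp/some_table")  # hypothetical path

df.select(
    "_metadata.file_path",
    "_metadata.file_block_start",   # added by SPARK-42423
    "_metadata.file_block_length",  # added by SPARK-42423
).show(truncate=False)
{code}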
[jira] [Assigned] (SPARK-42476) Spark Connect API reference.
[ https://issues.apache.org/jira/browse/SPARK-42476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-42476: Assignee: Haejoon Lee > Spark Connect API reference. > > > Key: SPARK-42476 > URL: https://issues.apache.org/jira/browse/SPARK-42476 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > > We need an API documents for Spark Connect such as other components. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42476) Spark Connect API reference.
[ https://issues.apache.org/jira/browse/SPARK-42476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-42476. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 40067 [https://github.com/apache/spark/pull/40067] > Spark Connect API reference. > > > Key: SPARK-42476 > URL: https://issues.apache.org/jira/browse/SPARK-42476 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > Fix For: 3.4.0 > > > We need an API documents for Spark Connect such as other components. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42490) Upgrade protobuf-java to 3.22.0
[ https://issues.apache.org/jira/browse/SPARK-42490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-42490: Assignee: Yang Jie > Upgrade protobuf-java to 3.22.0 > --- > > Key: SPARK-42490 > URL: https://issues.apache.org/jira/browse/SPARK-42490 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > > https://github.com/protocolbuffers/protobuf/releases/tag/v22.0 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42490) Upgrade protobuf-java to 3.22.0
[ https://issues.apache.org/jira/browse/SPARK-42490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-42490. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40084 [https://github.com/apache/spark/pull/40084] > Upgrade protobuf-java to 3.22.0 > --- > > Key: SPARK-42490 > URL: https://issues.apache.org/jira/browse/SPARK-42490 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Fix For: 3.5.0 > > > https://github.com/protocolbuffers/protobuf/releases/tag/v22.0 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42490) Upgrade protobuf-java to 3.22.0
[ https://issues.apache.org/jira/browse/SPARK-42490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-42490: - Priority: Minor (was: Major) > Upgrade protobuf-java to 3.22.0 > --- > > Key: SPARK-42490 > URL: https://issues.apache.org/jira/browse/SPARK-42490 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.5.0 > > > https://github.com/protocolbuffers/protobuf/releases/tag/v22.0 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42489) Upgrade scala-parser-combinators from 2.1.1 to 2.2.0
[ https://issues.apache.org/jira/browse/SPARK-42489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-42489. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40083 [https://github.com/apache/spark/pull/40083] > Upgrade scala-parser-combinators from 2.1.1 to 2.2.0 > > > Key: SPARK-42489 > URL: https://issues.apache.org/jira/browse/SPARK-42489 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.5.0 > > > https://github.com/scala/scala-parser-combinators/releases -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42489) Upgrade scala-parser-combinators from 2.1.1 to 2.2.0
[ https://issues.apache.org/jira/browse/SPARK-42489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-42489: Assignee: Yang Jie > Upgrade scala-parser-combinators from 2.1.1 to 2.2.0 > > > Key: SPARK-42489 > URL: https://issues.apache.org/jira/browse/SPARK-42489 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > > https://github.com/scala/scala-parser-combinators/releases -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42477) accept user_agent in spark connect's connection string
[ https://issues.apache.org/jira/browse/SPARK-42477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-42477. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 40054 [https://github.com/apache/spark/pull/40054] > accept user_agent in spark connect's connection string > --- > > Key: SPARK-42477 > URL: https://issues.apache.org/jira/browse/SPARK-42477 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.3.2 >Reporter: Niranjan Jayakar >Assignee: Niranjan Jayakar >Priority: Major > Fix For: 3.4.0 > > > Currently, the Spark Connect service's {{client_type}} attribute (which is > really the user agent) is set to {{_SPARK_CONNECT_PYTHON}} to signify PySpark. > Accept an optional {{user_agent}} parameter in the connection string and > plumb this down to the Spark Connect service. > This enables partners using Spark Connect to set their application as the > user agent, which then allows visibility into and measurement of integrations > and usage of Spark Connect. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
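[Editor's note] As a usage sketch, the parameter rides on the standard {{sc://}} connection string after a semicolon (assuming the 3.4 Python client); the host, port, and agent name below are placeholders.

{code:python}
# Connecting with a custom user agent via the Spark Connect connection string.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .remote("sc://localhost:15002/;user_agent=my_partner_app")
    .getOrCreate()
)
spark.range(5).show()
{code}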
[jira] [Assigned] (SPARK-42477) accept user_agent in spark connect's connection string
[ https://issues.apache.org/jira/browse/SPARK-42477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-42477: Assignee: Niranjan Jayakar > accept user_agent in spark connect's connection string > --- > > Key: SPARK-42477 > URL: https://issues.apache.org/jira/browse/SPARK-42477 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.3.2 >Reporter: Niranjan Jayakar >Assignee: Niranjan Jayakar >Priority: Major > > Currently, the Spark Connect service's {{client_type}} attribute (which is > really the user agent) is set to {{_SPARK_CONNECT_PYTHON}} to signify PySpark. > Accept an optional {{user_agent}} parameter in the connection string and > plumb this down to the Spark Connect service. > This enables partners using Spark Connect to set their application as the > user agent, which then allows visibility into and measurement of integrations > and usage of Spark Connect. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-42501) High level design doc for Spark ML
[ https://issues.apache.org/jira/browse/SPARK-42501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691169#comment-17691169 ] Weichen Xu edited comment on SPARK-42501 at 2/20/23 1:25 PM: - The doc is not ready yet. :) was (Author: weichenxu123): CC [~mengxr] [~grundprinzip-db] [~podongfeng] [~srowen] Thanks! > High level design doc for Spark ML > -- > > Key: SPARK-42501 > URL: https://issues.apache.org/jira/browse/SPARK-42501 > Project: Spark > Issue Type: Sub-task > Components: Connect, Documentation, ML >Affects Versions: 3.4.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42501) High level design doc for Spark ML
[ https://issues.apache.org/jira/browse/SPARK-42501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu updated SPARK-42501: --- Description: (was: Please find the HLD doc for spark ML via spark connect [here|https://docs.google.com/document/d/16_l3wXwbyPl6VwA0zOSdlrEymWVQIp5wfT9-MwKrmQU/edit?usp=sharing]. ) > High level design doc for Spark ML > -- > > Key: SPARK-42501 > URL: https://issues.apache.org/jira/browse/SPARK-42501 > Project: Spark > Issue Type: Sub-task > Components: Connect, Documentation, ML >Affects Versions: 3.4.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42501) High level design doc for Spark ML
[ https://issues.apache.org/jira/browse/SPARK-42501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691169#comment-17691169 ] Weichen Xu commented on SPARK-42501: CC [~mengxr] [~grundprinzip-db] [~podongfeng] [~srowen] Thanks! > High level design doc for Spark ML > -- > > Key: SPARK-42501 > URL: https://issues.apache.org/jira/browse/SPARK-42501 > Project: Spark > Issue Type: Sub-task > Components: Connect, Documentation, ML >Affects Versions: 3.4.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Major > > Please find the HLD doc for spark ML via spark connect > [here|https://docs.google.com/document/d/16_l3wXwbyPl6VwA0zOSdlrEymWVQIp5wfT9-MwKrmQU/edit?usp=sharing]. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42501) High level design doc for Spark ML
[ https://issues.apache.org/jira/browse/SPARK-42501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu reassigned SPARK-42501: -- Assignee: Weichen Xu > High level design doc for Spark ML > -- > > Key: SPARK-42501 > URL: https://issues.apache.org/jira/browse/SPARK-42501 > Project: Spark > Issue Type: Sub-task > Components: Connect, Documentation, ML >Affects Versions: 3.4.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Major > > Please find the HLD doc for spark ML via spark connect > [here|https://docs.google.com/document/d/16_l3wXwbyPl6VwA0zOSdlrEymWVQIp5wfT9-MwKrmQU/edit?usp=sharing]. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42501) High level design doc for Spark ML
[ https://issues.apache.org/jira/browse/SPARK-42501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu updated SPARK-42501: --- Description: Please find the HLD doc for spark ML via spark connect [here|https://docs.google.com/document/d/16_l3wXwbyPl6VwA0zOSdlrEymWVQIp5wfT9-MwKrmQU/edit?usp=sharing]. > High level design doc for Spark ML > -- > > Key: SPARK-42501 > URL: https://issues.apache.org/jira/browse/SPARK-42501 > Project: Spark > Issue Type: Sub-task > Components: Connect, Documentation, ML >Affects Versions: 3.4.0 >Reporter: Weichen Xu >Priority: Major > > Please find the HLD doc for spark ML via spark connect > [here|https://docs.google.com/document/d/16_l3wXwbyPl6VwA0zOSdlrEymWVQIp5wfT9-MwKrmQU/edit?usp=sharing]. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42501) High level design doc for Spark ML
[ https://issues.apache.org/jira/browse/SPARK-42501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu updated SPARK-42501: --- Component/s: Connect Documentation > High level design doc for Spark ML > -- > > Key: SPARK-42501 > URL: https://issues.apache.org/jira/browse/SPARK-42501 > Project: Spark > Issue Type: Sub-task > Components: Connect, Documentation, ML >Affects Versions: 3.4.0 >Reporter: Weichen Xu >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42501) High level design doc for Spark ML
Weichen Xu created SPARK-42501: -- Summary: High level design doc for Spark ML Key: SPARK-42501 URL: https://issues.apache.org/jira/browse/SPARK-42501 Project: Spark Issue Type: Sub-task Components: ML Affects Versions: 3.4.0 Reporter: Weichen Xu -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41741) [SQL] ParquetFilters StringStartsWith push down matching string does not use UTF-8
[ https://issues.apache.org/jira/browse/SPARK-41741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-41741. - Resolution: Fixed > [SQL] ParquetFilters StringStartsWith push down matching string does not use > UTF-8 > > > Key: SPARK-41741 > URL: https://issues.apache.org/jira/browse/SPARK-41741 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Jiale He >Assignee: Yuming Wang >Priority: Major > Fix For: 3.4.0, 3.3.3 > > Attachments: image-2022-12-28-18-00-00-861.png, > image-2022-12-28-18-00-21-586.png, image-2023-01-09-11-10-31-262.png, > image-2023-01-09-18-27-53-479.png, > part-0-30432312-7cdb-43ef-befe-93bcfd174878-c000.snappy.parquet > > > Hello ~ > > I found a problem with ParquetFilters pushdown. > > When the Parquet filter is pushed down and the query uses a like '***%' > predicate, an error may occur if the system default encoding is not UTF-8. > > There are two ways to bypass this problem as far as I know: > 1. spark.executor.extraJavaOptions="-Dfile.encoding=UTF-8" > 2. spark.sql.parquet.filterPushdown.string.startsWith=false > > The following is the information to reproduce this problem. > The parquet sample file is in the attachment. > {code:java} > spark.read.parquet("file:///home/kylin/hjldir/part-0-30432312-7cdb-43ef-befe-93bcfd174878-c000.snappy.parquet").createTempView("tmp") > spark.sql("select * from tmp where `1` like '啦啦乐乐%'").show(false) {code} > !image-2022-12-28-18-00-00-861.png|width=879,height=430! > > !image-2022-12-28-18-00-21-586.png|width=799,height=731! > > I think the correct code should be: > {code:java} > private val strToBinary = > Binary.fromReusedByteArray(v.getBytes(StandardCharsets.UTF_8)) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
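[Editor's note] The mechanism is easy to demonstrate: Parquet stores strings as UTF-8, so a pushed-down prefix encoded with the platform default charset only matches when that default happens to be UTF-8. A plain-Python illustration of the byte-level mismatch (not Spark code):

{code:python}
# The same prefix yields different bytes under different charsets; comparing
# non-UTF-8 bytes against UTF-8-encoded Parquet data breaks startsWith pushdown.
prefix = "啦啦乐乐"

utf8_bytes = prefix.encode("utf-8")  # what Parquet actually stores
gbk_bytes = prefix.encode("gbk")     # what a GBK default charset would yield

print(utf8_bytes.hex())         # bytes that match the stored data
print(gbk_bytes.hex())          # different bytes -> prefix never matches
print(utf8_bytes == gbk_bytes)  # False
{code}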
[jira] [Assigned] (SPARK-41741) [SQL] ParquetFilters StringStartsWith push down matching string does not use UTF-8
[ https://issues.apache.org/jira/browse/SPARK-41741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-41741: --- Fix Version/s: 3.4.0 3.3.3 Assignee: Yuming Wang > [SQL] ParquetFilters StringStartsWith push down matching string does not use > UTF-8 > > > Key: SPARK-41741 > URL: https://issues.apache.org/jira/browse/SPARK-41741 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Jiale He >Assignee: Yuming Wang >Priority: Major > Fix For: 3.4.0, 3.3.3 > > Attachments: image-2022-12-28-18-00-00-861.png, > image-2022-12-28-18-00-21-586.png, image-2023-01-09-11-10-31-262.png, > image-2023-01-09-18-27-53-479.png, > part-0-30432312-7cdb-43ef-befe-93bcfd174878-c000.snappy.parquet > > > Hello ~ > > I found a problem with ParquetFilters pushdown. > > When the Parquet filter is pushed down and the query uses a like '***%' > predicate, an error may occur if the system default encoding is not UTF-8. > > There are two ways to bypass this problem as far as I know: > 1. spark.executor.extraJavaOptions="-Dfile.encoding=UTF-8" > 2. spark.sql.parquet.filterPushdown.string.startsWith=false > > The following is the information to reproduce this problem. > The parquet sample file is in the attachment. > {code:java} > spark.read.parquet("file:///home/kylin/hjldir/part-0-30432312-7cdb-43ef-befe-93bcfd174878-c000.snappy.parquet").createTempView("tmp") > spark.sql("select * from tmp where `1` like '啦啦乐乐%'").show(false) {code} > !image-2022-12-28-18-00-00-861.png|width=879,height=430! > > !image-2022-12-28-18-00-21-586.png|width=799,height=731! > > I think the correct code should be: > {code:java} > private val strToBinary = > Binary.fromReusedByteArray(v.getBytes(StandardCharsets.UTF_8)) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42500) ConstantPropagation support more cases
[ https://issues.apache.org/jira/browse/SPARK-42500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691136#comment-17691136 ] Apache Spark commented on SPARK-42500: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/40093 > ConstantPropagation support more cases > -- > > Key: SPARK-42500 > URL: https://issues.apache.org/jira/browse/SPARK-42500 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42500) ConstantPropagation support more cases
[ https://issues.apache.org/jira/browse/SPARK-42500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42500: Assignee: Apache Spark > ConstantPropagation support more cases > -- > > Key: SPARK-42500 > URL: https://issues.apache.org/jira/browse/SPARK-42500 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42500) ConstantPropagation support more cases
[ https://issues.apache.org/jira/browse/SPARK-42500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42500: Assignee: (was: Apache Spark) > ConstantPropagation support more cases > -- > > Key: SPARK-42500 > URL: https://issues.apache.org/jira/browse/SPARK-42500 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42500) ConstantPropagation support more cases
[ https://issues.apache.org/jira/browse/SPARK-42500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691135#comment-17691135 ] Apache Spark commented on SPARK-42500: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/40093 > ConstantPropagation support more cases > -- > > Key: SPARK-42500 > URL: https://issues.apache.org/jira/browse/SPARK-42500 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42500) ConstantPropagation support more cases
Yuming Wang created SPARK-42500: --- Summary: ConstantPropagation support more cases Key: SPARK-42500 URL: https://issues.apache.org/jira/browse/SPARK-42500 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.5.0 Reporter: Yuming Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
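[Editor's note] For readers unfamiliar with the rule, ConstantPropagation substitutes attributes pinned by equality predicates into sibling predicates; a small illustration of the general idea (the ticket does not describe which new cases are covered).

{code:python}
# With `a = 1` known, the optimizer can rewrite `b = a + 1` to `b = 2`.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.range(10).selectExpr("id AS a", "id * 2 AS b").createOrReplaceTempView("t")

q = spark.sql("SELECT * FROM t WHERE a = 1 AND b = a + 1")
q.explain(True)  # the optimized plan filters on (a = 1) AND (b = 2)
{code}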
[jira] [Updated] (SPARK-42486) Upgrade ZooKeeper from 3.6.3 to 3.6.4
[ https://issues.apache.org/jira/browse/SPARK-42486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bjørn Jørgensen updated SPARK-42486: Affects Version/s: 3.4.0 > Upgrade ZooKeeper from 3.6.3 to 3.6.4 > - > > Key: SPARK-42486 > URL: https://issues.apache.org/jira/browse/SPARK-42486 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.4.0, 3.5.0 >Reporter: Bjørn Jørgensen >Priority: Major > > [ZooKeeper 3.6 is EoL since 30th December, > 2022|https://zookeeper.apache.org/releases.html] > [Release notes|https://zookeeper.apache.org/doc/r3.6.4/releasenotes.html] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42499) Support for Runtime SQL configuration
Ruifeng Zheng created SPARK-42499: - Summary: Support for Runtime SQL configuration Key: SPARK-42499 URL: https://issues.apache.org/jira/browse/SPARK-42499 Project: Spark Issue Type: Umbrella Components: Connect, PySpark Affects Versions: 3.4.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
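[Editor's note] The surface being targeted is presumably the existing {{spark.conf}} runtime API; a sketch of the behavior on a classic session that the Connect client would need to replicate.

{code:python}
# Getting and setting runtime SQL configurations on a session.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.conf.set("spark.sql.shuffle.partitions", "64")
print(spark.conf.get("spark.sql.shuffle.partitions"))  # '64'
# Static/core configs cannot be changed at runtime and raise AnalysisException.
{code}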
[jira] [Resolved] (SPARK-41959) Improve v1 writes with empty2null
[ https://issues.apache.org/jira/browse/SPARK-41959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-41959. - Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39475 [https://github.com/apache/spark/pull/39475] > Improve v1 writes with empty2null > - > > Key: SPARK-41959 > URL: https://issues.apache.org/jira/browse/SPARK-41959 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: XiDuo You >Priority: Major > Labels: correctness > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41324) Follow-up on JDK-8180450
[ https://issues.apache.org/jira/browse/SPARK-41324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691072#comment-17691072 ] Yang Jie commented on SPARK-41324: -- From the affected versions of [JDK-8180450|https://bugs.openjdk.org/browse/JDK-8180450], if Java 8 is used, will it still be affected? > Follow-up on JDK-8180450 > > > Key: SPARK-41324 > URL: https://issues.apache.org/jira/browse/SPARK-41324 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.2, 3.3.1 >Reporter: Herman van Hövell >Priority: Major > > Per [https://twitter.com/forked_franz/status/1597468851968831489] > We should follow up on: [https://bugs.openjdk.org/browse/JDK-8180450] > There are three concrete tasks here: > # Upgrade to Netty 4.1.84. > # (Optional) Write a benchmark that exercises this code path. Anchoring this > in the build will be a bit of a challenge though. > # Check if there are other places where this bug manifests itself. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42398) refine default column value framework
[ https://issues.apache.org/jira/browse/SPARK-42398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-42398. - Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 40049 [https://github.com/apache/spark/pull/40049] > refine default column value framework > - > > Key: SPARK-42398 > URL: https://issues.apache.org/jira/browse/SPARK-42398 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42398) refine default column value framework
[ https://issues.apache.org/jira/browse/SPARK-42398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-42398: --- Assignee: Wenchen Fan > refine default column value framework > - > > Key: SPARK-42398 > URL: https://issues.apache.org/jira/browse/SPARK-42398 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42498) reduce spark connect service retry time
[ https://issues.apache.org/jira/browse/SPARK-42498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691056#comment-17691056 ] Apache Spark commented on SPARK-42498: -- User 'nija-at' has created a pull request for this issue: https://github.com/apache/spark/pull/40066 > reduce spark connect service retry time > --- > > Key: SPARK-42498 > URL: https://issues.apache.org/jira/browse/SPARK-42498 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.3.2 >Reporter: Niranjan Jayakar >Priority: Major > > https://github.com/apache/spark/blob/5fc44dabe5084fb784f064afe691951a3c270793/python/pyspark/sql/connect/client.py#L411 > > Currently, 15 retries with the current backoff strategy result in the client > sitting in the retry loop for ~400 seconds in the worst case. This means > applications and users using the Spark Connect client will hang for >6 > minutes with no response. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42498) reduce spark connect service retry time
[ https://issues.apache.org/jira/browse/SPARK-42498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42498: Assignee: Apache Spark > reduce spark connect service retry time > --- > > Key: SPARK-42498 > URL: https://issues.apache.org/jira/browse/SPARK-42498 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.3.2 >Reporter: Niranjan Jayakar >Assignee: Apache Spark >Priority: Major > > https://github.com/apache/spark/blob/5fc44dabe5084fb784f064afe691951a3c270793/python/pyspark/sql/connect/client.py#L411 > > Currently, 15 retries with the current backoff strategy result in the client > sitting in the retry loop for ~400 seconds in the worst case. This means > applications and users using the Spark Connect client will hang for >6 > minutes with no response. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42498) reduce spark connect service retry time
[ https://issues.apache.org/jira/browse/SPARK-42498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42498: Assignee: (was: Apache Spark) > reduce spark connect service retry time > --- > > Key: SPARK-42498 > URL: https://issues.apache.org/jira/browse/SPARK-42498 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.3.2 >Reporter: Niranjan Jayakar >Priority: Major > > https://github.com/apache/spark/blob/5fc44dabe5084fb784f064afe691951a3c270793/python/pyspark/sql/connect/client.py#L411 > > Currently, 15 retries with the current backoff strategy result in the client > sitting in the retry loop for ~400 seconds in the worst case. This means > applications and users using the Spark Connect client will hang for >6 > minutes with no response. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org