[jira] [Created] (SPARK-42508) Extract the common .ml classes to `mllib-common`
Ruifeng Zheng created SPARK-42508: - Summary: Extract the common .ml classes to `mllib-common` Key: SPARK-42508 URL: https://issues.apache.org/jira/browse/SPARK-42508 Project: Spark Issue Type: Sub-task Components: Connect, ML Affects Versions: 3.4.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42507) Simplify ORC schema merging conflict error check
[ https://issues.apache.org/jira/browse/SPARK-42507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691420#comment-17691420 ] Apache Spark commented on SPARK-42507: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/40101 > Simplify ORC schema merging conflict error check > > > Key: SPARK-42507 > URL: https://issues.apache.org/jira/browse/SPARK-42507 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42507) Simplify ORC schema merging conflict error check
[ https://issues.apache.org/jira/browse/SPARK-42507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42507: Assignee: (was: Apache Spark) > Simplify ORC schema merging conflict error check > > > Key: SPARK-42507 > URL: https://issues.apache.org/jira/browse/SPARK-42507 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42507) Simplify ORC schema merging conflict error check
[ https://issues.apache.org/jira/browse/SPARK-42507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691419#comment-17691419 ] Apache Spark commented on SPARK-42507: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/40101 > Simplify ORC schema merging conflict error check > > > Key: SPARK-42507 > URL: https://issues.apache.org/jira/browse/SPARK-42507 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42507) Simplify ORC schema merging conflict error check
[ https://issues.apache.org/jira/browse/SPARK-42507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42507: Assignee: Apache Spark > Simplify ORC schema merging conflict error check > > > Key: SPARK-42507 > URL: https://issues.apache.org/jira/browse/SPARK-42507 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42507) Simplify ORC schema merging conflict error check
[ https://issues.apache.org/jira/browse/SPARK-42507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-42507: -- Summary: Simplify ORC schema merging conflict error check (was: Simplify schema merging conflict error check) > Simplify ORC schema merging conflict error check > > > Key: SPARK-42507 > URL: https://issues.apache.org/jira/browse/SPARK-42507 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42507) Simplify schema merging conflict error check
Dongjoon Hyun created SPARK-42507: - Summary: Simplify schema merging conflict error check Key: SPARK-42507 URL: https://issues.apache.org/jira/browse/SPARK-42507 Project: Spark Issue Type: Test Components: SQL, Tests Affects Versions: 3.4.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
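For reference, the ORC schema-merging conflict this check exercises can be reproduced with a minimal sketch like the following (hypothetical path; assumes a running `spark` session and the ORC reader's `mergeSchema` option available since Spark 3.0):

{code:scala}
// Sketch of a schema-merging conflict: two ORC files in one directory whose
// shared column `c` has incompatible types (LONG vs STRING).
spark.range(3).selectExpr("id AS c").write.mode("append").orc("/tmp/orc_merge_demo")
spark.range(3).selectExpr("CAST(id AS string) AS c").write.mode("append").orc("/tmp/orc_merge_demo")

// With schema merging enabled, inferring the read schema should fail with a
// merge-conflict error, which is the error path this test issue simplifies.
spark.read.option("mergeSchema", "true").orc("/tmp/orc_merge_demo").printSchema()
{code}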
[jira] [Assigned] (SPARK-37099) Introduce a rank-based filter to optimize top-k computation
[ https://issues.apache.org/jira/browse/SPARK-37099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-37099: --- Assignee: jiaan.geng > Introduce a rank-based filter to optimize top-k computation > --- > > Key: SPARK-37099 > URL: https://issues.apache.org/jira/browse/SPARK-37099 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: jiaan.geng >Priority: Major > Fix For: 3.5.0 > > Attachments: q67.png, q67_optimized.png, skewed_window.png > > > At JD, we found that more than 90% of window function usage follows this > pattern: > {code:java} > select (... (row_number|rank|dense_rank) () over( [partition by ...] order > by ... ) as rn) > where rn (==|<|<=) k and other conditions{code} > > However, the existing physical plan is not optimal: > > 1. We should select the local top-k records within each partition and then > compute the global top-k; this helps reduce the shuffle amount. > > For these three rank functions (row_number|rank|dense_rank), the rank of a > key computed on a partial dataset is always <= its final rank computed on > the whole dataset, so we can safely discard rows whose partial rank > k > anywhere. > > > 2. Skewed window: some partitions are skewed and take a long time to finish > computation. > > A real-world skewed-window case in our system is attached. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37099) Introduce a rank-based filter to optimize top-k computation
[ https://issues.apache.org/jira/browse/SPARK-37099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-37099. - Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 38799 [https://github.com/apache/spark/pull/38799] > Introduce a rank-based filter to optimize top-k computation > --- > > Key: SPARK-37099 > URL: https://issues.apache.org/jira/browse/SPARK-37099 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > Fix For: 3.5.0 > > Attachments: q67.png, q67_optimized.png, skewed_window.png > > > At JD, we found that more than 90% of window function usage follows this > pattern: > {code:java} > select (... (row_number|rank|dense_rank) () over( [partition by ...] order > by ... ) as rn) > where rn (==|<|<=) k and other conditions{code} > > However, the existing physical plan is not optimal: > > 1. We should select the local top-k records within each partition and then > compute the global top-k; this helps reduce the shuffle amount. > > For these three rank functions (row_number|rank|dense_rank), the rank of a > key computed on a partial dataset is always <= its final rank computed on > the whole dataset, so we can safely discard rows whose partial rank > k > anywhere. > > > 2. Skewed window: some partitions are skewed and take a long time to finish > computation. > > A real-world skewed-window case in our system is attached. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
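To make the targeted pattern concrete, here is a minimal sketch of the kind of top-k-per-group query the rank-based filter optimizes. The table and column names (`orders`, `store_id`, `amount`) are hypothetical, and a running `spark` session is assumed:

{code:scala}
// Top 10 orders per store by amount. With the rank-based filter from
// SPARK-37099, rows whose partial rank already exceeds 10 can be discarded
// before the shuffle, instead of ranking every row globally first.
val topK = spark.sql("""
  SELECT * FROM (
    SELECT store_id, order_id, amount,
           row_number() OVER (PARTITION BY store_id ORDER BY amount DESC) AS rn
    FROM orders
  ) t
  WHERE rn <= 10
""")
topK.show()
{code}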
[jira] [Assigned] (SPARK-42506) Fix Sort's maxRowsPerPartition if maxRows does not exist
[ https://issues.apache.org/jira/browse/SPARK-42506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42506: Assignee: (was: Apache Spark) > Fix Sort's maxRowsPerPartition if maxRows does not exist > > > Key: SPARK-42506 > URL: https://issues.apache.org/jira/browse/SPARK-42506 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42506) Fix Sort's maxRowsPerPartition if maxRows does not exist
[ https://issues.apache.org/jira/browse/SPARK-42506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42506: Assignee: Apache Spark > Fix Sort's maxRowsPerPartition if maxRows does not exist > > > Key: SPARK-42506 > URL: https://issues.apache.org/jira/browse/SPARK-42506 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42506) Fix Sort's maxRowsPerPartition if maxRows does not exist
[ https://issues.apache.org/jira/browse/SPARK-42506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691414#comment-17691414 ] Apache Spark commented on SPARK-42506: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/40100 > Fix Sort's maxRowsPerPartition if maxRows does not exist > > > Key: SPARK-42506 > URL: https://issues.apache.org/jira/browse/SPARK-42506 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42506) Fix Sort's maxRowsPerPartition if maxRows does not exist
Yuming Wang created SPARK-42506: --- Summary: Fix Sort's maxRowsPerPartition if maxRows does not exist Key: SPARK-42506 URL: https://issues.apache.org/jira/browse/SPARK-42506 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: Yuming Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42505) Apply entrypoint template change to 3.3.0/3.3.1
Yikun Jiang created SPARK-42505: --- Summary: Apply entrypoint template change to 3.3.0/3.3.1 Key: SPARK-42505 URL: https://issues.apache.org/jira/browse/SPARK-42505 Project: Spark Issue Type: Sub-task Components: Spark Docker Affects Versions: 3.5.0 Reporter: Yikun Jiang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42494) Add official image Dockerfile for Spark v3.3.2
[ https://issues.apache.org/jira/browse/SPARK-42494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yikun Jiang resolved SPARK-42494. - Resolution: Fixed Resolved by https://github.com/apache/spark-docker/pull/30 > Add official image Dockerfile for Spark v3.3.2 > -- > > Key: SPARK-42494 > URL: https://issues.apache.org/jira/browse/SPARK-42494 > Project: Spark > Issue Type: Sub-task > Components: Spark Docker >Affects Versions: 3.3.2 >Reporter: Yikun Jiang >Assignee: Yikun Jiang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42494) Add official image Dockerfile for Spark v3.3.2
[ https://issues.apache.org/jira/browse/SPARK-42494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yikun Jiang reassigned SPARK-42494: --- Assignee: Yikun Jiang > Add official image Dockerfile for Spark v3.3.2 > -- > > Key: SPARK-42494 > URL: https://issues.apache.org/jira/browse/SPARK-42494 > Project: Spark > Issue Type: Sub-task > Components: Spark Docker >Affects Versions: 3.3.2 >Reporter: Yikun Jiang >Assignee: Yikun Jiang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40278) Used databricks spark-sql-perf with Spark 3.3 to run 3TB TPCDS test failed
[ https://issues.apache.org/jira/browse/SPARK-40278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691395#comment-17691395 ] Yang Jie commented on SPARK-40278: -- The SQL did not fail; the UI may break. [~ulysses] explained this at [https://github.com/apache/spark/pull/35149#issuecomment-1231712806] and I believe he tried to fix the issue, but I'm not sure whether it has been fixed > Used databricks spark-sql-perf with Spark 3.3 to run 3TB TPCDS test failed > -- > > Key: SPARK-40278 > URL: https://issues.apache.org/jira/browse/SPARK-40278 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yang Jie >Priority: Major > > I used databricks spark-sql-perf + Spark 3.3 to run 3TB TPCDS q24a or q24b; > the test code is as follows: > {code:java} > val rootDir = "hdfs://${clusterName}/tpcds-data/POCGenData3T" > val databaseName = "tpcds_database" > val scaleFactor = "3072" > val format = "parquet" > import com.databricks.spark.sql.perf.tpcds.TPCDSTables > val tables = new TPCDSTables( > spark.sqlContext, dsdgenDir = "./tpcds-kit/tools", > scaleFactor = scaleFactor, > useDoubleForDecimal = false, useStringForDate = false) > spark.sql(s"create database $databaseName") > tables.createTemporaryTables(rootDir, format) > spark.sql(s"use $databaseName") // TPCDS 24a or 24b > val result = spark.sql(""" with ssales as > (select c_last_name, c_first_name, s_store_name, ca_state, s_state, i_color, > i_current_price, i_manager_id, i_units, i_size, sum(ss_net_paid) > netpaid > from store_sales, store_returns, store, item, customer, customer_address > where ss_ticket_number = sr_ticket_number > and ss_item_sk = sr_item_sk > and ss_customer_sk = c_customer_sk > and ss_item_sk = i_item_sk > and ss_store_sk = s_store_sk > and c_birth_country = upper(ca_country) > and s_zip = ca_zip > and s_market_id = 8 > group by c_last_name, c_first_name, s_store_name, ca_state, s_state, i_color, > i_current_price, i_manager_id, i_units, i_size) > select c_last_name, c_first_name, s_store_name, sum(netpaid) paid > from ssales > where i_color = 'pale' > group by c_last_name, c_first_name, s_store_name > having sum(netpaid) > (select 0.05*avg(netpaid) from ssales)""").collect() > sc.stop() {code} > The above test may fail due to `Stage cancelled because SparkContext was > shut down` on stage 31 and stage 36 when AQE is enabled, as follows: > > !image-2022-08-30-21-09-48-763.png! > !image-2022-08-30-21-10-24-862.png! > !image-2022-08-30-21-10-57-128.png! > > The DAG corresponding to the SQL is as follows: > !image-2022-08-30-21-11-50-895.png!
> The details are as follows: > > > {code:java} > == Physical Plan == > AdaptiveSparkPlan (42) > +- == Final Plan == >LocalTableScan (1) > +- == Initial Plan == >Filter (41) >+- HashAggregate (40) > +- Exchange (39) > +- HashAggregate (38) > +- HashAggregate (37) >+- Exchange (36) > +- HashAggregate (35) > +- Project (34) > +- BroadcastHashJoin Inner BuildRight (33) >:- Project (29) >: +- BroadcastHashJoin Inner BuildRight (28) >: :- Project (24) >: : +- BroadcastHashJoin Inner BuildRight (23) >: : :- Project (19) >: : : +- BroadcastHashJoin Inner > BuildRight (18) >: : : :- Project (13) >: : : : +- SortMergeJoin Inner (12) >: : : : :- Sort (6) >: : : : : +- Exchange (5) >: : : : : +- Project (4) >: : : : :+- Filter (3) >: : : : : +- Scan > parquet (2) >: : : : +- Sort (11) >: : : :+- Exchange (10) >: : : : +- Project (9) >: : : : +- Filter (8) >: : : : +- Scan > parquet (7) >: : : +- BroadcastExchange (17) >: : :+- Project (16) >: : : +- Filter (15) >: : : +- Scan parquet (14) >: :
[jira] [Comment Edited] (SPARK-40278) Used databricks spark-sql-perf with Spark 3.3 to run 3TB TPCDS test failed
[ https://issues.apache.org/jira/browse/SPARK-40278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691395#comment-17691395 ] Yang Jie edited comment on SPARK-40278 at 2/21/23 5:49 AM: --- The SQL did not fail; the UI may break. [~ulysses] explained this at [https://github.com/apache/spark/pull/35149#issuecomment-1231712806] and I believe he tried to fix the issue, but I'm not sure whether it has been fixed was (Author: luciferyang): The SQL did not fail; the UI may break. [~ulysses] explained this at [https://github.com/apache/spark/pull/35149#issuecomment-1231712806] and I believe he tried to fix the issue, but I'm not sure whether it has been fixed > Used databricks spark-sql-perf with Spark 3.3 to run 3TB TPCDS test failed > -- > > Key: SPARK-40278 > URL: https://issues.apache.org/jira/browse/SPARK-40278 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yang Jie >Priority: Major > > I used databricks spark-sql-perf + Spark 3.3 to run 3TB TPCDS q24a or q24b; > the test code is as follows: > {code:java} > val rootDir = "hdfs://${clusterName}/tpcds-data/POCGenData3T" > val databaseName = "tpcds_database" > val scaleFactor = "3072" > val format = "parquet" > import com.databricks.spark.sql.perf.tpcds.TPCDSTables > val tables = new TPCDSTables( > spark.sqlContext, dsdgenDir = "./tpcds-kit/tools", > scaleFactor = scaleFactor, > useDoubleForDecimal = false, useStringForDate = false) > spark.sql(s"create database $databaseName") > tables.createTemporaryTables(rootDir, format) > spark.sql(s"use $databaseName") // TPCDS 24a or 24b > val result = spark.sql(""" with ssales as > (select c_last_name, c_first_name, s_store_name, ca_state, s_state, i_color, > i_current_price, i_manager_id, i_units, i_size, sum(ss_net_paid) > netpaid > from store_sales, store_returns, store, item, customer, customer_address > where ss_ticket_number = sr_ticket_number > and ss_item_sk = sr_item_sk > and ss_customer_sk = c_customer_sk > and ss_item_sk = i_item_sk > and ss_store_sk = s_store_sk > and c_birth_country = upper(ca_country) > and s_zip = ca_zip > and s_market_id = 8 > group by c_last_name, c_first_name, s_store_name, ca_state, s_state, i_color, > i_current_price, i_manager_id, i_units, i_size) > select c_last_name, c_first_name, s_store_name, sum(netpaid) paid > from ssales > where i_color = 'pale' > group by c_last_name, c_first_name, s_store_name > having sum(netpaid) > (select 0.05*avg(netpaid) from ssales)""").collect() > sc.stop() {code} > The above test may fail due to `Stage cancelled because SparkContext was > shut down` on stage 31 and stage 36 when AQE is enabled, as follows: > > !image-2022-08-30-21-09-48-763.png! > !image-2022-08-30-21-10-24-862.png! > !image-2022-08-30-21-10-57-128.png! > > The DAG corresponding to the SQL is as follows: > !image-2022-08-30-21-11-50-895.png!
> The details are as follows: > > > {code:java} > == Physical Plan == > AdaptiveSparkPlan (42) > +- == Final Plan == >LocalTableScan (1) > +- == Initial Plan == >Filter (41) >+- HashAggregate (40) > +- Exchange (39) > +- HashAggregate (38) > +- HashAggregate (37) >+- Exchange (36) > +- HashAggregate (35) > +- Project (34) > +- BroadcastHashJoin Inner BuildRight (33) >:- Project (29) >: +- BroadcastHashJoin Inner BuildRight (28) >: :- Project (24) >: : +- BroadcastHashJoin Inner BuildRight (23) >: : :- Project (19) >: : : +- BroadcastHashJoin Inner > BuildRight (18) >: : : :- Project (13) >: : : : +- SortMergeJoin Inner (12) >: : : : :- Sort (6) >: : : : : +- Exchange (5) >: : : : : +- Project (4) >: : : : :+- Filter (3) >: : : : : +- Scan > parquet (2) >: : : : +- Sort (11) >: : : :+- Exchange (10) >: : : : +- Project (9) >: : : : +- Filter (8) >: : : : +- Scan > parquet (7) >
[jira] [Commented] (SPARK-42503) Spark SQL should do further validation on join condition fields
[ https://issues.apache.org/jira/browse/SPARK-42503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691391#comment-17691391 ] ming95 commented on SPARK-42503: -- I tested Hive and MySQL; they also do not have this validation. But I still think this restriction should be added, because a join condition that references neither the left nor the right table is meaningless. > Spark SQL should do further validation on join condition fields > --- > > Key: SPARK-42503 > URL: https://issues.apache.org/jira/browse/SPARK-42503 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.2 >Reporter: ming95 >Priority: Major > > In Spark SQL, a join condition is allowed to use fields that come from > neither the left table nor the right table. In this case, the join > degenerates into a cross join. > Suppose you have two tables, test1 and test2, which have the same table > schema: > {code:java} > CREATE TABLE `default`.`test1` ( > `id` INT, > `name` STRING, > `age` INT, > `dt` STRING) > USING parquet > PARTITIONED BY (dt) {code} > The following SQL joins three tables, but in the last left join the > condition is `t1.name=t2.name` and t3.name is not used, so the last left > join becomes a cross join. > {code:java} > select * > from > (select * from test1 where dt="20230215" and age=1 ) t1 > left join > (select * from test1 where dt=="20230215" and age=2) t2 > on t1.name=t2.name > left join > (select * from test2 where dt="20230215") t3 > on > t1.name=t2.name; {code} > So I think Spark SQL should do further validation on the join condition: the > fields of the join condition must come from the left table or the right > table; otherwise an `AnalysisException` should be thrown. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
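To make the degenerate case easy to see, a minimal sketch (reusing the tables from the description above, which are assumed to already exist) is to run the reported query and inspect the physical plan:

{code:scala}
// The last join's condition references only t1 and t2, so it does not
// constrain t3 at all; the plan is expected to show a
// BroadcastNestedLoopJoin (or CartesianProduct) for the t3 join.
spark.sql("""
  SELECT *
  FROM (SELECT * FROM test1 WHERE dt = '20230215' AND age = 1) t1
  LEFT JOIN (SELECT * FROM test1 WHERE dt = '20230215' AND age = 2) t2
    ON t1.name = t2.name
  LEFT JOIN (SELECT * FROM test2 WHERE dt = '20230215') t3
    ON t1.name = t2.name
""").explain()
{code}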
[jira] [Commented] (SPARK-40278) Used databricks spark-sql-perf with Spark 3.3 to run 3TB TPCDS test failed
[ https://issues.apache.org/jira/browse/SPARK-40278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691388#comment-17691388 ] Yuming Wang commented on SPARK-40278: - [~LuciferYang] Does this issue still exist? > Used databricks spark-sql-perf with Spark 3.3 to run 3TB TPCDS test failed > -- > > Key: SPARK-40278 > URL: https://issues.apache.org/jira/browse/SPARK-40278 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Yang Jie >Priority: Major > > I used databricks spark-sql-perf + Spark 3.3 to run 3TB TPCDS q24a or q24b; > the test code is as follows: > {code:java} > val rootDir = "hdfs://${clusterName}/tpcds-data/POCGenData3T" > val databaseName = "tpcds_database" > val scaleFactor = "3072" > val format = "parquet" > import com.databricks.spark.sql.perf.tpcds.TPCDSTables > val tables = new TPCDSTables( > spark.sqlContext, dsdgenDir = "./tpcds-kit/tools", > scaleFactor = scaleFactor, > useDoubleForDecimal = false, useStringForDate = false) > spark.sql(s"create database $databaseName") > tables.createTemporaryTables(rootDir, format) > spark.sql(s"use $databaseName") // TPCDS 24a or 24b > val result = spark.sql(""" with ssales as > (select c_last_name, c_first_name, s_store_name, ca_state, s_state, i_color, > i_current_price, i_manager_id, i_units, i_size, sum(ss_net_paid) > netpaid > from store_sales, store_returns, store, item, customer, customer_address > where ss_ticket_number = sr_ticket_number > and ss_item_sk = sr_item_sk > and ss_customer_sk = c_customer_sk > and ss_item_sk = i_item_sk > and ss_store_sk = s_store_sk > and c_birth_country = upper(ca_country) > and s_zip = ca_zip > and s_market_id = 8 > group by c_last_name, c_first_name, s_store_name, ca_state, s_state, i_color, > i_current_price, i_manager_id, i_units, i_size) > select c_last_name, c_first_name, s_store_name, sum(netpaid) paid > from ssales > where i_color = 'pale' > group by c_last_name, c_first_name, s_store_name > having sum(netpaid) > (select 0.05*avg(netpaid) from ssales)""").collect() > sc.stop() {code} > The above test may fail due to `Stage cancelled because SparkContext was > shut down` on stage 31 and stage 36 when AQE is enabled, as follows: > > !image-2022-08-30-21-09-48-763.png! > !image-2022-08-30-21-10-24-862.png! > !image-2022-08-30-21-10-57-128.png! > > The DAG corresponding to the SQL is as follows: > !image-2022-08-30-21-11-50-895.png! > The details are as follows: > > > {code:java} > == Physical Plan == > AdaptiveSparkPlan (42) > +- == Final Plan == >LocalTableScan (1) > +- == Initial Plan == >Filter (41) >+- HashAggregate (40) > +- Exchange (39) > +- HashAggregate (38) > +- HashAggregate (37) >+- Exchange (36) > +- HashAggregate (35) > +- Project (34) > +- BroadcastHashJoin Inner BuildRight (33) >:- Project (29) >: +- BroadcastHashJoin Inner BuildRight (28) >: :- Project (24) >: : +- BroadcastHashJoin Inner BuildRight (23) >: : :- Project (19) >: : : +- BroadcastHashJoin Inner > BuildRight (18) >: : : :- Project (13) >: : : : +- SortMergeJoin Inner (12) >: : : : :- Sort (6) >: : : : : +- Exchange (5) >: : : : : +- Project (4) >: : : : :+- Filter (3) >: : : : : +- Scan > parquet (2) >: : : : +- Sort (11) >: : : :+- Exchange (10) >: : : : +- Project (9) >: : : : +- Filter (8) >: : : : +- Scan > parquet (7) >: : : +- BroadcastExchange (17) >: : :+- Project (16) >: : : +- Filter (15) >: : : +- Scan parquet (14) >: : +- BroadcastExchange (22) >: :+- Filter (21) >: : +- Scan parquet (20) >
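Since the report ties the failure to AQE being enabled, one quick diagnostic (a sketch; the config name is standard, the rest assumes the setup from the description) is to rerun the query with AQE toggled off and compare:

{code:scala}
// Re-running q24a/q24b with adaptive execution disabled helps confirm
// whether the cancelled stages come from AQE re-planning rather than the
// query itself (per the comment above, the SQL succeeds and only the UI
// rendering of the adaptive plan may break).
spark.conf.set("spark.sql.adaptive.enabled", "false")
// ... rerun the query from the description ...
spark.conf.set("spark.sql.adaptive.enabled", "true")
{code}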
[jira] [Commented] (SPARK-40610) Spark falls back to getPartitions instead of getPartitionsByFilter when the date_add function is used in the where clause
[ https://issues.apache.org/jira/browse/SPARK-40610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691387#comment-17691387 ] Yuming Wang commented on SPARK-40610: - [~icyjhl] What's your dt data type? date, string or timestamp? > Spark falls back to getPartitions instead of getPartitionsByFilter when > the date_add function is used in the where clause > --- > > Key: SPARK-40610 > URL: https://issues.apache.org/jira/browse/SPARK-40610 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.1 > Environment: edw.tmp_test_metastore_usage_source is a big table with > 1000 partitions and hundreds of columns >Reporter: icyjhl >Priority: Major > Attachments: spark_error.log, spark_sql.sql, sql_in_mysql.sql > > When I run an insert overwrite statement, I get an error saying: > {code:java} > MetaStoreClient lost connection. Attempting to reconnect (1 of 1) after 1s. > listPartitions {code} > It's weird, as I only selected about 3 partitions, so I reran the SQL and > checked the metastore, and found it was fetching all columns in all > partitions: > {code:java} > select "CD_ID", "COMMENT", "COLUMN_NAME", "TYPE_NAME" from "COLUMNS_V2" where > "CD_ID" > in > (675384,675393,675385,675394,675396,675397,675395,675398,675399,675401,675402,675400,675406……){code} > > After testing, I found the problem is with the date_add function in the where > clause: if I remove it, the SQL works fine; otherwise the metastore fetches > all columns in all partitions. > > {code:java} > insert overwrite table test.tmp_test_metastore_usage > SELECT userid > ,SUBSTR(sendtime,1,10) AS creation_date > ,cast(json_bh_esdate_deltadays_max as DECIMAL(38,2)) AS > bh_esdate_deltadays_max > ,json_bh_qiye_industryphyname AS bh_qiye_industryphyname > ,cast(json_bh_esdate_deltadays_min as DECIMAL(38,2)) AS > bh_esdate_deltadays_min > ,cast(json_bh_subconam_min as DECIMAL(38,2)) AS bh_subconam_min > ,cast(json_bh_qiye_regcap_min as DECIMAL(38,2)) AS bh_qiye_regcap_min > ,json_bh_industryphyname AS bh_industryphyname > ,cast(json_bh_subconam_mean as DECIMAL(38,2)) AS bh_subconam_mean > ,cast(json_bh_industryphyname_nunique as DECIMAL(38,2)) AS > bh_industryphyname_nunique > ,cast(current_timestamp() as string) as dw_cre_date > ,cast(current_timestamp() as string) as dw_upd_date > FROM ( > SELECT userid > ,sendtime > ,json_bh_esdate_deltadays_max > ,json_bh_qiye_industryphyname > ,json_bh_esdate_deltadays_min > ,json_bh_subconam_min > ,json_bh_qiye_regcap_min > ,json_bh_industryphyname > ,json_bh_subconam_mean > ,json_bh_industryphyname_nunique > ,row_number() OVER ( > PARTITION BY userid,dt ORDER BY sendtime DESC > ) rn > FROM edw.tmp_test_metastore_usage_source > WHERE dt >= date_add('2022-09-22',-3 ) > AND json_bizid IN ('6101') > AND json_dingid IN ('611') > ) t > WHERE rn = 1 {code} > By the way, 2.4.7 works fine. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
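A common workaround sketch for this class of partition-pruning fallback (assuming `dt` is a string partition column holding 'yyyy-MM-dd' values, which is exactly what the comment above asks about) is to fold the date arithmetic on the driver, so the pushed-down metastore filter only sees a plain literal comparison:

{code:scala}
import java.time.LocalDate

// Compute the boundary on the driver instead of calling date_add() in the
// WHERE clause; a simple `dt >= '<literal>'` comparison is the shape that
// Hive's getPartitionsByFilter can typically push down.
val cutoff = LocalDate.parse("2022-09-22").minusDays(3).toString  // "2022-09-19"

val df = spark.sql(
  s"""
     |SELECT userid, sendtime
     |FROM edw.tmp_test_metastore_usage_source
     |WHERE dt >= '$cutoff'
     |  AND json_bizid IN ('6101')
     |  AND json_dingid IN ('611')
     |""".stripMargin)
{code}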
[jira] [Assigned] (SPARK-42504) NestedColumnAliasing support pruning adjacent projects
[ https://issues.apache.org/jira/browse/SPARK-42504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42504: Assignee: Apache Spark > NestedColumnAliasing support pruning adjacent projects > -- > > Key: SPARK-42504 > URL: https://issues.apache.org/jira/browse/SPARK-42504 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: XiDuo You >Assignee: Apache Spark >Priority: Major > > CollapseProject won't combine adjacent Projects into one, e.g. when a > non-cheap expression from the lower Project is accessed more than once by > the Project above. Then adjacent Project nodes can appear that > NestedColumnAliasing does not support pruning. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42504) NestedColumnAliasing support pruning adjacent projects
[ https://issues.apache.org/jira/browse/SPARK-42504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691354#comment-17691354 ] Apache Spark commented on SPARK-42504: -- User 'ulysses-you' has created a pull request for this issue: https://github.com/apache/spark/pull/40098 > NestedColumnAliasing support pruning adjacent projects > -- > > Key: SPARK-42504 > URL: https://issues.apache.org/jira/browse/SPARK-42504 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: XiDuo You >Priority: Major > > CollapseProject won't combine adjacent Projects into one, e.g. when a > non-cheap expression from the lower Project is accessed more than once by > the Project above. Then adjacent Project nodes can appear that > NestedColumnAliasing does not support pruning. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42504) NestedColumnAliasing support pruning adjacent projects
[ https://issues.apache.org/jira/browse/SPARK-42504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42504: Assignee: (was: Apache Spark) > NestedColumnAliasing support pruning adjacent projects > -- > > Key: SPARK-42504 > URL: https://issues.apache.org/jira/browse/SPARK-42504 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: XiDuo You >Priority: Major > > CollapseProject won't combine adjacent Projects into one, e.g. when a > non-cheap expression from the lower Project is accessed more than once by > the Project above. Then adjacent Project nodes can appear that > NestedColumnAliasing does not support pruning. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42504) NestedColumnAliasing support pruning adjacent projects
[ https://issues.apache.org/jira/browse/SPARK-42504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] XiDuo You updated SPARK-42504: -- Description: CollapseProject won't combine adjacent Projects into one, e.g. when a non-cheap expression from the lower Project is accessed more than once by the Project above. Then adjacent Project nodes can appear that NestedColumnAliasing does not support pruning. was: CollapseProject won't combine adjacent Projects into one, e.g. when a non-cheap expression from the lower Project is accessed more than once by the Project above. > NestedColumnAliasing support pruning adjacent projects > -- > > Key: SPARK-42504 > URL: https://issues.apache.org/jira/browse/SPARK-42504 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: XiDuo You >Priority: Major > > CollapseProject won't combine adjacent Projects into one, e.g. when a > non-cheap expression from the lower Project is accessed more than once by > the Project above. Then adjacent Project nodes can appear that > NestedColumnAliasing does not support pruning. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42504) NestedColumnAliasing support pruning adjacent projects
XiDuo You created SPARK-42504: - Summary: NestedColumnAliasing support pruning adjacent projects Key: SPARK-42504 URL: https://issues.apache.org/jira/browse/SPARK-42504 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.5.0 Reporter: XiDuo You CollapseProject won't combine adjacent Projects into one, e.g. when a non-cheap expression from the lower Project is accessed more than once by the Project above. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
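As an illustration of the plan shape being described, the following sketch (hypothetical data; assumes a running `spark` session) builds a struct column plus a non-cheap expression that is referenced twice, which can leave adjacent Project nodes behind:

{code:scala}
import org.apache.spark.sql.functions._

// A struct column `s` and a non-cheap expression (sha2) referenced twice by
// the upper select: CollapseProject keeps the two Projects separate to avoid
// duplicating the expensive computation.
val df = spark.range(10)
  .select(struct(col("id").as("a"), (col("id") * 2).as("b")).as("s"))

val heavy  = df.select(col("s"), sha2(col("s.a").cast("string"), 256).as("h"))
val result = heavy.select(col("s.b"), col("h"), length(col("h")).as("h_len"))

// Inspect whether adjacent Projects remain and whether only s.a / s.b are
// extracted from the struct (the pruning this issue extends).
result.explain(true)
{code}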
[jira] [Updated] (SPARK-42501) High level design doc for Spark ML
[ https://issues.apache.org/jira/browse/SPARK-42501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-42501: - Parent: SPARK-42471 Issue Type: Sub-task (was: New Feature) > High level design doc for Spark ML > -- > > Key: SPARK-42501 > URL: https://issues.apache.org/jira/browse/SPARK-42501 > Project: Spark > Issue Type: Sub-task > Components: Connect, Documentation, ML >Affects Versions: 3.4.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42412) Initial prototype implementation for PySparkML
[ https://issues.apache.org/jira/browse/SPARK-42412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-42412: - Epic Link: (was: SPARK-39375) > Initial prototype implementation for PySparkML > -- > > Key: SPARK-42412 > URL: https://issues.apache.org/jira/browse/SPARK-42412 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.5.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42412) Initial prototype implementation for PySparkML
[ https://issues.apache.org/jira/browse/SPARK-42412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-42412: - Parent: SPARK-42471 Issue Type: Sub-task (was: New Feature) > Initial prototype implementation for PySparkML > -- > > Key: SPARK-42412 > URL: https://issues.apache.org/jira/browse/SPARK-42412 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.5.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42412) Initial prototype implementation for PySparkML
[ https://issues.apache.org/jira/browse/SPARK-42412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-42412: - Epic Link: SPARK-39375 > Initial prototype implementation for PySparkML > -- > > Key: SPARK-42412 > URL: https://issues.apache.org/jira/browse/SPARK-42412 > Project: Spark > Issue Type: New Feature > Components: Connect, PySpark >Affects Versions: 3.5.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42501) High level design doc for Spark ML
[ https://issues.apache.org/jira/browse/SPARK-42501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-42501: - Parent: (was: SPARK-39375) Issue Type: New Feature (was: Sub-task) > High level design doc for Spark ML > -- > > Key: SPARK-42501 > URL: https://issues.apache.org/jira/browse/SPARK-42501 > Project: Spark > Issue Type: New Feature > Components: Connect, Documentation, ML >Affects Versions: 3.4.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42412) Initial prototype implementation for PySparkML
[ https://issues.apache.org/jira/browse/SPARK-42412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-42412: - Parent: (was: SPARK-39375) Issue Type: New Feature (was: Sub-task) > Initial prototype implementation for PySparkML > -- > > Key: SPARK-42412 > URL: https://issues.apache.org/jira/browse/SPARK-42412 > Project: Spark > Issue Type: New Feature > Components: Connect, PySpark >Affects Versions: 3.5.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42502) scala: accept user_agent in spark connect's connection string
[ https://issues.apache.org/jira/browse/SPARK-42502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-42502: - Fix Version/s: (was: 3.4.0) > scala: accept user_agent in spark connect's connection string > - > > Key: SPARK-42502 > URL: https://issues.apache.org/jira/browse/SPARK-42502 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.3.2 >Reporter: Niranjan Jayakar >Assignee: Niranjan Jayakar >Priority: Major > > Currently, the Spark Connect service's {{client_type}} attribute (which is > really the user agent) is set to {{_SPARK_CONNECT_PYTHON}} to signify PySpark. > Accept an optional {{user_agent}} parameter in the connection string and > plumb this down to the Spark Connect service. > This enables partners using Spark Connect to set their application as the > user agent, which then allows visibility into and measurement of integrations > and usage of Spark Connect. > This is already done for the Python client: > https://github.com/apache/spark/commit/b887d3de954ae5b2482087fe08affcc4ac60c669 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42502) scala: accept user_agent in spark connect's connection string
[ https://issues.apache.org/jira/browse/SPARK-42502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-42502: Assignee: (was: Niranjan Jayakar) > scala: accept user_agent in spark connect's connection string > - > > Key: SPARK-42502 > URL: https://issues.apache.org/jira/browse/SPARK-42502 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.3.2 >Reporter: Niranjan Jayakar >Priority: Major > > Currently, the Spark Connect service's {{client_type}} attribute (which is > really the user agent) is set to {{_SPARK_CONNECT_PYTHON}} to signify PySpark. > Accept an optional {{user_agent}} parameter in the connection string and > plumb this down to the Spark Connect service. > This enables partners using Spark Connect to set their application as the > user agent, which then allows visibility into and measurement of integrations > and usage of Spark Connect. > This is already done for the Python client: > https://github.com/apache/spark/commit/b887d3de954ae5b2482087fe08affcc4ac60c669 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
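For illustration, the connection string could carry the parameter the way the Python client does. The builder call and the exact parameter syntax below are assumptions based on the commit linked above, not a confirmed Scala API:

{code:scala}
import org.apache.spark.sql.SparkSession

// Assumed syntax: connection-string parameters are appended as ;key=value
// pairs, mirroring the Python client; the Scala builder shape may differ.
val spark = SparkSession.builder()
  .remote("sc://localhost:15002/;user_agent=my-partner-app")
  .getOrCreate()
{code}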
[jira] [Commented] (SPARK-42503) Spark SQL should do further validation on join condition fields
[ https://issues.apache.org/jira/browse/SPARK-42503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691321#comment-17691321 ] Yuming Wang commented on SPARK-42503: - Do other databases also have this validation? > Spark SQL should do further validation on join condition fields > --- > > Key: SPARK-42503 > URL: https://issues.apache.org/jira/browse/SPARK-42503 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.2 >Reporter: ming95 >Priority: Major > > In Spark SQL, a join condition is allowed to use fields that come from > neither the left table nor the right table. In this case, the join > degenerates into a cross join. > Suppose you have two tables, test1 and test2, which have the same table > schema: > {code:java} > CREATE TABLE `default`.`test1` ( > `id` INT, > `name` STRING, > `age` INT, > `dt` STRING) > USING parquet > PARTITIONED BY (dt) {code} > The following SQL joins three tables, but in the last left join the > condition is `t1.name=t2.name` and t3.name is not used, so the last left > join becomes a cross join. > {code:java} > select * > from > (select * from test1 where dt="20230215" and age=1 ) t1 > left join > (select * from test1 where dt=="20230215" and age=2) t2 > on t1.name=t2.name > left join > (select * from test2 where dt="20230215") t3 > on > t1.name=t2.name; {code} > So I think Spark SQL should do further validation on the join condition: the > fields of the join condition must come from the left table or the right > table; otherwise an `AnalysisException` should be thrown. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42503) Spark SQL should do further validation on join condition fields
[ https://issues.apache.org/jira/browse/SPARK-42503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-42503: Fix Version/s: (was: 3.4.0) > Spark SQL should do further validation on join condition fields > --- > > Key: SPARK-42503 > URL: https://issues.apache.org/jira/browse/SPARK-42503 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.2 >Reporter: ming95 >Priority: Major > > In Spark SQL, a join condition is allowed to use fields that come from > neither the left table nor the right table. In this case, the join > degenerates into a cross join. > Suppose you have two tables, test1 and test2, which have the same table > schema: > {code:java} > CREATE TABLE `default`.`test1` ( > `id` INT, > `name` STRING, > `age` INT, > `dt` STRING) > USING parquet > PARTITIONED BY (dt) {code} > The following SQL joins three tables, but in the last left join the > condition is `t1.name=t2.name` and t3.name is not used, so the last left > join becomes a cross join. > {code:java} > select * > from > (select * from test1 where dt="20230215" and age=1 ) t1 > left join > (select * from test1 where dt=="20230215" and age=2) t2 > on t1.name=t2.name > left join > (select * from test2 where dt="20230215") t3 > on > t1.name=t2.name; {code} > So I think Spark SQL should do further validation on the join condition: the > fields of the join condition must come from the left table or the right > table; otherwise an `AnalysisException` should be thrown. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42503) Spark SQL should do further validation on join condition fields
[ https://issues.apache.org/jira/browse/SPARK-42503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-42503: Target Version/s: (was: 3.4.0) > Spark SQL should do further validation on join condition fields > --- > > Key: SPARK-42503 > URL: https://issues.apache.org/jira/browse/SPARK-42503 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.2 >Reporter: ming95 >Priority: Major > Fix For: 3.4.0 > > > In Spark SQL, a join condition is allowed to use fields that come from > neither the left table nor the right table. In this case, the join > degenerates into a cross join. > Suppose you have two tables, test1 and test2, which have the same table > schema: > {code:java} > CREATE TABLE `default`.`test1` ( > `id` INT, > `name` STRING, > `age` INT, > `dt` STRING) > USING parquet > PARTITIONED BY (dt) {code} > The following SQL joins three tables, but in the last left join the > condition is `t1.name=t2.name` and t3.name is not used, so the last left > join becomes a cross join. > {code:java} > select * > from > (select * from test1 where dt="20230215" and age=1 ) t1 > left join > (select * from test1 where dt=="20230215" and age=2) t2 > on t1.name=t2.name > left join > (select * from test2 where dt="20230215") t3 > on > t1.name=t2.name; {code} > So I think Spark SQL should do further validation on the join condition: the > fields of the join condition must come from the left table or the right > table; otherwise an `AnalysisException` should be thrown. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41823) DataFrame.join creating ambiguous column names
[ https://issues.apache.org/jira/browse/SPARK-41823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691290#comment-17691290 ] Apache Spark commented on SPARK-41823: -- User 'grundprinzip' has created a pull request for this issue: https://github.com/apache/spark/pull/40094 > DataFrame.join creating ambiguous column names > -- > > Key: SPARK-41823 > URL: https://issues.apache.org/jira/browse/SPARK-41823 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Assignee: Ruifeng Zheng >Priority: Major > > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 254, in pyspark.sql.connect.dataframe.DataFrame.drop > Failed example: > df.join(df2, df.name == df2.name, 'inner').drop('name').show() > Exception raised: > Traceback (most recent call last): > File > "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", > line 1350, in __run > exec(compile(example.source, filename, "single", > File "", line > 1, in > df.join(df2, df.name == df2.name, 'inner').drop('name').show() > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 534, in show > print(self._show_string(n, truncate, vertical)) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 423, in _show_string > ).toPandas() > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 1031, in toPandas > return self._session.client.to_pandas(query) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 413, in to_pandas > return self._execute_and_fetch(req) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 573, in _execute_and_fetch > self._handle_error(rpc_error) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 619, in _handle_error > raise SparkConnectAnalysisException( > pyspark.sql.connect.client.SparkConnectAnalysisException: > [AMBIGUOUS_REFERENCE] Reference `name` is ambiguous, could be: [`name`, > `name`]. > Plan: {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41823) DataFrame.join creating ambiguous column names
[ https://issues.apache.org/jira/browse/SPARK-41823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691288#comment-17691288 ] Apache Spark commented on SPARK-41823: -- User 'grundprinzip' has created a pull request for this issue: https://github.com/apache/spark/pull/40094 > DataFrame.join creating ambiguous column names > -- > > Key: SPARK-41823 > URL: https://issues.apache.org/jira/browse/SPARK-41823 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Assignee: Ruifeng Zheng >Priority: Major > > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 254, in pyspark.sql.connect.dataframe.DataFrame.drop > Failed example: > df.join(df2, df.name == df2.name, 'inner').drop('name').show() > Exception raised: > Traceback (most recent call last): > File > "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", > line 1350, in __run > exec(compile(example.source, filename, "single", > File "", line > 1, in > df.join(df2, df.name == df2.name, 'inner').drop('name').show() > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 534, in show > print(self._show_string(n, truncate, vertical)) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 423, in _show_string > ).toPandas() > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 1031, in toPandas > return self._session.client.to_pandas(query) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 413, in to_pandas > return self._execute_and_fetch(req) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 573, in _execute_and_fetch > self._handle_error(rpc_error) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 619, in _handle_error > raise SparkConnectAnalysisException( > pyspark.sql.connect.client.SparkConnectAnalysisException: > [AMBIGUOUS_REFERENCE] Reference `name` is ambiguous, could be: [`name`, > `name`]. > Plan: {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41823) DataFrame.join creating ambiguous column names
[ https://issues.apache.org/jira/browse/SPARK-41823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691289#comment-17691289 ] Apache Spark commented on SPARK-41823: -- User 'grundprinzip' has created a pull request for this issue: https://github.com/apache/spark/pull/40094 > DataFrame.join creating ambiguous column names > -- > > Key: SPARK-41823 > URL: https://issues.apache.org/jira/browse/SPARK-41823 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Sandeep Singh >Assignee: Ruifeng Zheng >Priority: Major > > {code:java} > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 254, in pyspark.sql.connect.dataframe.DataFrame.drop > Failed example: > df.join(df2, df.name == df2.name, 'inner').drop('name').show() > Exception raised: > Traceback (most recent call last): > File > "/usr/local/Cellar/python@3.10/3.10.8/Frameworks/Python.framework/Versions/3.10/lib/python3.10/doctest.py", > line 1350, in __run > exec(compile(example.source, filename, "single", > File "", line > 1, in > df.join(df2, df.name == df2.name, 'inner').drop('name').show() > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 534, in show > print(self._show_string(n, truncate, vertical)) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 423, in _show_string > ).toPandas() > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/dataframe.py", > line 1031, in toPandas > return self._session.client.to_pandas(query) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 413, in to_pandas > return self._execute_and_fetch(req) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 573, in _execute_and_fetch > self._handle_error(rpc_error) > File > "/Users/s.singh/personal/spark-oss/python/pyspark/sql/connect/client.py", > line 619, in _handle_error > raise SparkConnectAnalysisException( > pyspark.sql.connect.client.SparkConnectAnalysisException: > [AMBIGUOUS_REFERENCE] Reference `name` is ambiguous, could be: [`name`, > `name`]. > Plan: {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41812) DataFrame.join: ambiguous column
[ https://issues.apache.org/jira/browse/SPARK-41812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691287#comment-17691287 ] Apache Spark commented on SPARK-41812: -- User 'grundprinzip' has created a pull request for this issue: https://github.com/apache/spark/pull/40094 > DataFrame.join: ambiguous column > > > Key: SPARK-41812 > URL: https://issues.apache.org/jira/browse/SPARK-41812 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Ruifeng Zheng >Priority: Major > Fix For: 3.4.0 > > > {code} > File "/.../spark/python/pyspark/sql/connect/column.py", line 106, in > pyspark.sql.connect.column.Column.eqNullSafe > Failed example: > df1.join(df2, df1["value"] == df2["value"]).count() > Exception raised: > Traceback (most recent call last): > File "/.../miniconda3/envs/python3.9/lib/python3.9/doctest.py", line > 1336, in __run > exec(compile(example.source, filename, "single", > File "", line > 1, in > df1.join(df2, df1["value"] == df2["value"]).count() > File "/.../spark/python/pyspark/sql/connect/dataframe.py", line 151, in > count > pdd = self.agg(_invoke_function("count", lit(1))).toPandas() > File "/.../spark/python/pyspark/sql/connect/dataframe.py", line 1031, > in toPandas > return self._session.client.to_pandas(query) > File "/.../spark/python/pyspark/sql/connect/client.py", line 413, in > to_pandas > return self._execute_and_fetch(req) > File "/.../spark/python/pyspark/sql/connect/client.py", line 573, in > _execute_and_fetch > self._handle_error(rpc_error) > File "/.../spark/python/pyspark/sql/connect/client.py", line 619, in > _handle_error > raise SparkConnectAnalysisException( > pyspark.sql.connect.client.SparkConnectAnalysisException: > [AMBIGUOUS_REFERENCE] Reference `value` is ambiguous, could be: [`value`, > `value`]. > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
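A minimal sketch of the usual ways to sidestep the AMBIGUOUS_REFERENCE error reported above (hypothetical data; these are standard DataFrame operations, shown here in Scala):

{code:scala}
import org.apache.spark.sql.functions._

val df1 = spark.range(5).withColumnRenamed("id", "value")
val df2 = spark.range(5).withColumnRenamed("id", "value")

// Option 1: a USING-style join on the column name keeps a single,
// unambiguous `value` column in the output.
val joined = df1.join(df2, Seq("value"))

// Option 2: alias each side and qualify references explicitly.
val a = df1.as("a")
val b = df2.as("b")
val joined2 = a.join(b, col("a.value") === col("b.value"))
  .select(col("a.value"))
{code}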
[jira] [Resolved] (SPARK-41952) Upgrade Parquet to fix off-heap memory leaks in Zstd codec
[ https://issues.apache.org/jira/browse/SPARK-41952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-41952. -- Fix Version/s: 3.2.4 3.4.0 3.3.3 Resolution: Fixed > Upgrade Parquet to fix off-heap memory leaks in Zstd codec > -- > > Key: SPARK-41952 > URL: https://issues.apache.org/jira/browse/SPARK-41952 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 3.1.3, 3.3.1, 3.2.3 >Reporter: Alexey Kudinkin >Assignee: Cheng Pan >Priority: Critical > Fix For: 3.2.4, 3.4.0, 3.3.3 > > > Recently, a native memory leak has been discovered in Parquet in conjunction > with its use of the Zstd decompressor from the luben/zstd-jni library (PARQUET-2160). > This is very problematic, to the point where we can't use Parquet with Zstd due to > pervasive OOMs taking down our executors and disrupting our jobs. > Luckily, a fix addressing this has already landed in Parquet: > [https://github.com/apache/parquet-mr/pull/982] > > Now, we just need to ensure that > # an updated version of Parquet is released in a timely manner > # Spark is upgraded onto this new version in the upcoming release > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
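[Editor's note] Until a Spark release with the upgraded Parquet is available, one possible stopgap (an assumption on our part, not a recommendation from the ticket) is to write Parquet with a codec other than Zstd, so the leaky zstd-jni decompression path is not exercised when the data is read back; existing Zstd-compressed files are still affected on read.

{code:python}
# Stopgap sketch: avoid the Zstd codec for Parquet output. 'snappy' is
# Spark's default Parquet codec; it is set explicitly here for clarity.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.parquet.compression.codec", "snappy")

spark.range(1000).write.mode("overwrite").parquet("/tmp/no_zstd_example")
{code}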
[jira] [Assigned] (SPARK-41952) Upgrade Parquet to fix off-heap memory leaks in Zstd codec
[ https://issues.apache.org/jira/browse/SPARK-41952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-41952: Assignee: Cheng Pan > Upgrade Parquet to fix off-heap memory leaks in Zstd codec > -- > > Key: SPARK-41952 > URL: https://issues.apache.org/jira/browse/SPARK-41952 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 3.1.3, 3.3.1, 3.2.3 >Reporter: Alexey Kudinkin >Assignee: Cheng Pan >Priority: Critical > > Recently, a native memory leak has been discovered in Parquet in conjunction > with its use of the Zstd decompressor from the luben/zstd-jni library (PARQUET-2160). > This is very problematic, to the point where we can't use Parquet with Zstd due to > pervasive OOMs taking down our executors and disrupting our jobs. > Luckily, a fix addressing this has already landed in Parquet: > [https://github.com/apache/parquet-mr/pull/982] > > Now, we just need to ensure that > # an updated version of Parquet is released in a timely manner > # Spark is upgraded onto this new version in the upcoming release > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42467) Spark Connect Scala Client: GroupBy and Aggregation
[ https://issues.apache.org/jira/browse/SPARK-42467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691264#comment-17691264 ] Rui Wang commented on SPARK-42467: -- Yes, we are going to need to support cube/rollup/grouping sets along with the other necessary bits in Aggregation. > Spark Connect Scala Client: GroupBy and Aggregation > --- > > Key: SPARK-42467 > URL: https://issues.apache.org/jira/browse/SPARK-42467 > Project: Spark > Issue Type: Task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
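[Editor's note] For reference, this is the API surface being ported; a short sketch of the equivalent calls on the existing PySpark API, which the Scala client is expected to mirror (the data and column names are made up).

{code:python}
# groupBy/rollup/cube aggregations on the classic PySpark API.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("US", "web", 10), ("US", "mobile", 20), ("DE", "web", 5)],
    ["country", "channel", "sales"],
)

df.groupBy("country").agg(F.sum("sales")).show()
# rollup aggregates (country, channel), (country), and the grand total
df.rollup("country", "channel").agg(F.sum("sales")).show()
# cube aggregates every combination of the grouping columns
df.cube("country", "channel").agg(F.sum("sales")).show()
{code}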
[jira] [Updated] (SPARK-42503) Spark SQL should do further validation on join condition fields
[ https://issues.apache.org/jira/browse/SPARK-42503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ming95 updated SPARK-42503: --- Description: In Spark SQL, a join condition is allowed to use fields that come from neither the left nor the right table of the join. In this case, the join will degenerate into a cross join. Suppose you have two tables, test1 and test2, which have the same table schema: {code:java} CREATE TABLE `default`.`test1` ( `id` INT, `name` STRING, `age` INT, `dt` STRING) USING parquet PARTITIONED BY (dt) {code} The following SQL has three joins, but in the last left join the condition is `t1.name=t2.name`, and t3.name is not used. So the last left join will be a cross join. {code:java} select * from (select * from test1 where dt="20230215" and age=1 ) t1 left join (select * from test1 where dt="20230215" and age=2) t2 on t1.name=t2.name left join (select * from test2 where dt="20230215") t3 on t1.name=t2.name; {code} So I think Spark SQL should do further validation on the join condition: the fields used in a join condition must come from the join's left or right table, otherwise an `AnalysisException` should be thrown. was: In Spark SQL, a join condition is allowed to use fields that come from neither the left nor the right table of the join. In this case, the join will degenerate into a cross join. Suppose you have two tables, test1 and test2, which have the same table schema: ``` CREATE TABLE `default`.`test1` ( `id` INT, `name` STRING, `age` INT, `dt` STRING) USING parquet PARTITIONED BY (dt) ``` The following SQL has three joins, but in the last left join the condition is `t1.name=t2.name`, and t3.name is not used. So the last left join will be a cross join. ``` select * from (select * from test1 where dt="20230215" and age=1 ) t1 left join (select * from test1 where dt="20230215" and age=2) t2 on t1.name=t2.name left join (select * from test2 where dt="20230215") t3 on t1.name=t2.name; ``` So I think Spark SQL should do further validation on the join condition: the fields used in a join condition must come from the join's left or right table, otherwise an `AnalysisException` should be thrown. > Spark SQL should do further validation on join condition fields > --- > > Key: SPARK-42503 > URL: https://issues.apache.org/jira/browse/SPARK-42503 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.2 >Reporter: ming95 >Priority: Major > Fix For: 3.4.0 > > > In Spark SQL, a join condition is allowed to use fields that come from > neither the left nor the right table of the join. In this case, the join > will degenerate into a cross join. > Suppose you have two tables, test1 and test2, which have the same table > schema: > {code:java} > CREATE TABLE `default`.`test1` ( > `id` INT, > `name` STRING, > `age` INT, > `dt` STRING) > USING parquet > PARTITIONED BY (dt) > {code} > The following SQL has three joins, but in the last left join the condition > is `t1.name=t2.name`, and t3.name is not used. So the last left join will be > a cross join. > {code:java} > select * > from > (select * from test1 where dt="20230215" and age=1 ) t1 > left join > (select * from test1 where dt="20230215" and age=2) t2 > on t1.name=t2.name > left join > (select * from test2 where dt="20230215") t3 > on > t1.name=t2.name; > {code} > So I think Spark SQL should do further validation on the join condition: the > fields used in a join condition must come from the join's left or right > table, otherwise an `AnalysisException` should be thrown.
> -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
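[Editor's note] Assuming the last join was meant to relate t1 and t3 (our reading of the example; the ticket does not spell out the intended query), the non-degenerate form would reference t3 in the final condition.

{code:python}
# Presumably intended query: the final condition references t3, so the last
# left join no longer degenerates into a cross join. Assumes the ticket's
# test1/test2 tables exist and a SparkSession named `spark` is available.
fixed = spark.sql("""
    select *
    from (select * from test1 where dt = '20230215' and age = 1) t1
    left join (select * from test1 where dt = '20230215' and age = 2) t2
      on t1.name = t2.name
    left join (select * from test2 where dt = '20230215') t3
      on t1.name = t3.name
""")
fixed.explain()  # the plan should no longer contain a cartesian product
{code}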
[jira] [Commented] (SPARK-42503) Spark SQL should do further validation on join condition fields
[ https://issues.apache.org/jira/browse/SPARK-42503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691235#comment-17691235 ] ming95 commented on SPARK-42503: [~yumwang] [~gurwls223] cc > Spark SQL should do further validation on join condition fields > --- > > Key: SPARK-42503 > URL: https://issues.apache.org/jira/browse/SPARK-42503 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.2 >Reporter: ming95 >Priority: Major > Fix For: 3.4.0 > > > In Spark SQL, a join condition is allowed to use fields that come from > neither the left nor the right table of the join. In this case, the join > will degenerate into a cross join. > Suppose you have two tables, test1 and test2, which have the same table > schema: > ``` > CREATE TABLE `default`.`test1` ( > `id` INT, > `name` STRING, > `age` INT, > `dt` STRING) > USING parquet > PARTITIONED BY (dt) > ``` > The following SQL has three joins, but in the last left join the condition > is `t1.name=t2.name`, and t3.name is not used. So the last left join will be > a cross join. > ``` > select * > from > (select * from test1 where dt="20230215" and age=1 ) t1 > left join > (select * from test1 where dt="20230215" and age=2) t2 > on t1.name=t2.name > left join > (select * from test2 where dt="20230215") t3 > on > t1.name=t2.name; > ``` > So I think Spark SQL should do further validation on the join condition: the > fields used in a join condition must come from the join's left or right > table, otherwise an `AnalysisException` should be thrown. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42503) Spark SQL should do further validation on join condition fields
ming95 created SPARK-42503: -- Summary: Spark SQL should do further validation on join condition fields Key: SPARK-42503 URL: https://issues.apache.org/jira/browse/SPARK-42503 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.2 Reporter: ming95 Fix For: 3.4.0 In Spark SQL, a join condition is allowed to use fields that come from neither the left nor the right table of the join. In this case, the join will degenerate into a cross join. Suppose you have two tables, test1 and test2, which have the same table schema: ``` CREATE TABLE `default`.`test1` ( `id` INT, `name` STRING, `age` INT, `dt` STRING) USING parquet PARTITIONED BY (dt) ``` The following SQL has three joins, but in the last left join the condition is `t1.name=t2.name`, and t3.name is not used. So the last left join will be a cross join. ``` select * from (select * from test1 where dt="20230215" and age=1 ) t1 left join (select * from test1 where dt="20230215" and age=2) t2 on t1.name=t2.name left join (select * from test2 where dt="20230215") t3 on t1.name=t2.name; ``` So I think Spark SQL should do further validation on the join condition: the fields used in a join condition must come from the join's left or right table, otherwise an `AnalysisException` should be thrown. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42502) scala: accept user_agent in spark connect's connection string
[ https://issues.apache.org/jira/browse/SPARK-42502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niranjan Jayakar updated SPARK-42502: - Description: Currently, the Spark Connect service's {{client_type}} attribute (which is really the user agent) is set to {{_SPARK_CONNECT_PYTHON}} to signify PySpark. Accept an optional {{user_agent}} parameter in the connection string and plumb this down to the Spark Connect service. This enables partners using Spark Connect to set their application as the user agent, which then allows visibility into and measurement of integrations and usage of Spark Connect. This is already done for the Python client: https://github.com/apache/spark/commit/b887d3de954ae5b2482087fe08affcc4ac60c669 was: Currently, the Spark Connect service's {{client_type}} attribute (which is really the user agent) is set to {{_SPARK_CONNECT_PYTHON}} to signify PySpark. Accept an optional {{user_agent}} parameter in the connection string and plumb this down to the Spark Connect service. This enables partners using Spark Connect to set their application as the user agent, which then allows visibility into and measurement of integrations and usage of Spark Connect. > scala: accept user_agent in spark connect's connection string > - > > Key: SPARK-42502 > URL: https://issues.apache.org/jira/browse/SPARK-42502 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.3.2 >Reporter: Niranjan Jayakar >Assignee: Niranjan Jayakar >Priority: Major > Fix For: 3.4.0 > > > Currently, the Spark Connect service's {{client_type}} attribute (which is > really the user agent) is set to {{_SPARK_CONNECT_PYTHON}} to signify PySpark. > Accept an optional {{user_agent}} parameter in the connection string and > plumb this down to the Spark Connect service. > This enables partners using Spark Connect to set their application as the > user agent, which then allows visibility into and measurement of integrations > and usage of Spark Connect. > This is already done for the Python client: > https://github.com/apache/spark/commit/b887d3de954ae5b2482087fe08affcc4ac60c669 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42502) scala: accept user_agent in spark connect's connection string
Niranjan Jayakar created SPARK-42502: Summary: scala: accept user_agent in spark connect's connection string Key: SPARK-42502 URL: https://issues.apache.org/jira/browse/SPARK-42502 Project: Spark Issue Type: New Feature Components: Connect Affects Versions: 3.3.2 Reporter: Niranjan Jayakar Assignee: Niranjan Jayakar Fix For: 3.4.0 Currently, the Spark Connect service's {{client_type}} attribute (which is really the user agent) is set to {{_SPARK_CONNECT_PYTHON}} to signify PySpark. Accept an optional {{user_agent}} parameter in the connection string and plumb this down to the Spark Connect service. This enables partners using Spark Connect to set their application as the user agent, which then allows visibility into and measurement of integrations and usage of Spark Connect. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42477) python: accept user_agent in spark connect's connection string
[ https://issues.apache.org/jira/browse/SPARK-42477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niranjan Jayakar updated SPARK-42477: - Summary: python: accept user_agent in spark connect's connection string (was: accept user_agent in spark connect's connection string) > python: accept user_agent in spark connect's connection string > --- > > Key: SPARK-42477 > URL: https://issues.apache.org/jira/browse/SPARK-42477 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.3.2 >Reporter: Niranjan Jayakar >Assignee: Niranjan Jayakar >Priority: Major > Fix For: 3.4.0 > > > Currently, the Spark Connect service's {{client_type}} attribute (which is > really the user agent) is set to {{_SPARK_CONNECT_PYTHON}} to signify PySpark. > Accept an optional {{user_agent}} parameter in the connection string and > plumb this down to the Spark Connect service. > This enables partners using Spark Connect to set their application as the > user agent, which then allows visibility into and measurement of integrations > and usage of Spark Connect. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42498) reduce spark connect service retry time
[ https://issues.apache.org/jira/browse/SPARK-42498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niranjan Jayakar resolved SPARK-42498. -- Resolution: Abandoned > reduce spark connect service retry time > --- > > Key: SPARK-42498 > URL: https://issues.apache.org/jira/browse/SPARK-42498 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.3.2 >Reporter: Niranjan Jayakar >Priority: Major > > https://github.com/apache/spark/blob/5fc44dabe5084fb784f064afe691951a3c270793/python/pyspark/sql/connect/client.py#L411 > > Currently, 15 retries with the current backoff strategy result in the client > sitting in the retry loop for ~400 seconds in the worst case. This means > applications and users using the Spark Connect client will hang for >6 > minutes with no response. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
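[Editor's note] To make the ~400-second figure concrete, here is a tiny sketch of how capped exponential backoff accumulates over 15 retries; the constants are illustrative stand-ins, not the client's actual configuration.

{code:python}
# Worst-case cumulative wait for capped exponential backoff (illustrative
# parameters only; the real client defines its own policy).
def worst_case_wait(retries: int, first_backoff_s: float,
                    multiplier: float, max_backoff_s: float) -> float:
    total, backoff = 0.0, first_backoff_s
    for _ in range(retries):
        total += min(backoff, max_backoff_s)
        backoff *= multiplier
    return total

# 15 retries, 50 ms initial backoff, 4x growth, 60 s cap per attempt:
print(worst_case_wait(15, 0.05, 4.0, 60.0))  # hundreds of seconds
{code}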
[jira] [Updated] (SPARK-42498) reduce spark connect service retry time
[ https://issues.apache.org/jira/browse/SPARK-42498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niranjan Jayakar updated SPARK-42498: - Summary: reduce spark connect service retry time (was: make spark connect retries configurable) > reduce spark connect service retry time > --- > > Key: SPARK-42498 > URL: https://issues.apache.org/jira/browse/SPARK-42498 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.3.2 >Reporter: Niranjan Jayakar >Priority: Major > > https://github.com/apache/spark/blob/5fc44dabe5084fb784f064afe691951a3c270793/python/pyspark/sql/connect/client.py#L411 > > Currently, 15 retries with the current backoff strategy result in the client > sitting in the retry loop for ~400 seconds in the worst case. This means > applications and users using the Spark Connect client will hang for >6 > minutes with no response. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42498) make spark connect retries configurable
[ https://issues.apache.org/jira/browse/SPARK-42498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niranjan Jayakar updated SPARK-42498: - Summary: make spark connect retries configurable (was: reduce spark connect service retry time) > make spark connect retries configurable > - > > Key: SPARK-42498 > URL: https://issues.apache.org/jira/browse/SPARK-42498 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.3.2 >Reporter: Niranjan Jayakar >Priority: Major > > https://github.com/apache/spark/blob/5fc44dabe5084fb784f064afe691951a3c270793/python/pyspark/sql/connect/client.py#L411 > > Currently, 15 retries with the current backoff strategy result in the client > sitting in the retry loop for ~400 seconds in the worst case. This means > applications and users using the Spark Connect client will hang for >6 > minutes with no response. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42423) Add metadata column file block start and length
[ https://issues.apache.org/jira/browse/SPARK-42423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-42423: --- Assignee: XiDuo You > Add metadata column file block start and length > --- > > Key: SPARK-42423 > URL: https://issues.apache.org/jira/browse/SPARK-42423 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42423) Add metadata column file block start and length
[ https://issues.apache.org/jira/browse/SPARK-42423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-42423. - Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39996 [https://github.com/apache/spark/pull/39996] > Add metadata column file block start and length > --- > > Key: SPARK-42423 > URL: https://issues.apache.org/jira/browse/SPARK-42423 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: XiDuo You >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
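[Editor's note] For context, these fields extend the hidden {{_metadata}} struct that file-based sources already expose; a usage sketch (field names inferred from the issue title, path is a placeholder).

{code:python}
# Selecting per-file metadata, including the new block start/length fields.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet("/tmp/some_table")  # hypothetical path

df.select(
    "_metadata.file_path",
    "_metadata.file_block_start",   # added by SPARK-42423
    "_metadata.file_block_length",  # added by SPARK-42423
).show(truncate=False)
{code}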
[jira] [Assigned] (SPARK-42476) Spark Connect API reference.
[ https://issues.apache.org/jira/browse/SPARK-42476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-42476: Assignee: Haejoon Lee > Spark Connect API reference. > > > Key: SPARK-42476 > URL: https://issues.apache.org/jira/browse/SPARK-42476 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > > We need an API documents for Spark Connect such as other components. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42476) Spark Connect API reference.
[ https://issues.apache.org/jira/browse/SPARK-42476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-42476. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 40067 [https://github.com/apache/spark/pull/40067] > Spark Connect API reference. > > > Key: SPARK-42476 > URL: https://issues.apache.org/jira/browse/SPARK-42476 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > Fix For: 3.4.0 > > > We need an API documents for Spark Connect such as other components. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42490) Upgrade protobuf-java to 3.22.0
[ https://issues.apache.org/jira/browse/SPARK-42490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-42490: Assignee: Yang Jie > Upgrade protobuf-java to 3.22.0 > --- > > Key: SPARK-42490 > URL: https://issues.apache.org/jira/browse/SPARK-42490 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > > https://github.com/protocolbuffers/protobuf/releases/tag/v22.0 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42490) Upgrade protobuf-java to 3.22.0
[ https://issues.apache.org/jira/browse/SPARK-42490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-42490. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40084 [https://github.com/apache/spark/pull/40084] > Upgrade protobuf-java to 3.22.0 > --- > > Key: SPARK-42490 > URL: https://issues.apache.org/jira/browse/SPARK-42490 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Fix For: 3.5.0 > > > https://github.com/protocolbuffers/protobuf/releases/tag/v22.0 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42490) Upgrade protobuf-java to 3.22.0
[ https://issues.apache.org/jira/browse/SPARK-42490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-42490: - Priority: Minor (was: Major) > Upgrade protobuf-java to 3.22.0 > --- > > Key: SPARK-42490 > URL: https://issues.apache.org/jira/browse/SPARK-42490 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.5.0 > > > https://github.com/protocolbuffers/protobuf/releases/tag/v22.0 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42489) Upgrade scala-parser-combinators from 2.1.1 to 2.2.0
[ https://issues.apache.org/jira/browse/SPARK-42489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-42489. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 40083 [https://github.com/apache/spark/pull/40083] > Upgrade scala-parser-combinators from 2.1.1 to 2.2.0 > > > Key: SPARK-42489 > URL: https://issues.apache.org/jira/browse/SPARK-42489 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.5.0 > > > https://github.com/scala/scala-parser-combinators/releases -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42489) Upgrade scala-parser-combinators from 2.1.1 to 2.2.0
[ https://issues.apache.org/jira/browse/SPARK-42489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-42489: Assignee: Yang Jie > Upgrade scala-parser-combinators from 2.1.1 to 2.2.0 > > > Key: SPARK-42489 > URL: https://issues.apache.org/jira/browse/SPARK-42489 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > > https://github.com/scala/scala-parser-combinators/releases -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42477) accept user_agent in spark connect's connection string
[ https://issues.apache.org/jira/browse/SPARK-42477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-42477. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 40054 [https://github.com/apache/spark/pull/40054] > accept user_agent in spark connect's connection string > --- > > Key: SPARK-42477 > URL: https://issues.apache.org/jira/browse/SPARK-42477 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.3.2 >Reporter: Niranjan Jayakar >Assignee: Niranjan Jayakar >Priority: Major > Fix For: 3.4.0 > > > Currently, the Spark Connect service's {{client_type}} attribute (which is > really the user agent) is set to {{_SPARK_CONNECT_PYTHON}} to signify PySpark. > Accept an optional {{user_agent}} parameter in the connection string and > plumb this down to the Spark Connect service. > This enables partners using Spark Connect to set their application as the > user agent, which then allows visibility into and measurement of integrations > and usage of Spark Connect. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
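[Editor's note] As a usage sketch, the parameter rides on the standard {{sc://}} connection string after a semicolon (assuming the 3.4 Python client); the host, port, and agent name below are placeholders.

{code:python}
# Connecting with a custom user agent via the Spark Connect connection string.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .remote("sc://localhost:15002/;user_agent=my_partner_app")
    .getOrCreate()
)
spark.range(5).show()
{code}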
[jira] [Assigned] (SPARK-42477) accept user_agent in spark connect's connection string
[ https://issues.apache.org/jira/browse/SPARK-42477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-42477: Assignee: Niranjan Jayakar > accept user_agent in spark connect's connection string > --- > > Key: SPARK-42477 > URL: https://issues.apache.org/jira/browse/SPARK-42477 > Project: Spark > Issue Type: New Feature > Components: Connect >Affects Versions: 3.3.2 >Reporter: Niranjan Jayakar >Assignee: Niranjan Jayakar >Priority: Major > > Currently, the Spark Connect service's {{client_type}} attribute (which is > really the user agent) is set to {{_SPARK_CONNECT_PYTHON}} to signify PySpark. > Accept an optional {{user_agent}} parameter in the connection string and > plumb this down to the Spark Connect service. > This enables partners using Spark Connect to set their application as the > user agent, which then allows visibility into and measurement of integrations > and usage of Spark Connect. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-42501) High level design doc for Spark ML
[ https://issues.apache.org/jira/browse/SPARK-42501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691169#comment-17691169 ] Weichen Xu edited comment on SPARK-42501 at 2/20/23 1:25 PM: - The doc is not ready yet. :) was (Author: weichenxu123): CC [~mengxr] [~grundprinzip-db] [~podongfeng] [~srowen] Thanks! > High level design doc for Spark ML > -- > > Key: SPARK-42501 > URL: https://issues.apache.org/jira/browse/SPARK-42501 > Project: Spark > Issue Type: Sub-task > Components: Connect, Documentation, ML >Affects Versions: 3.4.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42501) High level design doc for Spark ML
[ https://issues.apache.org/jira/browse/SPARK-42501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu updated SPARK-42501: --- Description: (was: Please find the HLD doc for spark ML via spark connect [here|https://docs.google.com/document/d/16_l3wXwbyPl6VwA0zOSdlrEymWVQIp5wfT9-MwKrmQU/edit?usp=sharing]. ) > High level design doc for Spark ML > -- > > Key: SPARK-42501 > URL: https://issues.apache.org/jira/browse/SPARK-42501 > Project: Spark > Issue Type: Sub-task > Components: Connect, Documentation, ML >Affects Versions: 3.4.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42501) High level design doc for Spark ML
[ https://issues.apache.org/jira/browse/SPARK-42501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691169#comment-17691169 ] Weichen Xu commented on SPARK-42501: CC [~mengxr] [~grundprinzip-db] [~podongfeng] [~srowen] Thanks! > High level design doc for Spark ML > -- > > Key: SPARK-42501 > URL: https://issues.apache.org/jira/browse/SPARK-42501 > Project: Spark > Issue Type: Sub-task > Components: Connect, Documentation, ML >Affects Versions: 3.4.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Major > > Please find the HLD doc for spark ML via spark connect > [here|https://docs.google.com/document/d/16_l3wXwbyPl6VwA0zOSdlrEymWVQIp5wfT9-MwKrmQU/edit?usp=sharing]. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42501) High level design doc for Spark ML
[ https://issues.apache.org/jira/browse/SPARK-42501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu reassigned SPARK-42501: -- Assignee: Weichen Xu > High level design doc for Spark ML > -- > > Key: SPARK-42501 > URL: https://issues.apache.org/jira/browse/SPARK-42501 > Project: Spark > Issue Type: Sub-task > Components: Connect, Documentation, ML >Affects Versions: 3.4.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Major > > Please find the HLD doc for spark ML via spark connect > [here|https://docs.google.com/document/d/16_l3wXwbyPl6VwA0zOSdlrEymWVQIp5wfT9-MwKrmQU/edit?usp=sharing]. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42501) High level design doc for Spark ML
[ https://issues.apache.org/jira/browse/SPARK-42501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu updated SPARK-42501: --- Description: Please find the HLD doc for spark ML via spark connect [here|https://docs.google.com/document/d/16_l3wXwbyPl6VwA0zOSdlrEymWVQIp5wfT9-MwKrmQU/edit?usp=sharing]. > High level design doc for Spark ML > -- > > Key: SPARK-42501 > URL: https://issues.apache.org/jira/browse/SPARK-42501 > Project: Spark > Issue Type: Sub-task > Components: Connect, Documentation, ML >Affects Versions: 3.4.0 >Reporter: Weichen Xu >Priority: Major > > Please find the HLD doc for spark ML via spark connect > [here|https://docs.google.com/document/d/16_l3wXwbyPl6VwA0zOSdlrEymWVQIp5wfT9-MwKrmQU/edit?usp=sharing]. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42501) High level design doc for Spark ML
[ https://issues.apache.org/jira/browse/SPARK-42501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Weichen Xu updated SPARK-42501: --- Component/s: Connect Documentation > High level design doc for Spark ML > -- > > Key: SPARK-42501 > URL: https://issues.apache.org/jira/browse/SPARK-42501 > Project: Spark > Issue Type: Sub-task > Components: Connect, Documentation, ML >Affects Versions: 3.4.0 >Reporter: Weichen Xu >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42501) High level design doc for Spark ML
Weichen Xu created SPARK-42501: -- Summary: High level design doc for Spark ML Key: SPARK-42501 URL: https://issues.apache.org/jira/browse/SPARK-42501 Project: Spark Issue Type: Sub-task Components: ML Affects Versions: 3.4.0 Reporter: Weichen Xu -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41741) [SQL] ParquetFilters StringStartsWith push down matching string does not use UTF-8
[ https://issues.apache.org/jira/browse/SPARK-41741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-41741. - Resolution: Fixed > [SQL] ParquetFilters StringStartsWith push down matching string does not use > UTF-8 > > > Key: SPARK-41741 > URL: https://issues.apache.org/jira/browse/SPARK-41741 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Jiale He >Assignee: Yuming Wang >Priority: Major > Fix For: 3.4.0, 3.3.3 > > Attachments: image-2022-12-28-18-00-00-861.png, > image-2022-12-28-18-00-21-586.png, image-2023-01-09-11-10-31-262.png, > image-2023-01-09-18-27-53-479.png, > part-0-30432312-7cdb-43ef-befe-93bcfd174878-c000.snappy.parquet > > > Hello ~ > > I found a problem with ParquetFilters pushdown. > > When the Parquet filter is pushed down and the query uses a like '***%' > predicate, an error may occur if the system default encoding is not UTF-8. > > There are two ways to bypass this problem as far as I know: > 1. spark.executor.extraJavaOptions="-Dfile.encoding=UTF-8" > 2. spark.sql.parquet.filterPushdown.string.startsWith=false > > The following is the information to reproduce this problem. > The parquet sample file is in the attachment. > {code:java} > spark.read.parquet("file:///home/kylin/hjldir/part-0-30432312-7cdb-43ef-befe-93bcfd174878-c000.snappy.parquet").createTempView("tmp") > spark.sql("select * from tmp where `1` like '啦啦乐乐%'").show(false) {code} > !image-2022-12-28-18-00-00-861.png|width=879,height=430! > > !image-2022-12-28-18-00-21-586.png|width=799,height=731! > > I think the correct code should be: > {code:java} > private val strToBinary = > Binary.fromReusedByteArray(v.getBytes(StandardCharsets.UTF_8)) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
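[Editor's note] The mechanism is easy to demonstrate: Parquet stores strings as UTF-8, so a pushed-down prefix encoded with the platform default charset only matches when that default happens to be UTF-8. A plain-Python illustration of the byte-level mismatch (not Spark code):

{code:python}
# The same prefix yields different bytes under different charsets; comparing
# non-UTF-8 bytes against UTF-8-encoded Parquet data breaks startsWith pushdown.
prefix = "啦啦乐乐"

utf8_bytes = prefix.encode("utf-8")  # what Parquet actually stores
gbk_bytes = prefix.encode("gbk")     # what a GBK default charset would yield

print(utf8_bytes.hex())         # bytes that match the stored data
print(gbk_bytes.hex())          # different bytes -> prefix never matches
print(utf8_bytes == gbk_bytes)  # False
{code}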
[jira] [Assigned] (SPARK-41741) [SQL] ParquetFilters StringStartsWith push down matching string does not use UTF-8
[ https://issues.apache.org/jira/browse/SPARK-41741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang reassigned SPARK-41741: --- Fix Version/s: 3.4.0 3.3.3 Assignee: Yuming Wang > [SQL] ParquetFilters StringStartsWith push down matching string does not use > UTF-8 > > > Key: SPARK-41741 > URL: https://issues.apache.org/jira/browse/SPARK-41741 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Jiale He >Assignee: Yuming Wang >Priority: Major > Fix For: 3.4.0, 3.3.3 > > Attachments: image-2022-12-28-18-00-00-861.png, > image-2022-12-28-18-00-21-586.png, image-2023-01-09-11-10-31-262.png, > image-2023-01-09-18-27-53-479.png, > part-0-30432312-7cdb-43ef-befe-93bcfd174878-c000.snappy.parquet > > > Hello ~ > > I found a problem with ParquetFilters pushdown. > > When the Parquet filter is pushed down and the query uses a like '***%' > predicate, an error may occur if the system default encoding is not UTF-8. > > There are two ways to bypass this problem as far as I know: > 1. spark.executor.extraJavaOptions="-Dfile.encoding=UTF-8" > 2. spark.sql.parquet.filterPushdown.string.startsWith=false > > The following is the information to reproduce this problem. > The parquet sample file is in the attachment. > {code:java} > spark.read.parquet("file:///home/kylin/hjldir/part-0-30432312-7cdb-43ef-befe-93bcfd174878-c000.snappy.parquet").createTempView("tmp") > spark.sql("select * from tmp where `1` like '啦啦乐乐%'").show(false) {code} > !image-2022-12-28-18-00-00-861.png|width=879,height=430! > > !image-2022-12-28-18-00-21-586.png|width=799,height=731! > > I think the correct code should be: > {code:java} > private val strToBinary = > Binary.fromReusedByteArray(v.getBytes(StandardCharsets.UTF_8)) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42500) ConstantPropagation support more cases
[ https://issues.apache.org/jira/browse/SPARK-42500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691136#comment-17691136 ] Apache Spark commented on SPARK-42500: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/40093 > ConstantPropagation support more cases > -- > > Key: SPARK-42500 > URL: https://issues.apache.org/jira/browse/SPARK-42500 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42500) ConstantPropagation support more cases
[ https://issues.apache.org/jira/browse/SPARK-42500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42500: Assignee: Apache Spark > ConstantPropagation support more cases > -- > > Key: SPARK-42500 > URL: https://issues.apache.org/jira/browse/SPARK-42500 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42500) ConstantPropagation support more cases
[ https://issues.apache.org/jira/browse/SPARK-42500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42500: Assignee: (was: Apache Spark) > ConstantPropagation support more cases > -- > > Key: SPARK-42500 > URL: https://issues.apache.org/jira/browse/SPARK-42500 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42500) ConstantPropagation support more cases
[ https://issues.apache.org/jira/browse/SPARK-42500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691135#comment-17691135 ] Apache Spark commented on SPARK-42500: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/40093 > ConstantPropagation support more cases > -- > > Key: SPARK-42500 > URL: https://issues.apache.org/jira/browse/SPARK-42500 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42500) ConstantPropagation support more cases
Yuming Wang created SPARK-42500: --- Summary: ConstantPropagation support more cases Key: SPARK-42500 URL: https://issues.apache.org/jira/browse/SPARK-42500 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.5.0 Reporter: Yuming Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
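[Editor's note] For readers unfamiliar with the rule, ConstantPropagation substitutes attributes pinned by equality predicates into sibling predicates; a small illustration of the general idea (the ticket does not describe which new cases are covered).

{code:python}
# With `a = 1` known, the optimizer can rewrite `b = a + 1` to `b = 2`.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.range(10).selectExpr("id AS a", "id * 2 AS b").createOrReplaceTempView("t")

q = spark.sql("SELECT * FROM t WHERE a = 1 AND b = a + 1")
q.explain(True)  # the optimized plan filters on (a = 1) AND (b = 2)
{code}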
[jira] [Updated] (SPARK-42486) Upgrade ZooKeeper from 3.6.3 to 3.6.4
[ https://issues.apache.org/jira/browse/SPARK-42486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bjørn Jørgensen updated SPARK-42486: Affects Version/s: 3.4.0 > Upgrade ZooKeeper from 3.6.3 to 3.6.4 > - > > Key: SPARK-42486 > URL: https://issues.apache.org/jira/browse/SPARK-42486 > Project: Spark > Issue Type: Dependency upgrade > Components: Build >Affects Versions: 3.4.0, 3.5.0 >Reporter: Bjørn Jørgensen >Priority: Major > > [ZooKeeper 3.6 is EoL since 30th December, > 2022|https://zookeeper.apache.org/releases.html] > [Release notes|https://zookeeper.apache.org/doc/r3.6.4/releasenotes.html] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42499) Support for Runtime SQL configuration
Ruifeng Zheng created SPARK-42499: - Summary: Support for Runtime SQL configuration Key: SPARK-42499 URL: https://issues.apache.org/jira/browse/SPARK-42499 Project: Spark Issue Type: Umbrella Components: Connect, PySpark Affects Versions: 3.4.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
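[Editor's note] The surface being targeted is presumably the existing {{spark.conf}} runtime API; a sketch of the behavior on a classic session that the Connect client would need to replicate.

{code:python}
# Getting and setting runtime SQL configurations on a session.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.conf.set("spark.sql.shuffle.partitions", "64")
print(spark.conf.get("spark.sql.shuffle.partitions"))  # '64'
# Static/core configs cannot be changed at runtime and raise AnalysisException.
{code}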
[jira] [Resolved] (SPARK-41959) Improve v1 writes with empty2null
[ https://issues.apache.org/jira/browse/SPARK-41959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-41959. - Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39475 [https://github.com/apache/spark/pull/39475] > Improve v1 writes with empty2null > - > > Key: SPARK-41959 > URL: https://issues.apache.org/jira/browse/SPARK-41959 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: XiDuo You >Priority: Major > Labels: correctness > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41324) Follow-up on JDK-8180450
[ https://issues.apache.org/jira/browse/SPARK-41324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691072#comment-17691072 ] Yang Jie commented on SPARK-41324: -- From the affected versions of [JDK-8180450|https://bugs.openjdk.org/browse/JDK-8180450], if Java 8 is used, will it still be affected? > Follow-up on JDK-8180450 > > > Key: SPARK-41324 > URL: https://issues.apache.org/jira/browse/SPARK-41324 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.2, 3.3.1 >Reporter: Herman van Hövell >Priority: Major > > Per [https://twitter.com/forked_franz/status/1597468851968831489] > We should follow up on: [https://bugs.openjdk.org/browse/JDK-8180450] > There are three concrete tasks here: > # Upgrade to Netty 4.1.84. > # (Optional) Write a benchmark that exercises this code path. Anchoring this > in the build will be a bit of a challenge though. > # Check if there are other places where this bug manifests itself. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42398) refine default column value framework
[ https://issues.apache.org/jira/browse/SPARK-42398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-42398. - Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 40049 [https://github.com/apache/spark/pull/40049] > refine default column value framework > - > > Key: SPARK-42398 > URL: https://issues.apache.org/jira/browse/SPARK-42398 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42398) refine default column value framework
[ https://issues.apache.org/jira/browse/SPARK-42398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-42398: --- Assignee: Wenchen Fan > refine default column value framework > - > > Key: SPARK-42398 > URL: https://issues.apache.org/jira/browse/SPARK-42398 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42498) reduce spark connect service retry time
[ https://issues.apache.org/jira/browse/SPARK-42498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17691056#comment-17691056 ] Apache Spark commented on SPARK-42498: -- User 'nija-at' has created a pull request for this issue: https://github.com/apache/spark/pull/40066 > reduce spark connect service retry time > --- > > Key: SPARK-42498 > URL: https://issues.apache.org/jira/browse/SPARK-42498 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.3.2 >Reporter: Niranjan Jayakar >Priority: Major > > https://github.com/apache/spark/blob/5fc44dabe5084fb784f064afe691951a3c270793/python/pyspark/sql/connect/client.py#L411 > > Currently, 15 retries with the current backoff strategy result in the client > sitting in the retry loop for ~400 seconds in the worst case. This means > applications and users using the Spark Connect client will hang for >6 > minutes with no response. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42498) reduce spark connect service retry time
[ https://issues.apache.org/jira/browse/SPARK-42498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42498: Assignee: Apache Spark > reduce spark connect service retry time > --- > > Key: SPARK-42498 > URL: https://issues.apache.org/jira/browse/SPARK-42498 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.3.2 >Reporter: Niranjan Jayakar >Assignee: Apache Spark >Priority: Major > > https://github.com/apache/spark/blob/5fc44dabe5084fb784f064afe691951a3c270793/python/pyspark/sql/connect/client.py#L411 > > Currently, 15 retries with the current backoff strategy result in the client > sitting in the retry loop for ~400 seconds in the worst case. This means > applications and users using the Spark Connect client will hang for >6 > minutes with no response. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42498) reduce spark connect service retry time
[ https://issues.apache.org/jira/browse/SPARK-42498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42498: Assignee: (was: Apache Spark) > reduce spark connect service retry time > --- > > Key: SPARK-42498 > URL: https://issues.apache.org/jira/browse/SPARK-42498 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.3.2 >Reporter: Niranjan Jayakar >Priority: Major > > https://github.com/apache/spark/blob/5fc44dabe5084fb784f064afe691951a3c270793/python/pyspark/sql/connect/client.py#L411 > > Currently, 15 retries with the current backoff strategy result in the client > sitting in the retry loop for ~400 seconds in the worst case. This means > applications and users using the Spark Connect client will hang for >6 > minutes with no response. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org