[jira] [Updated] (SPARK-37471) spark-sql support nested bracketed comment
[ https://issues.apache.org/jira/browse/SPARK-37471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

angerszhu updated SPARK-37471:
------------------------------
Description:
{code}
/*
select
select /* BROADCAST(b) */ 4\\;
*/
select 1
;
{code}
failed in spark-sql -f
{code}
/usr/share/spark-3.2/bin/spark-sql --verbose -f test.sql
{code}
{code}
Spark master: yarn, Application Id: application_1632999510150_6968442
/* sielect /* BROADCAST(b) */ 4
Error in query:
mismatched input '4' expecting {'(', 'ADD', 'ALTER', 'ANALYZE', 'CACHE', 'CLEAR', 'COMMENT', 'COMMIT', 'CREATE', 'DELETE', 'DESC', 'DESCRIBE', 'DFS', 'DROP', 'EXPLAIN', 'EXPORT', 'FROM', 'GRANT', 'IMPORT', 'INSERT', 'LIST', 'LOAD', 'LOCK', 'MAP', 'MERGE', 'MSCK', 'REDUCE', 'REFRESH', 'REPLACE', 'RESET', 'REVOKE', 'ROLLBACK', 'SELECT', 'SET', 'SHOW', 'START', 'TABLE', 'TRUNCATE', 'UNCACHE', 'UNLOCK', 'UPDATE', 'USE', 'VALUES', 'WITH'}(line 1, pos 30)

== SQL ==
/* sielect /* BROADCAST(b) */ 4
------------------------------^^^
{code}

was:
{code}
/*
select
select /* BROADCAST(b) */ 4\\;
*/
select 1
;
{code}
failed in spark-sql
{code}
Spark master: yarn, Application Id: application_1632999510150_6968442
/* sielect /* BROADCAST(b) */ 4
Error in query:
mismatched input '4' expecting {'(', 'ADD', 'ALTER', 'ANALYZE', 'CACHE', 'CLEAR', 'COMMENT', 'COMMIT', 'CREATE', 'DELETE', 'DESC', 'DESCRIBE', 'DFS', 'DROP', 'EXPLAIN', 'EXPORT', 'FROM', 'GRANT', 'IMPORT', 'INSERT', 'LIST', 'LOAD', 'LOCK', 'MAP', 'MERGE', 'MSCK', 'REDUCE', 'REFRESH', 'REPLACE', 'RESET', 'REVOKE', 'ROLLBACK', 'SELECT', 'SET', 'SHOW', 'START', 'TABLE', 'TRUNCATE', 'UNCACHE', 'UNLOCK', 'UPDATE', 'USE', 'VALUES', 'WITH'}(line 1, pos 30)

== SQL ==
/* sielect /* BROADCAST(b) */ 4
------------------------------^^^
{code}
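The parse failure above suggests the CLI splits its input on ';' while recognizing only one level of bracketed comments, so the inner {{*/}} closes the whole comment. A minimal sketch of nesting-aware statement splitting (a hypothetical helper, not the actual SparkSQLCLIDriver code; string literals and {{--}} line comments are ignored for brevity):

{code:scala}
// Track /* ... */ nesting depth so that ';' inside a (possibly nested)
// bracketed comment is not treated as a statement separator.
def splitStatements(script: String): Seq[String] = {
  val out = scala.collection.mutable.ArrayBuffer.empty[String]
  val cur = new StringBuilder
  var depth = 0 // current bracketed-comment nesting depth
  var i = 0
  while (i < script.length) {
    if (script.startsWith("/*", i)) { depth += 1; cur ++= "/*"; i += 2 }
    else if (script.startsWith("*/", i) && depth > 0) { depth -= 1; cur ++= "*/"; i += 2 }
    else if (script.charAt(i) == ';' && depth == 0) { out += cur.toString; cur.clear(); i += 1 }
    else { cur += script.charAt(i); i += 1 }
  }
  if (cur.nonEmpty) out += cur.toString
  out.toSeq.map(_.trim).filter(_.nonEmpty)
}
{code}

With depth tracking, the whole {{/* select ... /* BROADCAST(b) */ ... */}} block stays inside one comment and {{select 1}} is the only statement executed.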
[jira] [Updated] (SPARK-37471) spark-sql support nested bracketed comment
[ https://issues.apache.org/jira/browse/SPARK-37471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

angerszhu updated SPARK-37471:
------------------------------
Description:
{code}
/*
select
select /* BROADCAST(b) */ 4\\;
*/
select 1
;
{code}
failed in spark-sql
{code}
Spark master: yarn, Application Id: application_1632999510150_6968442
/* sielect /* BROADCAST(b) */ 4
Error in query:
mismatched input '4' expecting {'(', 'ADD', 'ALTER', 'ANALYZE', 'CACHE', 'CLEAR', 'COMMENT', 'COMMIT', 'CREATE', 'DELETE', 'DESC', 'DESCRIBE', 'DFS', 'DROP', 'EXPLAIN', 'EXPORT', 'FROM', 'GRANT', 'IMPORT', 'INSERT', 'LIST', 'LOAD', 'LOCK', 'MAP', 'MERGE', 'MSCK', 'REDUCE', 'REFRESH', 'REPLACE', 'RESET', 'REVOKE', 'ROLLBACK', 'SELECT', 'SET', 'SHOW', 'START', 'TABLE', 'TRUNCATE', 'UNCACHE', 'UNLOCK', 'UPDATE', 'USE', 'VALUES', 'WITH'}(line 1, pos 30)

== SQL ==
/* sielect /* BROADCAST(b) */ 4
------------------------------^^^
{code}

was:
{code}
/*
select
select /* BROADCAST(b) */ 4\\;
*/
select 1
;
{code}
failed in spark-sql
{code}
[info] 2021-11-25 23:30:18.727 - stderr>
[info] 2021-11-25 23:30:18.757 - stderr> Setting default log level to "WARN".
[info] 2021-11-25 23:30:18.758 - stderr> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
[info] 2021-11-25 23:30:28.226 - stderr> Spark master: local, Application Id: local-1637911820513
[info] 2021-11-25 23:30:29.506 - stdout> spark-sql>
[info] 2021-11-25 23:30:29.535 - stdout>          > /*
[info] 2021-11-25 23:30:29.566 - stdout>          > SELECT /* BROADCAST(b) */ 4;
[info] 2021-11-25 23:30:29.592 - stdout> /*
[info] 2021-11-25 23:30:29.592 - stdout> SELECT /* BROADCAST(b) */ 4
[info] 2021-11-25 23:30:30.239 - stderr> Error in query:
[info] 2021-11-25 23:30:30.239 - stderr> Unclosed bracketed comment(line 1, pos 0)
[info] 2021-11-25 23:30:30.239 - stderr>
[info] 2021-11-25 23:30:30.24 - stderr> == SQL ==
[info] 2021-11-25 23:30:30.24 - stderr> /*
[info] 2021-11-25 23:30:30.24 - stderr> ^^^
[info] 2021-11-25 23:30:30.24 - stderr> SELECT /* BROADCAST(b) */ 4
[info] 2021-11-25 23:30:30.24 - stderr>
[info] 2021-11-25 23:30:30.28 - stdout> spark-sql> */
[info] 2021-11-25 23:30:30.308 - stdout>          > SELECT 1
[info] 2021-11-25 23:30:30.336 - stdout>          > ;
[info] 2021-11-25 23:30:30.337 - stdout> */
[info] 2021-11-25 23:30:30.337 - stdout> SELECT 1
[info] 2021-11-25 23:30:30.337 - stdout>
[info] 2021-11-25 23:30:30.339 - stderr> Error in query:
[info] 2021-11-25 23:30:30.339 - stderr> extraneous input '*/' expecting {'(', 'ADD', 'ALTER', 'ANALYZE', 'CACHE', 'CLEAR', 'COMMENT', 'COMMIT', 'CREATE', 'DELETE', 'DESC', 'DESCRIBE', 'DFS', 'DROP', 'EXPLAIN', 'EXPORT', 'FROM', 'GRANT', 'IMPORT', 'INSERT', 'LIST', 'LOAD', 'LOCK', 'MAP', 'MERGE', 'MSCK', 'REDUCE', 'REFRESH', 'REPLACE', 'RESET', 'REVOKE', 'ROLLBACK', 'SELECT', 'SET', 'SHOW', 'START', 'TABLE', 'TRUNCATE', 'UNCACHE', 'UNLOCK', 'UPDATE', 'USE', 'VALUES', 'WITH'}(line 1, pos 0)
[info] 2021-11-25 23:30:30.339 - stderr>
[info] 2021-11-25 23:30:30.339 - stderr> == SQL ==
[info] 2021-11-25 23:30:30.339 - stderr> */
[info] 2021-11-25 23:30:30.339 - stderr> ^^^
[info] 2021-11-25 23:30:30.339 - stderr> SELECT 1
[info] 2021-11-25 23:30:30.339 - stderr>
[info] 2021-11-25 23:30:30.368 - stdout> spark-sql>
[info] 2021-11-25 23:30:30.605 - stdout>
{code}
[jira] [Updated] (SPARK-37471) spark-sql support nested bracketed comment
[ https://issues.apache.org/jira/browse/SPARK-37471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

angerszhu updated SPARK-37471:
------------------------------
Description:
{code}
/*
select
select /* BROADCAST(b) */ 4\\;
*/
select 1
;
{code}
failed in spark-sql
{code}
[info] 2021-11-25 23:30:18.727 - stderr>
[info] 2021-11-25 23:30:18.757 - stderr> Setting default log level to "WARN".
[info] 2021-11-25 23:30:18.758 - stderr> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
[info] 2021-11-25 23:30:28.226 - stderr> Spark master: local, Application Id: local-1637911820513
[info] 2021-11-25 23:30:29.506 - stdout> spark-sql>
[info] 2021-11-25 23:30:29.535 - stdout>          > /*
[info] 2021-11-25 23:30:29.566 - stdout>          > SELECT /* BROADCAST(b) */ 4;
[info] 2021-11-25 23:30:29.592 - stdout> /*
[info] 2021-11-25 23:30:29.592 - stdout> SELECT /* BROADCAST(b) */ 4
[info] 2021-11-25 23:30:30.239 - stderr> Error in query:
[info] 2021-11-25 23:30:30.239 - stderr> Unclosed bracketed comment(line 1, pos 0)
[info] 2021-11-25 23:30:30.239 - stderr>
[info] 2021-11-25 23:30:30.24 - stderr> == SQL ==
[info] 2021-11-25 23:30:30.24 - stderr> /*
[info] 2021-11-25 23:30:30.24 - stderr> ^^^
[info] 2021-11-25 23:30:30.24 - stderr> SELECT /* BROADCAST(b) */ 4
[info] 2021-11-25 23:30:30.24 - stderr>
[info] 2021-11-25 23:30:30.28 - stdout> spark-sql> */
[info] 2021-11-25 23:30:30.308 - stdout>          > SELECT 1
[info] 2021-11-25 23:30:30.336 - stdout>          > ;
[info] 2021-11-25 23:30:30.337 - stdout> */
[info] 2021-11-25 23:30:30.337 - stdout> SELECT 1
[info] 2021-11-25 23:30:30.337 - stdout>
[info] 2021-11-25 23:30:30.339 - stderr> Error in query:
[info] 2021-11-25 23:30:30.339 - stderr> extraneous input '*/' expecting {'(', 'ADD', 'ALTER', 'ANALYZE', 'CACHE', 'CLEAR', 'COMMENT', 'COMMIT', 'CREATE', 'DELETE', 'DESC', 'DESCRIBE', 'DFS', 'DROP', 'EXPLAIN', 'EXPORT', 'FROM', 'GRANT', 'IMPORT', 'INSERT', 'LIST', 'LOAD', 'LOCK', 'MAP', 'MERGE', 'MSCK', 'REDUCE', 'REFRESH', 'REPLACE', 'RESET', 'REVOKE', 'ROLLBACK', 'SELECT', 'SET', 'SHOW', 'START', 'TABLE', 'TRUNCATE', 'UNCACHE', 'UNLOCK', 'UPDATE', 'USE', 'VALUES', 'WITH'}(line 1, pos 0)
[info] 2021-11-25 23:30:30.339 - stderr>
[info] 2021-11-25 23:30:30.339 - stderr> == SQL ==
[info] 2021-11-25 23:30:30.339 - stderr> */
[info] 2021-11-25 23:30:30.339 - stderr> ^^^
[info] 2021-11-25 23:30:30.339 - stderr> SELECT 1
[info] 2021-11-25 23:30:30.339 - stderr>
[info] 2021-11-25 23:30:30.368 - stdout> spark-sql>
[info] 2021-11-25 23:30:30.605 - stdout>
{code}

was:
{code}
/*
select
select /* BROADCAST(b) */ 4\\;
*/
select 1
;
{code}
failed in spark-sql
[jira] [Updated] (SPARK-37099) Impl a rank-based filter to optimize top-k computation
[ https://issues.apache.org/jira/browse/SPARK-37099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhengruifeng updated SPARK-37099:
---------------------------------
Description:
In JD, we found that more than 90% of window function usage follows this pattern:
{code:java}
select (... (row_number|rank|dense_rank) () over( [partition by ...] order by ... ) as rn)
where rn (==|<|<=) k and other conditions{code}
However, the existing physical plan is not optimal:

1. We should select local top-k records within each partition and then compute the global top-k; this helps reduce the shuffle amount. For these three rank functions (row_number|rank|dense_rank), the rank of a key computed on a partial dataset is always <= its final rank computed on the whole dataset, so we can safely discard rows with partial rank > k anywhere.

2. Skewed windows: some partitions are skewed and take a long time to finish computation.

A real-world skewed-window case in our system is attached.

was:
In JD, we found that more than 90% of window function usage follows this pattern:
{code:java}
select (... [row_number|rank|dense_rank]() over([partition by ...] order by ...) as rn)
where rn ==[\<=] k and other conditions{code}
However, the existing physical plan is not optimal:

1. We should select local top-k records within each partition and then compute the global top-k; this helps reduce the shuffle amount. For these three rank functions (row_number|rank|dense_rank), the rank of a key computed on a partial dataset is always <= its final rank computed on the whole dataset, so we can safely discard rows with partial rank > k anywhere.

2. Skewed windows: some partitions are skewed and take a long time to finish computation.

A real-world skewed-window case in our system is attached.
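For reference, a minimal self-contained illustration of the pattern the description targets (the SparkSession setup and the employees data are hypothetical, not taken from the report):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Toy data standing in for a large partitioned table.
Seq(("a", 1), ("a", 2), ("a", 3), ("b", 4), ("b", 5))
  .toDF("dept", "salary")
  .createOrReplaceTempView("employees")

// The rank-then-filter top-k pattern: a rank-based filter could push a
// local top-k selection below the shuffle, because a row's rank on any
// subset of the data never exceeds its final rank on the whole dataset.
spark.sql("""
  SELECT * FROM (
    SELECT dept, salary,
           row_number() OVER (PARTITION BY dept ORDER BY salary DESC) AS rn
    FROM employees
  ) WHERE rn <= 2
""").show()
{code}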
[jira] [Created] (SPARK-37471) spark-sql support nested bracketed comment
angerszhu created SPARK-37471:
------------------------------
             Summary: spark-sql support nested bracketed comment
                 Key: SPARK-37471
                 URL: https://issues.apache.org/jira/browse/SPARK-37471
             Project: Spark
          Issue Type: Task
          Components: SQL
    Affects Versions: 3.2.0
            Reporter: angerszhu

{code}
/*
select
select /* BROADCAST(b) */ 4\\;
*/
select 1
;
{code}
failed in spark-sql
[jira] [Commented] (SPARK-37469) Unified "fetchWaitTime" and "shuffleReadTime" metrics On UI
[ https://issues.apache.org/jira/browse/SPARK-37469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17449384#comment-17449384 ]

Apache Spark commented on SPARK-37469:
--------------------------------------

User 'toujours33' has created a pull request for this issue:
https://github.com/apache/spark/pull/34720

> Unified "fetchWaitTime" and "shuffleReadTime" metrics On UI
> -----------------------------------------------------------
>
>                 Key: SPARK-37469
>                 URL: https://issues.apache.org/jira/browse/SPARK-37469
>             Project: Spark
>          Issue Type: Improvement
>          Components: Web UI
>    Affects Versions: 3.2.0
>            Reporter: Yazhi Wang
>            Priority: Minor
>         Attachments: executor-page.png, sql-page.png
>
> Metrics in the Executor/Task page are shown as "Shuffle Read Block Time", while the SQL page shows "fetch wait time", which is confusing.
> !executor-page.png!
> !sql-page.png!
[jira] [Commented] (SPARK-37469) Unified "fetchWaitTime" and "shuffleReadTime" metrics On UI
[ https://issues.apache.org/jira/browse/SPARK-37469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17449383#comment-17449383 ]

Apache Spark commented on SPARK-37469:
--------------------------------------

User 'toujours33' has created a pull request for this issue:
https://github.com/apache/spark/pull/34720
[jira] [Assigned] (SPARK-37469) Unified "fetchWaitTime" and "shuffleReadTime" metrics On UI
[ https://issues.apache.org/jira/browse/SPARK-37469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-37469:
------------------------------------
    Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-37469) Unified "fetchWaitTime" and "shuffleReadTime" metrics On UI
[ https://issues.apache.org/jira/browse/SPARK-37469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-37469:
------------------------------------
    Assignee: Apache Spark
[jira] [Commented] (SPARK-37470) A new created table gets duplicated after ALTER DATABASE SET LOCATION command
[ https://issues.apache.org/jira/browse/SPARK-37470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17449379#comment-17449379 ]

Yuto Akutsu commented on SPARK-37470:
-------------------------------------

I'm working on this.

> A new created table gets duplicated after ALTER DATABASE SET LOCATION command
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-37470
>                 URL: https://issues.apache.org/jira/browse/SPARK-37470
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Yuto Akutsu
>            Priority: Major
>
> Creating and saving a new table after changing the location of a database with the ALTER DATABASE SET LOCATION command generates a duplicate of the table in the default location (which can be defined by spark.sql.warehouse.dir).
[jira] [Created] (SPARK-37470) A new created table gets duplicated after ALTER DATABASE SET LOCATION command
Yuto Akutsu created SPARK-37470:
--------------------------------
             Summary: A new created table gets duplicated after ALTER DATABASE SET LOCATION command
                 Key: SPARK-37470
                 URL: https://issues.apache.org/jira/browse/SPARK-37470
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.2.0
            Reporter: Yuto Akutsu

Creating and saving a new table after changing the location of a database with the ALTER DATABASE SET LOCATION command generates a duplicate of the table in the default location (which can be defined by spark.sql.warehouse.dir).
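A minimal repro sketch of the reported behavior; the database, table, and path names are hypothetical, and the trailing comments restate the report rather than verified output:

{code:scala}
// Change the database location, then create and populate a managed table.
spark.sql("CREATE DATABASE mydb")
spark.sql("ALTER DATABASE mydb SET LOCATION '/tmp/new_warehouse/mydb.db'")
spark.sql("CREATE TABLE mydb.t (id INT) USING parquet")
spark.sql("INSERT INTO mydb.t VALUES (1)")
// Expected: table files only under /tmp/new_warehouse/mydb.db/t
// Reported: a duplicate copy also appears under spark.sql.warehouse.dir
{code}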
[jira] [Updated] (SPARK-37469) Unified "fetchWaitTime" and "shuffleReadTime" metrics On UI
[ https://issues.apache.org/jira/browse/SPARK-37469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yazhi Wang updated SPARK-37469:
-------------------------------
    Attachment: executor-page.png
                sql-page.png
[jira] [Updated] (SPARK-37469) Unified "fetchWaitTime" and "shuffleReadTime" metrics On UI
[ https://issues.apache.org/jira/browse/SPARK-37469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yazhi Wang updated SPARK-37469:
-------------------------------
Description:
Metrics in the Executor/Task page are shown as "Shuffle Read Block Time", while the SQL page shows "fetch wait time", which is confusing.
!executor-page.png!
!sql-page.png!

was:
Metrics in the Executor/Task page are shown as "Shuffle Read Block Time", while the SQL page shows "fetch wait time", which is confusing.
!executor-page.png!
[jira] [Commented] (SPARK-37469) Unified "fetchWaitTime" and "shuffleReadTime" metrics On UI
[ https://issues.apache.org/jira/browse/SPARK-37469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17449371#comment-17449371 ]

Yazhi Wang commented on SPARK-37469:
------------------------------------

I'm working on it.
[jira] [Updated] (SPARK-37469) Unified "fetchWaitTime" and "shuffleReadTime" metrics On UI
[ https://issues.apache.org/jira/browse/SPARK-37469?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yazhi Wang updated SPARK-37469:
-------------------------------
Description:
Metrics in the Executor/Task page are shown as "Shuffle Read Block Time", while the SQL page shows "fetch wait time", which is confusing.
!executor-page.png!

was:
Metrics in the Executor/Task page are shown as "Shuffle Read Block Time", while the SQL page shows "fetch wait time", which is confusing.
!image-2021-11-26-12-12-46-896.png!
!image-2021-11-26-12-15-28-204.png!
[jira] [Created] (SPARK-37469) Unified "fetchWaitTime" and "shuffleReadTime" metrics On UI
Yazhi Wang created SPARK-37469:
-------------------------------
             Summary: Unified "fetchWaitTime" and "shuffleReadTime" metrics On UI
                 Key: SPARK-37469
                 URL: https://issues.apache.org/jira/browse/SPARK-37469
             Project: Spark
          Issue Type: Improvement
          Components: Web UI
    Affects Versions: 3.2.0
            Reporter: Yazhi Wang

Metrics in the Executor/Task page are shown as "Shuffle Read Block Time", while the SQL page shows "fetch wait time", which is confusing.
!image-2021-11-26-12-12-46-896.png!
!image-2021-11-26-12-15-28-204.png!
[jira] [Commented] (SPARK-37381) Unify v1 and v2 SHOW CREATE TABLE tests
[ https://issues.apache.org/jira/browse/SPARK-37381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17449384#comment-17449384 ]

Apache Spark commented on SPARK-37381:
--------------------------------------

User 'Peng-Lei' has created a pull request for this issue:
https://github.com/apache/spark/pull/34719

> Unify v1 and v2 SHOW CREATE TABLE tests
> ----------------------------------------
>
>                 Key: SPARK-37381
>                 URL: https://issues.apache.org/jira/browse/SPARK-37381
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: PengLei
>            Priority: Major
>             Fix For: 3.3.0
[jira] [Assigned] (SPARK-37381) Unify v1 and v2 SHOW CREATE TABLE tests
[ https://issues.apache.org/jira/browse/SPARK-37381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-37381:
------------------------------------
    Assignee: Apache Spark
[jira] [Assigned] (SPARK-37381) Unify v1 and v2 SHOW CREATE TABLE tests
[ https://issues.apache.org/jira/browse/SPARK-37381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-37381:
------------------------------------
    Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-37381) Unify v1 and v2 SHOW CREATE TABLE tests
[ https://issues.apache.org/jira/browse/SPARK-37381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17449369#comment-17449369 ]

Apache Spark commented on SPARK-37381:
--------------------------------------

User 'Peng-Lei' has created a pull request for this issue:
https://github.com/apache/spark/pull/34719
[jira] [Commented] (SPARK-37460) ALTER (DATABASE|SCHEMA|NAMESPACE) ... SET LOCATION command not documented
[ https://issues.apache.org/jira/browse/SPARK-37460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17449367#comment-17449367 ]

Apache Spark commented on SPARK-37460:
--------------------------------------

User 'yutoacts' has created a pull request for this issue:
https://github.com/apache/spark/pull/34718

> ALTER (DATABASE|SCHEMA|NAMESPACE) ... SET LOCATION command not documented
> --------------------------------------------------------------------------
>
>                 Key: SPARK-37460
>                 URL: https://issues.apache.org/jira/browse/SPARK-37460
>             Project: Spark
>          Issue Type: Improvement
>          Components: Documentation, SQL
>    Affects Versions: 3.2.0
>            Reporter: Yuto Akutsu
>            Priority: Minor
>
> The ALTER DATABASE ... SET LOCATION command should be documented in the sql-ref-syntax-ddl-create-table page.
[jira] [Assigned] (SPARK-37460) ALTER (DATABASE|SCHEMA|NAMESPACE) ... SET LOCATION command not documented
[ https://issues.apache.org/jira/browse/SPARK-37460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-37460:
------------------------------------
    Assignee: Apache Spark
[jira] [Assigned] (SPARK-37460) ALTER (DATABASE|SCHEMA|NAMESPACE) ... SET LOCATION command not documented
[ https://issues.apache.org/jira/browse/SPARK-37460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-37460:
------------------------------------
    Assignee: (was: Apache Spark)
[jira] [Updated] (SPARK-37099) Impl a rank-based filter to optimize top-k computation
[ https://issues.apache.org/jira/browse/SPARK-37099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhengruifeng updated SPARK-37099:
---------------------------------
Description:
In JD, we found that more than 90% of window function usage follows this pattern:
{code:java}
select (... [row_number|rank|dense_rank]() over([partition by ...] order by ...) as rn)
where rn ==[\<=] k and other conditions{code}
However, the existing physical plan is not optimal:

1. We should select local top-k records within each partition and then compute the global top-k; this helps reduce the shuffle amount. For these three rank functions (row_number|rank|dense_rank), the rank of a key computed on a partial dataset is always <= its final rank computed on the whole dataset, so we can safely discard rows with partial rank > k anywhere.

2. Skewed windows: some partitions are skewed and take a long time to finish computation.

A real-world skewed-window case in our system is attached.

was:
In JD, we found that more than 80% of window function usage follows this pattern:
{code:java}
select (... row_number() over(partition by ... order by ...) as rn)
where rn ==[\<=] k{code}
However, the existing physical plan is not optimal:

1. We should select local top-k records within each partition and then compute the global top-k; this helps reduce the shuffle amount.

2. Skewed windows: some partitions are skewed and take a long time to finish computation.

A real-world skewed-window case in our system is attached.
[jira] [Assigned] (SPARK-37468) Support ANSI intervals and TimestampNTZ for UnionEstimation
[ https://issues.apache.org/jira/browse/SPARK-37468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-37468:
------------------------------------
    Assignee: Kousuke Saruta (was: Apache Spark)
[jira] [Commented] (SPARK-37468) Support ANSI intervals and TimestampNTZ for UnionEstimation
[ https://issues.apache.org/jira/browse/SPARK-37468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17449364#comment-17449364 ]

Apache Spark commented on SPARK-37468:
--------------------------------------

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/34716
[jira] [Assigned] (SPARK-37468) Support ANSI intervals and TimestampNTZ for UnionEstimation
[ https://issues.apache.org/jira/browse/SPARK-37468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-37468:
------------------------------------
    Assignee: Apache Spark (was: Kousuke Saruta)
[jira] [Updated] (SPARK-37468) Support ANSI intervals and TimestampNTZ for UnionEstimation
[ https://issues.apache.org/jira/browse/SPARK-37468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kousuke Saruta updated SPARK-37468:
-----------------------------------
Description:
Currently, UnionEstimation doesn't support ANSI intervals and TimestampNTZ. But I think it can support those types because their underlying types are integer or long, which UnionEstimation can compute stats for.

was:
Currently, UnionEstimation doesn't support ANSI intervals and TimestampNTZ. But I think it can support those types because their underlying types are integer or long, which it UnionEstimation can compute stats for.
[jira] [Created] (SPARK-37468) Support ANSI intervals and TimestampNTZ for UnionEstimation
Kousuke Saruta created SPARK-37468:
-----------------------------------
             Summary: Support ANSI intervals and TimestampNTZ for UnionEstimation
                 Key: SPARK-37468
                 URL: https://issues.apache.org/jira/browse/SPARK-37468
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.3.0
            Reporter: Kousuke Saruta
            Assignee: Kousuke Saruta

Currently, UnionEstimation doesn't support ANSI intervals and TimestampNTZ. But I think it can support those types because their underlying types are integer or long, which it UnionEstimation can compute stats for.
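A hedged sketch of the observation in the description, not the actual UnionEstimation code: ANSI interval and TimestampNTZ columns are physically backed by Int or Long, so their min/max statistics can be ordered numerically like those of other integral types.

{code:scala}
import org.apache.spark.sql.types._

// Map each newly supported type to the integral type that backs it.
def underlyingIntegralType(dt: DataType): Option[DataType] = dt match {
  case _: YearMonthIntervalType => Some(IntegerType) // months stored as Int
  case _: DayTimeIntervalType   => Some(LongType)    // microseconds stored as Long
  case TimestampNTZType         => Some(LongType)    // microseconds stored as Long
  case _                        => None
}
{code}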
[jira] [Commented] (SPARK-37445) Update hadoop-profile
[ https://issues.apache.org/jira/browse/SPARK-37445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17449362#comment-17449362 ]

Apache Spark commented on SPARK-37445:
--------------------------------------

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/34715

> Update hadoop-profile
> ---------------------
>
>                 Key: SPARK-37445
>                 URL: https://issues.apache.org/jira/browse/SPARK-37445
>             Project: Spark
>          Issue Type: Task
>          Components: Build
>    Affects Versions: 3.2.0
>            Reporter: angerszhu
>            Priority: Major
>
> The current Hadoop profile is hadoop-3.2; update it to hadoop-3.3.
[jira] [Commented] (SPARK-37465) PySpark tests failing on Pandas 0.23
[ https://issues.apache.org/jira/browse/SPARK-37465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17449358#comment-17449358 ]

Hyukjin Kwon commented on SPARK-37465:
--------------------------------------

cc [~yikunkero] and [~itholic] FYI

> PySpark tests failing on Pandas 0.23
> -------------------------------------
>
>                 Key: SPARK-37465
>                 URL: https://issues.apache.org/jira/browse/SPARK-37465
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 3.2.0
>            Reporter: Willi Raschkowski
>            Priority: Major
>
> I was running Spark tests with Pandas {{0.23.4}} and got the error below. The minimum Pandas version is currently {{0.23.2}} [(Github)|https://github.com/apache/spark/blob/v3.2.0/python/setup.py#L114]. Upgrading to {{0.24.0}} fixes the error. I think Spark needs [this fix (Github)|https://github.com/pandas-dev/pandas/pull/21160/files#diff-1b7183f5b3970e2a1d39a82d71686e39c765d18a34231b54c857b0c4c9bb8222] in Pandas.
> {code:java}
> $ python/run-tests --testnames 'pyspark.pandas.tests.data_type_ops.test_boolean_ops BooleanOpsTest.test_floordiv'
> ...
> ======================================================================
> ERROR [5.785s]: test_floordiv (pyspark.pandas.tests.data_type_ops.test_boolean_ops.BooleanOpsTest)
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File "/home/circleci/project/python/pyspark/pandas/tests/data_type_ops/test_boolean_ops.py", line 128, in test_floordiv
>     self.assert_eq(b_pser // b_pser.astype(int), b_psser // b_psser.astype(int))
>   File "/home/circleci/miniconda/envs/python3/lib/python3.6/site-packages/pandas/core/ops.py", line 1069, in wrapper
>     result = safe_na_op(lvalues, rvalues)
>   File "/home/circleci/miniconda/envs/python3/lib/python3.6/site-packages/pandas/core/ops.py", line 1033, in safe_na_op
>     return na_op(lvalues, rvalues)
>   File "/home/circleci/miniconda/envs/python3/lib/python3.6/site-packages/pandas/core/ops.py", line 1027, in na_op
>     result = missing.fill_zeros(result, x, y, op_name, fill_zeros)
>   File "/home/circleci/miniconda/envs/python3/lib/python3.6/site-packages/pandas/core/missing.py", line 641, in fill_zeros
>     signs = np.sign(y if name.startswith(('r', '__r')) else x)
> TypeError: ufunc 'sign' did not contain a loop with signature matching types dtype('bool') dtype('bool')
> {code}
> These are my relevant package versions:
> {code:java}
> $ conda list | grep -e numpy -e pyarrow -e pandas -e python
> # packages in environment at /home/circleci/miniconda/envs/python3:
> numpy             1.16.6    py36h0a8e133_3
> numpy-base        1.16.6    py36h41b4c56_3
> pandas            0.23.4    py36h04863e7_0
> pyarrow           1.0.1     py36h6200943_36_cpu    conda-forge
> python            3.6.12    hcff3b4d_2             anaconda
> python-dateutil   2.8.1     py_0                   anaconda
> python_abi        3.6       1_cp36m                conda-forge
> {code}
[jira] [Commented] (SPARK-37465) PySpark tests failing on Pandas 0.23
[ https://issues.apache.org/jira/browse/SPARK-37465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17449357#comment-17449357 ]

Hyukjin Kwon commented on SPARK-37465:
--------------------------------------

Maybe we should rather bump up the minimum pandas version to 1.0.0. Would you be interested in submitting a PR? cc [~XinrongM] [~ueshin] FYI
[jira] [Assigned] (SPARK-37457) Update cloudpickle to v2.0.0
[ https://issues.apache.org/jira/browse/SPARK-37457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-37457:
------------------------------------
    Assignee: Hyukjin Kwon

> Update cloudpickle to v2.0.0
> -----------------------------
>
>                 Key: SPARK-37457
>                 URL: https://issues.apache.org/jira/browse/SPARK-37457
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 3.3.0
>            Reporter: Hyukjin Kwon
>            Assignee: Hyukjin Kwon
>            Priority: Major
>
> Cloudpickle 2.0.0 has been released. We should update to match the latest version.
[jira] [Resolved] (SPARK-37457) Update cloudpickle to v2.0.0
[ https://issues.apache.org/jira/browse/SPARK-37457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-37457.
----------------------------------
    Fix Version/s: 3.3.0
       Resolution: Fixed

Issue resolved by pull request 34705
https://github.com/apache/spark/pull/34705
[jira] [Commented] (SPARK-37381) Unify v1 and v2 SHOW CREATE TABLE tests
[ https://issues.apache.org/jira/browse/SPARK-37381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17449352#comment-17449352 ]

dch nguyen commented on SPARK-37381:
------------------------------------

Can I go for it? I want to try to fix it. [~xiaopenglei] [~maxgekk]
[jira] [Resolved] (SPARK-37436) Uses Python's standard string formatter for SQL API in pandas API on Spark
[ https://issues.apache.org/jira/browse/SPARK-37436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-37436.
----------------------------------
    Fix Version/s: 3.3.0
         Assignee: Hyukjin Kwon
       Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/34677

> Uses Python's standard string formatter for SQL API in pandas API on Spark
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-37436
>                 URL: https://issues.apache.org/jira/browse/SPARK-37436
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 3.3.0
>            Reporter: Hyukjin Kwon
>            Assignee: Hyukjin Kwon
>            Priority: Major
>             Fix For: 3.3.0
>
> Currently {{pyspark.pandas.sql}} uses its own hacky parser:
> https://github.com/apache/spark/blob/master/python/pyspark/pandas/sql_processor.py
> We should ideally switch it to the standard Python formatter:
> https://docs.python.org/3/library/string.html#custom-string-formatting
[jira] [Created] (SPARK-37467) Consolidate whole stage and non-whole stage subexpression elimination
Adam Binford created SPARK-37467:
---------------------------------
             Summary: Consolidate whole stage and non-whole stage subexpression elimination
                 Key: SPARK-37467
                 URL: https://issues.apache.org/jira/browse/SPARK-37467
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.3.0
            Reporter: Adam Binford

Currently there are separate subexpression elimination paths for whole-stage and non-whole-stage codegen. Consolidating these into a single code path would make it simpler to add further enhancements, such as supporting lambda functions (https://issues.apache.org/jira/browse/SPARK-37466).
[jira] [Created] (SPARK-37466) Support subexpression elimination in lambda functions
Adam Binford created SPARK-37466:
---------------------------------
             Summary: Support subexpression elimination in lambda functions
                 Key: SPARK-37466
                 URL: https://issues.apache.org/jira/browse/SPARK-37466
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.3.0
            Reporter: Adam Binford

https://issues.apache.org/jira/browse/SPARK-37019 will add codegen support for higher order functions. However, we can't support subexpression elimination inside of lambda functions because subexpressions are evaluated once per row at the beginning of the codegen. Common expressions inside lambda functions can easily result in performance degradation due to multiple evaluations of the same expression. Subexpression elimination inside of lambda functions needs to be handled specially to be evaluated once per function invocation.
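A small query illustrating the problem (a hypothetical example, not from the report): {{x * x}} is a common subexpression inside the lambda, and without elimination scoped to the lambda it is evaluated twice for every array element.

{code:scala}
// transform() is a built-in higher-order function; the lambda body contains
// the duplicated subexpression (x * x).
spark.sql("SELECT transform(array(1, 2, 3), x -> (x * x) + (x * x)) AS v").show()
{code}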
[jira] [Assigned] (SPARK-37445) Update hadoop-profile
[ https://issues.apache.org/jira/browse/SPARK-37445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-37445:
------------------------------------
    Assignee: Apache Spark
[jira] [Assigned] (SPARK-37445) Update hadoop-profile
[ https://issues.apache.org/jira/browse/SPARK-37445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-37445:
------------------------------------
    Assignee: (was: Apache Spark)
[jira] [Reopened] (SPARK-37445) Update hadoop-profile
[ https://issues.apache.org/jira/browse/SPARK-37445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reopened SPARK-37445:
----------------------------------
    Assignee: (was: angerszhu)

Reverted at https://github.com/apache/spark/commit/444cfe66a65fbdbda53366154cf547de90309608
[jira] [Updated] (SPARK-37445) Update hadoop-profile
[ https://issues.apache.org/jira/browse/SPARK-37445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-37445:
---------------------------------
    Fix Version/s: (was: 3.3.0)
[jira] [Assigned] (SPARK-32079) PySpark <> Beam pickling issues for collections.namedtuple
[ https://issues.apache.org/jira/browse/SPARK-32079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-32079: Assignee: Hyukjin Kwon > PySpark <> Beam pickling issues for collections.namedtuple > -- > > Key: SPARK-32079 > URL: https://issues.apache.org/jira/browse/SPARK-32079 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Gerard Casas Saez >Assignee: Hyukjin Kwon >Priority: Major > > PySpark monkeypatching namedtuple makes it difficult/impossible to depickle > collections.namedtuple instances from outside of a pyspark environment. > > When PySpark has been loaded into the environment, any time that you try to > pickle a namedtuple, you are only able to unpickle it from an environment > where the > [hijack|https://github.com/apache/spark/blob/master/python/pyspark/serializers.py#L385] > has been applied. > This conflicts directly when trying to use Beam from a non-Spark environment > (namingly Flink or Dataflow) making it impossible to use the pipeline if it > has a namedtuple loaded somewhere. > > {code:python} > import collections > import dill > ColumnInfo = collections.namedtuple( > "ColumnInfo", > [ > "name", # type: ColumnName # pytype: disable=ignored-type-comment > "type", # type: Optional[ColumnType] # pytype: > disable=ignored-type-comment > ]) > dill.dumps(ColumnInfo('test', int)) > {code} > {{b'\x80\x03cdill._dill\n_create_namedtuple\nq\x00X\n\x00\x00\x00ColumnInfoq\x01X\x04\x00\x00\x00nameq\x02X\x04\x00\x00\x00typeq\x03\x86q\x04X\x08\x00\x00\x00__main__q\x05\x87q\x06Rq\x07X\x04\x00\x00\x00testq\x08cdill._dill\n_load_type\nq\tX\x03\x00\x00\x00intq\n\x85q\x0bRq\x0c\x86q\r\x81q\x0e.'}} > {code:python} > import pyspark > import collections > import dill > ColumnInfo = collections.namedtuple( > "ColumnInfo", > [ > "name", # type: ColumnName # pytype: disable=ignored-type-comment > "type", # type: Optional[ColumnType] # pytype: > disable=ignored-type-comment > ]) > dill.dumps(ColumnInfo('test', int)) > {code} > {{b'\x80\x03cpyspark.serializers\n_restore\nq\x00X\n\x00\x00\x00ColumnInfoq\x01X\x04\x00\x00\x00nameq\x02X\x04\x00\x00\x00typeq\x03\x86q\x04X\x04\x00\x00\x00testq\x05cdill._dill\n_load_type\nq\x06X\x03\x00\x00\x00intq\x07\x85q\x08Rq\t\x86q\n\x87q\x0bRq\x0c.'}} > Second pickled object can only be used from an environment with PySpark. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32079) PySpark <> Beam pickling issues for collections.namedtuple
[ https://issues.apache.org/jira/browse/SPARK-32079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-32079. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34688 [https://github.com/apache/spark/pull/34688] > PySpark <> Beam pickling issues for collections.namedtuple > -- > > Key: SPARK-32079 > URL: https://issues.apache.org/jira/browse/SPARK-32079 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Gerard Casas Saez >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.3.0 > > > PySpark monkeypatching namedtuple makes it difficult/impossible to depickle > collections.namedtuple instances from outside of a pyspark environment. > > When PySpark has been loaded into the environment, any time that you try to > pickle a namedtuple, you are only able to unpickle it from an environment > where the > [hijack|https://github.com/apache/spark/blob/master/python/pyspark/serializers.py#L385] > has been applied. > This conflicts directly when trying to use Beam from a non-Spark environment > (namingly Flink or Dataflow) making it impossible to use the pipeline if it > has a namedtuple loaded somewhere. > > {code:python} > import collections > import dill > ColumnInfo = collections.namedtuple( > "ColumnInfo", > [ > "name", # type: ColumnName # pytype: disable=ignored-type-comment > "type", # type: Optional[ColumnType] # pytype: > disable=ignored-type-comment > ]) > dill.dumps(ColumnInfo('test', int)) > {code} > {{b'\x80\x03cdill._dill\n_create_namedtuple\nq\x00X\n\x00\x00\x00ColumnInfoq\x01X\x04\x00\x00\x00nameq\x02X\x04\x00\x00\x00typeq\x03\x86q\x04X\x08\x00\x00\x00__main__q\x05\x87q\x06Rq\x07X\x04\x00\x00\x00testq\x08cdill._dill\n_load_type\nq\tX\x03\x00\x00\x00intq\n\x85q\x0bRq\x0c\x86q\r\x81q\x0e.'}} > {code:python} > import pyspark > import collections > import dill > ColumnInfo = collections.namedtuple( > "ColumnInfo", > [ > "name", # type: ColumnName # pytype: disable=ignored-type-comment > "type", # type: Optional[ColumnType] # pytype: > disable=ignored-type-comment > ]) > dill.dumps(ColumnInfo('test', int)) > {code} > {{b'\x80\x03cpyspark.serializers\n_restore\nq\x00X\n\x00\x00\x00ColumnInfoq\x01X\x04\x00\x00\x00nameq\x02X\x04\x00\x00\x00typeq\x03\x86q\x04X\x04\x00\x00\x00testq\x05cdill._dill\n_load_type\nq\x06X\x03\x00\x00\x00intq\x07\x85q\x08Rq\t\x86q\n\x87q\x0bRq\x0c.'}} > Second pickled object can only be used from an environment with PySpark. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37465) PySpark tests failing on Pandas 0.23
[ https://issues.apache.org/jira/browse/SPARK-37465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Willi Raschkowski updated SPARK-37465: -- Description: I was running Spark tests with Pandas {{0.23.4}} and got the error below. The minimum Pandas version is currently {{0.23.2}} [(Github)|https://github.com/apache/spark/blob/v3.2.0/python/setup.py#L114]. Upgrading to {{0.24.0}} fixes the error. I think Spark needs [this fix (Github)|https://github.com/pandas-dev/pandas/pull/21160/files#diff-1b7183f5b3970e2a1d39a82d71686e39c765d18a34231b54c857b0c4c9bb8222] in Pandas. {code:java} $ python/run-tests --testnames 'pyspark.pandas.tests.data_type_ops.test_boolean_ops BooleanOpsTest.test_floordiv' ... == ERROR [5.785s]: test_floordiv (pyspark.pandas.tests.data_type_ops.test_boolean_ops.BooleanOpsTest) -- Traceback (most recent call last): File "/home/circleci/project/python/pyspark/pandas/tests/data_type_ops/test_boolean_ops.py", line 128, in test_floordiv self.assert_eq(b_pser // b_pser.astype(int), b_psser // b_psser.astype(int)) File "/home/circleci/miniconda/envs/python3/lib/python3.6/site-packages/pandas/core/ops.py", line 1069, in wrapper result = safe_na_op(lvalues, rvalues) File "/home/circleci/miniconda/envs/python3/lib/python3.6/site-packages/pandas/core/ops.py", line 1033, in safe_na_op return na_op(lvalues, rvalues) File "/home/circleci/miniconda/envs/python3/lib/python3.6/site-packages/pandas/core/ops.py", line 1027, in na_op result = missing.fill_zeros(result, x, y, op_name, fill_zeros) File "/home/circleci/miniconda/envs/python3/lib/python3.6/site-packages/pandas/core/missing.py", line 641, in fill_zeros signs = np.sign(y if name.startswith(('r', '__r')) else x) TypeError: ufunc 'sign' did not contain a loop with signature matching types dtype('bool') dtype('bool') {code} These are my relevant package versions: {code:java} $ conda list | grep -e numpy -e pyarrow -e pandas -e python # packages in environment at /home/circleci/miniconda/envs/python3: numpy 1.16.6 py36h0a8e133_3 numpy-base1.16.6 py36h41b4c56_3 pandas0.23.4 py36h04863e7_0 pyarrow 1.0.1 py36h6200943_36_cpuconda-forge python3.6.12 hcff3b4d_2anaconda python-dateutil 2.8.1 py_0anaconda python_abi3.6 1_cp36mconda-forg {code} was: I was running Spark tests with Pandas {{0.23.4}} and got the error below. The minimum Pandas version is currently {{0.23.2}} [(Github)|https://github.com/apache/spark/blob/v3.2.0/python/setup.py#L114]. Upgrading to {{0.24.0}} fixes the error. I think Spark needs [this fix (Github)|https://github.com/pandas-dev/pandas/pull/21160] in Pandas. {code:java} $ python/run-tests --testnames 'pyspark.pandas.tests.data_type_ops.test_boolean_ops BooleanOpsTest.test_floordiv' ... 
== ERROR [5.785s]: test_floordiv (pyspark.pandas.tests.data_type_ops.test_boolean_ops.BooleanOpsTest) -- Traceback (most recent call last): File "/home/circleci/project/python/pyspark/pandas/tests/data_type_ops/test_boolean_ops.py", line 128, in test_floordiv self.assert_eq(b_pser // b_pser.astype(int), b_psser // b_psser.astype(int)) File "/home/circleci/miniconda/envs/python3/lib/python3.6/site-packages/pandas/core/ops.py", line 1069, in wrapper result = safe_na_op(lvalues, rvalues) File "/home/circleci/miniconda/envs/python3/lib/python3.6/site-packages/pandas/core/ops.py", line 1033, in safe_na_op return na_op(lvalues, rvalues) File "/home/circleci/miniconda/envs/python3/lib/python3.6/site-packages/pandas/core/ops.py", line 1027, in na_op result = missing.fill_zeros(result, x, y, op_name, fill_zeros) File "/home/circleci/miniconda/envs/python3/lib/python3.6/site-packages/pandas/core/missing.py", line 641, in fill_zeros signs = np.sign(y if name.startswith(('r', '__r')) else x) TypeError: ufunc 'sign' did not contain a loop with signature matching types dtype('bool') dtype('bool') {code} These are my relevant package versions: {code:java} $ conda list | grep -e numpy -e pyarrow -e pandas -e python # packages in environment at /home/circleci/miniconda/envs/python3: numpy 1.16.6 py36h0a8e133_3 numpy-base1.16.6 py36h41b4c56_3 pandas0.23.4 py36h04863e7_0 pyarrow 1.0.1 py36h6200943_36_cpuconda-forge python3.6.12 hcff3b4d_2
[jira] [Created] (SPARK-37465) PySpark tests failing on Pandas 0.23
Willi Raschkowski created SPARK-37465: - Summary: PySpark tests failing on Pandas 0.23 Key: SPARK-37465 URL: https://issues.apache.org/jira/browse/SPARK-37465 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 3.2.0 Reporter: Willi Raschkowski I was running Spark tests with Pandas {{0.23.4}} and got the error below. The minimum Pandas version is currently {{0.23.2}} [(Github)|https://github.com/apache/spark/blob/v3.2.0/python/setup.py#L114]. Upgrading to {{0.24.0}} fixes the error. I think Spark needs [this fix (Github)|https://github.com/pandas-dev/pandas/pull/21160] in Pandas. {code:java} $ python/run-tests --testnames 'pyspark.pandas.tests.data_type_ops.test_boolean_ops BooleanOpsTest.test_floordiv' ... == ERROR [5.785s]: test_floordiv (pyspark.pandas.tests.data_type_ops.test_boolean_ops.BooleanOpsTest) -- Traceback (most recent call last): File "/home/circleci/project/python/pyspark/pandas/tests/data_type_ops/test_boolean_ops.py", line 128, in test_floordiv self.assert_eq(b_pser // b_pser.astype(int), b_psser // b_psser.astype(int)) File "/home/circleci/miniconda/envs/python3/lib/python3.6/site-packages/pandas/core/ops.py", line 1069, in wrapper result = safe_na_op(lvalues, rvalues) File "/home/circleci/miniconda/envs/python3/lib/python3.6/site-packages/pandas/core/ops.py", line 1033, in safe_na_op return na_op(lvalues, rvalues) File "/home/circleci/miniconda/envs/python3/lib/python3.6/site-packages/pandas/core/ops.py", line 1027, in na_op result = missing.fill_zeros(result, x, y, op_name, fill_zeros) File "/home/circleci/miniconda/envs/python3/lib/python3.6/site-packages/pandas/core/missing.py", line 641, in fill_zeros signs = np.sign(y if name.startswith(('r', '__r')) else x) TypeError: ufunc 'sign' did not contain a loop with signature matching types dtype('bool') dtype('bool') {code} These are my relevant package versions: {code:java} $ conda list | grep -e numpy -e pyarrow -e pandas -e python # packages in environment at /home/circleci/miniconda/envs/python3: numpy 1.16.6 py36h0a8e133_3 numpy-base1.16.6 py36h41b4c56_3 pandas0.23.4 py36h04863e7_0 pyarrow 1.0.1 py36h6200943_36_cpuconda-forge python3.6.12 hcff3b4d_2anaconda python-dateutil 2.8.1 py_0anaconda python_abi3.6 1_cp36mconda-forg {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37464) SCHEMA and DATABASE should simply be aliases of NAMESPACE
[ https://issues.apache.org/jira/browse/SPARK-37464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37464: Assignee: Apache Spark > SCHEMA and DATABASE should simply be aliases of NAMESPACE > - > > Key: SPARK-37464 > URL: https://issues.apache.org/jira/browse/SPARK-37464 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37464) SCHEMA and DATABASE should simply be aliases of NAMESPACE
[ https://issues.apache.org/jira/browse/SPARK-37464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37464: Assignee: (was: Apache Spark) > SCHEMA and DATABASE should simply be aliases of NAMESPACE > - > > Key: SPARK-37464 > URL: https://issues.apache.org/jira/browse/SPARK-37464 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37464) SCHEMA and DATABASE should simply be aliases of NAMESPACE
[ https://issues.apache.org/jira/browse/SPARK-37464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17449227#comment-17449227 ] Apache Spark commented on SPARK-37464: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/34713 > SCHEMA and DATABASE should simply be aliases of NAMESPACE > - > > Key: SPARK-37464 > URL: https://issues.apache.org/jira/browse/SPARK-37464 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37464) SCHEMA and DATABASE should simply be aliases of NAMESPACE
Wenchen Fan created SPARK-37464: --- Summary: SCHEMA and DATABASE should simply be aliases of NAMESPACE Key: SPARK-37464 URL: https://issues.apache.org/jira/browse/SPARK-37464 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.0 Reporter: Wenchen Fan -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
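For illustration (not from the ticket): a spark-shell sketch of the equivalence the proposal implies. Namespace names are arbitrary.
{code:scala}
// All three spellings should map to the same command after this change.
spark.sql("CREATE NAMESPACE IF NOT EXISTS ns_a")
spark.sql("CREATE SCHEMA IF NOT EXISTS ns_b")    // alias of CREATE NAMESPACE
spark.sql("CREATE DATABASE IF NOT EXISTS ns_c")  // alias of CREATE NAMESPACE

spark.sql("SHOW NAMESPACES").show()
{code}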
[jira] [Updated] (SPARK-37463) Read/Write Timestamp ntz or ltz to Orc uses UTC timestamp
[ https://issues.apache.org/jira/browse/SPARK-37463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-37463: --- Summary: Read/Write Timestamp ntz or ltz to Orc uses UTC timestamp (was: read/write Timestamp ntz or ltz to Orc uses UTC timestamp) > Read/Write Timestamp ntz or ltz to Orc uses UTC timestamp > - > > Key: SPARK-37463 > URL: https://issues.apache.org/jira/browse/SPARK-37463 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: jiaan.geng >Priority: Major > > There are some example code: > import java.util.TimeZone > TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles")) > sql("set spark.sql.session.timeZone=America/Los_Angeles") > val df = sql("select timestamp_ntz '2021-06-01 00:00:00' ts_ntz, timestamp > '2021-06-01 00:00:00' ts") > df.write.mode("overwrite").orc("ts_ntz_orc") > df.write.mode("overwrite").parquet("ts_ntz_parquet") > df.write.mode("overwrite").format("avro").save("ts_ntz_avro") > val query = """ > select 'orc', * > from `orc`.`ts_ntz_orc` > union all > select 'parquet', * > from `parquet`.`ts_ntz_parquet` > union all > select 'avro', * > from `avro`.`ts_ntz_avro` > """ > val tzs = Seq("America/Los_Angeles", "UTC", "Europe/Amsterdam") > for (tz <- tzs) { > TimeZone.setDefault(TimeZone.getTimeZone(tz)) > sql(s"set spark.sql.session.timeZone=$tz") > println(s"Time zone is ${TimeZone.getDefault.getID}") > sql(query).show(false) > } > The output show below looks so strange. > Time zone is America/Los_Angeles > +---+---+---+ > |orc|ts_ntz |ts | > +---+---+---+ > |orc|2021-06-01 00:00:00|2021-06-01 00:00:00| > |parquet|2021-06-01 00:00:00|2021-06-01 00:00:00| > |avro |2021-06-01 00:00:00|2021-06-01 00:00:00| > +---+---+---+ > Time zone is UTC > +---+---+---+ > |orc|ts_ntz |ts | > +---+---+---+ > |orc|2021-05-31 17:00:00|2021-06-01 00:00:00| > |parquet|2021-06-01 00:00:00|2021-06-01 07:00:00| > |avro |2021-06-01 00:00:00|2021-06-01 07:00:00| > +---+---+---+ > Time zone is Europe/Amsterdam > +---+---+---+ > |orc|ts_ntz |ts | > +---+---+---+ > |orc|2021-05-31 15:00:00|2021-06-01 00:00:00| > |parquet|2021-06-01 00:00:00|2021-06-01 09:00:00| > |avro |2021-06-01 00:00:00|2021-06-01 09:00:00| > +---+---+---+ -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37463) read/write Timestamp ntz or ltz to Orc uses UTC timestamp
[ https://issues.apache.org/jira/browse/SPARK-37463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37463: Assignee: Apache Spark > read/write Timestamp ntz or ltz to Orc uses UTC timestamp > - > > Key: SPARK-37463 > URL: https://issues.apache.org/jira/browse/SPARK-37463 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: jiaan.geng >Assignee: Apache Spark >Priority: Major > > There are some example code: > import java.util.TimeZone > TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles")) > sql("set spark.sql.session.timeZone=America/Los_Angeles") > val df = sql("select timestamp_ntz '2021-06-01 00:00:00' ts_ntz, timestamp > '2021-06-01 00:00:00' ts") > df.write.mode("overwrite").orc("ts_ntz_orc") > df.write.mode("overwrite").parquet("ts_ntz_parquet") > df.write.mode("overwrite").format("avro").save("ts_ntz_avro") > val query = """ > select 'orc', * > from `orc`.`ts_ntz_orc` > union all > select 'parquet', * > from `parquet`.`ts_ntz_parquet` > union all > select 'avro', * > from `avro`.`ts_ntz_avro` > """ > val tzs = Seq("America/Los_Angeles", "UTC", "Europe/Amsterdam") > for (tz <- tzs) { > TimeZone.setDefault(TimeZone.getTimeZone(tz)) > sql(s"set spark.sql.session.timeZone=$tz") > println(s"Time zone is ${TimeZone.getDefault.getID}") > sql(query).show(false) > } > The output show below looks so strange. > Time zone is America/Los_Angeles > +---+---+---+ > |orc|ts_ntz |ts | > +---+---+---+ > |orc|2021-06-01 00:00:00|2021-06-01 00:00:00| > |parquet|2021-06-01 00:00:00|2021-06-01 00:00:00| > |avro |2021-06-01 00:00:00|2021-06-01 00:00:00| > +---+---+---+ > Time zone is UTC > +---+---+---+ > |orc|ts_ntz |ts | > +---+---+---+ > |orc|2021-05-31 17:00:00|2021-06-01 00:00:00| > |parquet|2021-06-01 00:00:00|2021-06-01 07:00:00| > |avro |2021-06-01 00:00:00|2021-06-01 07:00:00| > +---+---+---+ > Time zone is Europe/Amsterdam > +---+---+---+ > |orc|ts_ntz |ts | > +---+---+---+ > |orc|2021-05-31 15:00:00|2021-06-01 00:00:00| > |parquet|2021-06-01 00:00:00|2021-06-01 09:00:00| > |avro |2021-06-01 00:00:00|2021-06-01 09:00:00| > +---+---+---+ -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37463) read/write Timestamp ntz or ltz to Orc uses UTC timestamp
[ https://issues.apache.org/jira/browse/SPARK-37463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17449179#comment-17449179 ] Apache Spark commented on SPARK-37463: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/34712 > read/write Timestamp ntz or ltz to Orc uses UTC timestamp > - > > Key: SPARK-37463 > URL: https://issues.apache.org/jira/browse/SPARK-37463 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: jiaan.geng >Priority: Major > > There are some example code: > import java.util.TimeZone > TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles")) > sql("set spark.sql.session.timeZone=America/Los_Angeles") > val df = sql("select timestamp_ntz '2021-06-01 00:00:00' ts_ntz, timestamp > '2021-06-01 00:00:00' ts") > df.write.mode("overwrite").orc("ts_ntz_orc") > df.write.mode("overwrite").parquet("ts_ntz_parquet") > df.write.mode("overwrite").format("avro").save("ts_ntz_avro") > val query = """ > select 'orc', * > from `orc`.`ts_ntz_orc` > union all > select 'parquet', * > from `parquet`.`ts_ntz_parquet` > union all > select 'avro', * > from `avro`.`ts_ntz_avro` > """ > val tzs = Seq("America/Los_Angeles", "UTC", "Europe/Amsterdam") > for (tz <- tzs) { > TimeZone.setDefault(TimeZone.getTimeZone(tz)) > sql(s"set spark.sql.session.timeZone=$tz") > println(s"Time zone is ${TimeZone.getDefault.getID}") > sql(query).show(false) > } > The output show below looks so strange. > Time zone is America/Los_Angeles > +---+---+---+ > |orc|ts_ntz |ts | > +---+---+---+ > |orc|2021-06-01 00:00:00|2021-06-01 00:00:00| > |parquet|2021-06-01 00:00:00|2021-06-01 00:00:00| > |avro |2021-06-01 00:00:00|2021-06-01 00:00:00| > +---+---+---+ > Time zone is UTC > +---+---+---+ > |orc|ts_ntz |ts | > +---+---+---+ > |orc|2021-05-31 17:00:00|2021-06-01 00:00:00| > |parquet|2021-06-01 00:00:00|2021-06-01 07:00:00| > |avro |2021-06-01 00:00:00|2021-06-01 07:00:00| > +---+---+---+ > Time zone is Europe/Amsterdam > +---+---+---+ > |orc|ts_ntz |ts | > +---+---+---+ > |orc|2021-05-31 15:00:00|2021-06-01 00:00:00| > |parquet|2021-06-01 00:00:00|2021-06-01 09:00:00| > |avro |2021-06-01 00:00:00|2021-06-01 09:00:00| > +---+---+---+ -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37463) read/write Timestamp ntz or ltz to Orc uses UTC timestamp
[ https://issues.apache.org/jira/browse/SPARK-37463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37463: Assignee: (was: Apache Spark) > read/write Timestamp ntz or ltz to Orc uses UTC timestamp > - > > Key: SPARK-37463 > URL: https://issues.apache.org/jira/browse/SPARK-37463 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: jiaan.geng >Priority: Major > > There are some example code: > import java.util.TimeZone > TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles")) > sql("set spark.sql.session.timeZone=America/Los_Angeles") > val df = sql("select timestamp_ntz '2021-06-01 00:00:00' ts_ntz, timestamp > '2021-06-01 00:00:00' ts") > df.write.mode("overwrite").orc("ts_ntz_orc") > df.write.mode("overwrite").parquet("ts_ntz_parquet") > df.write.mode("overwrite").format("avro").save("ts_ntz_avro") > val query = """ > select 'orc', * > from `orc`.`ts_ntz_orc` > union all > select 'parquet', * > from `parquet`.`ts_ntz_parquet` > union all > select 'avro', * > from `avro`.`ts_ntz_avro` > """ > val tzs = Seq("America/Los_Angeles", "UTC", "Europe/Amsterdam") > for (tz <- tzs) { > TimeZone.setDefault(TimeZone.getTimeZone(tz)) > sql(s"set spark.sql.session.timeZone=$tz") > println(s"Time zone is ${TimeZone.getDefault.getID}") > sql(query).show(false) > } > The output show below looks so strange. > Time zone is America/Los_Angeles > +---+---+---+ > |orc|ts_ntz |ts | > +---+---+---+ > |orc|2021-06-01 00:00:00|2021-06-01 00:00:00| > |parquet|2021-06-01 00:00:00|2021-06-01 00:00:00| > |avro |2021-06-01 00:00:00|2021-06-01 00:00:00| > +---+---+---+ > Time zone is UTC > +---+---+---+ > |orc|ts_ntz |ts | > +---+---+---+ > |orc|2021-05-31 17:00:00|2021-06-01 00:00:00| > |parquet|2021-06-01 00:00:00|2021-06-01 07:00:00| > |avro |2021-06-01 00:00:00|2021-06-01 07:00:00| > +---+---+---+ > Time zone is Europe/Amsterdam > +---+---+---+ > |orc|ts_ntz |ts | > +---+---+---+ > |orc|2021-05-31 15:00:00|2021-06-01 00:00:00| > |parquet|2021-06-01 00:00:00|2021-06-01 09:00:00| > |avro |2021-06-01 00:00:00|2021-06-01 09:00:00| > +---+---+---+ -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37463) read/write Timestamp ntz or ltz to Orc uses UTC timestamp
jiaan.geng created SPARK-37463: -- Summary: read/write Timestamp ntz or ltz to Orc uses UTC timestamp Key: SPARK-37463 URL: https://issues.apache.org/jira/browse/SPARK-37463 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.3.0 Reporter: jiaan.geng Here is some example code: import java.util.TimeZone TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles")) sql("set spark.sql.session.timeZone=America/Los_Angeles") val df = sql("select timestamp_ntz '2021-06-01 00:00:00' ts_ntz, timestamp '2021-06-01 00:00:00' ts") df.write.mode("overwrite").orc("ts_ntz_orc") df.write.mode("overwrite").parquet("ts_ntz_parquet") df.write.mode("overwrite").format("avro").save("ts_ntz_avro") val query = """ select 'orc', * from `orc`.`ts_ntz_orc` union all select 'parquet', * from `parquet`.`ts_ntz_parquet` union all select 'avro', * from `avro`.`ts_ntz_avro` """ val tzs = Seq("America/Los_Angeles", "UTC", "Europe/Amsterdam") for (tz <- tzs) { TimeZone.setDefault(TimeZone.getTimeZone(tz)) sql(s"set spark.sql.session.timeZone=$tz") println(s"Time zone is ${TimeZone.getDefault.getID}") sql(query).show(false) } The output shown below looks strange. Time zone is America/Los_Angeles +---+---+---+ |orc|ts_ntz |ts | +---+---+---+ |orc|2021-06-01 00:00:00|2021-06-01 00:00:00| |parquet|2021-06-01 00:00:00|2021-06-01 00:00:00| |avro |2021-06-01 00:00:00|2021-06-01 00:00:00| +---+---+---+ Time zone is UTC +---+---+---+ |orc|ts_ntz |ts | +---+---+---+ |orc|2021-05-31 17:00:00|2021-06-01 00:00:00| |parquet|2021-06-01 00:00:00|2021-06-01 07:00:00| |avro |2021-06-01 00:00:00|2021-06-01 07:00:00| +---+---+---+ Time zone is Europe/Amsterdam +---+---+---+ |orc|ts_ntz |ts | +---+---+---+ |orc|2021-05-31 15:00:00|2021-06-01 00:00:00| |parquet|2021-06-01 00:00:00|2021-06-01 09:00:00| |avro |2021-06-01 00:00:00|2021-06-01 09:00:00| +---+---+---+ -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37462) Avoid unnecessary calculating the number of outstanding fetch requests and RPCS
[ https://issues.apache.org/jira/browse/SPARK-37462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17449170#comment-17449170 ] Apache Spark commented on SPARK-37462: -- User 'weixiuli' has created a pull request for this issue: https://github.com/apache/spark/pull/34711 > Avoid unnecessary calculating the number of outstanding fetch requests and > RPCS > > > Key: SPARK-37462 > URL: https://issues.apache.org/jira/browse/SPARK-37462 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core >Affects Versions: 3.1.0, 3.2.0 >Reporter: weixiuli >Priority: Major > > It is unnecessary to calculate the number of outstanding fetch requests and > RPCS when the IdleStateEvent is not IDLE or the last request is not timeout. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37462) Avoid unnecessary calculating the number of outstanding fetch requests and RPCS
[ https://issues.apache.org/jira/browse/SPARK-37462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37462: Assignee: Apache Spark > Avoid unnecessary calculating the number of outstanding fetch requests and > RPCS > > > Key: SPARK-37462 > URL: https://issues.apache.org/jira/browse/SPARK-37462 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core >Affects Versions: 3.1.0, 3.2.0 >Reporter: weixiuli >Assignee: Apache Spark >Priority: Major > > It is unnecessary to calculate the number of outstanding fetch requests and > RPCS when the IdleStateEvent is not IDLE or the last request is not timeout. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37462) Avoid unnecessary calculating the number of outstanding fetch requests and RPCS
[ https://issues.apache.org/jira/browse/SPARK-37462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37462: Assignee: (was: Apache Spark) > Avoid unnecessary calculating the number of outstanding fetch requests and > RPCS > > > Key: SPARK-37462 > URL: https://issues.apache.org/jira/browse/SPARK-37462 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core >Affects Versions: 3.1.0, 3.2.0 >Reporter: weixiuli >Priority: Major > > It is unnecessary to calculate the number of outstanding fetch requests and > RPCS when the IdleStateEvent is not IDLE or the last request is not timeout. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37445) Update hadoop-profile
[ https://issues.apache.org/jira/browse/SPARK-37445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-37445: --- Assignee: angerszhu > Update hadoop-profile > - > > Key: SPARK-37445 > URL: https://issues.apache.org/jira/browse/SPARK-37445 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 3.2.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > > Current hadoop profile is hadoop-3.2, update to hadoop-3.3, -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37445) Update hadoop-profile
[ https://issues.apache.org/jira/browse/SPARK-37445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-37445. - Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34689 [https://github.com/apache/spark/pull/34689] > Update hadoop-profile > - > > Key: SPARK-37445 > URL: https://issues.apache.org/jira/browse/SPARK-37445 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 3.2.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Fix For: 3.3.0 > > > Current hadoop profile is hadoop-3.2, update to hadoop-3.3, -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
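For illustration (not from the ticket, and the exact profile set may differ by branch): how the profile rename would surface in a Maven build, assuming the standard Spark build profiles.
{code}
# Spark 3.2 line: the Hadoop 3 build is selected with the hadoop-3.2 profile
./build/mvn -DskipTests -Pyarn -Phadoop-3.2 package

# After the proposed update, the profile would track the bundled Hadoop 3.3
./build/mvn -DskipTests -Pyarn -Phadoop-3.3 package
{code}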
[jira] [Updated] (SPARK-37462) Avoid unnecessary calculating the number of outstanding fetch requests and RPCS
[ https://issues.apache.org/jira/browse/SPARK-37462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] weixiuli updated SPARK-37462: - Description: It is unnecessary to calculate the number of outstanding fetch requests and RPCS when the IdleStateEvent is not IDLE or the last request has not timed out. (was: To avoid unnecessary calculation of outstanding fetch requests and RPCS) Summary: Avoid unnecessary calculating the number of outstanding fetch requests and RPCS (was: To avoid unnecessary calculation of outstanding fetch requests and RPCS) > Avoid unnecessary calculating the number of outstanding fetch requests and > RPCS > > > Key: SPARK-37462 > URL: https://issues.apache.org/jira/browse/SPARK-37462 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core >Affects Versions: 3.1.0, 3.2.0 >Reporter: weixiuli >Priority: Major > > It is unnecessary to calculate the number of outstanding fetch requests and > RPCS when the IdleStateEvent is not IDLE or the last request has not timed out. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
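For illustration (not Spark's actual code): a self-contained sketch of the short-circuit the description implies; all names are hypothetical. The point is that the potentially costly counting of outstanding fetches and RPCs runs only when the event is an idle event and the last request has actually timed out.
{code:scala}
object IdleGuardSketch {
  sealed trait Event
  case object AllIdle extends Event // stands in for IdleStateEvent(ALL_IDLE)
  case object Other extends Event

  def countOutstanding(): Int = 3   // stand-in for the potentially costly scan

  def onEvent(evt: Event, nowNs: Long, lastRequestNs: Long, timeoutNs: Long): Unit =
    evt match {
      case AllIdle if nowNs - lastRequestNs > timeoutNs =>
        // Only this path pays for the outstanding-request accounting.
        if (countOutstanding() > 0) println("closing idle channel")
      case _ => () // non-idle event or no timeout: skip the calculation entirely
    }
}
{code}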
[jira] [Updated] (SPARK-37462) To avoid unnecessary calculation of outstanding fetch requests and RPCS
[ https://issues.apache.org/jira/browse/SPARK-37462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] weixiuli updated SPARK-37462: - Description: To avoid unnecessary calculation of outstanding fetch requests and RPCS (was: To avoid unnecessary flight request calculations) Summary: To avoid unnecessary calculation of outstanding fetch requests and RPCS (was: To avoid unnecessary flight request calculations) > To avoid unnecessary calculation of outstanding fetch requests and RPCS > --- > > Key: SPARK-37462 > URL: https://issues.apache.org/jira/browse/SPARK-37462 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core >Affects Versions: 3.1.0, 3.2.0 >Reporter: weixiuli >Priority: Major > > To avoid unnecessary calculation of outstanding fetch requests and RPCS -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37311) Migrate ALTER NAMESPACE ... SET LOCATION to use v2 command by default
[ https://issues.apache.org/jira/browse/SPARK-37311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-37311. - Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34708 [https://github.com/apache/spark/pull/34708] > Migrate ALTER NAMESPACE ... SET LOCATION to use v2 command by default > - > > Key: SPARK-37311 > URL: https://issues.apache.org/jira/browse/SPARK-37311 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Major > Fix For: 3.3.0 > > > Migrate ALTER NAMESPACE ... SET LOCATION to use v2 command by default -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37311) Migrate ALTER NAMESPACE ... SET LOCATION to use v2 command by default
[ https://issues.apache.org/jira/browse/SPARK-37311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-37311: --- Assignee: Terry Kim > Migrate ALTER NAMESPACE ... SET LOCATION to use v2 command by default > - > > Key: SPARK-37311 > URL: https://issues.apache.org/jira/browse/SPARK-37311 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Major > > Migrate ALTER NAMESPACE ... SET LOCATION to use v2 command by default -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37462) To avoid unnecessary flight request calculations
weixiuli created SPARK-37462: Summary: To avoid unnecessary flight request calculations Key: SPARK-37462 URL: https://issues.apache.org/jira/browse/SPARK-37462 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Affects Versions: 3.2.0, 3.1.0 Reporter: weixiuli To avoid unnecessary flight request calculations -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37461) yarn-client mode client's appid value is null
[ https://issues.apache.org/jira/browse/SPARK-37461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17449135#comment-17449135 ] Apache Spark commented on SPARK-37461: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/34710 > yarn-client mode client's appid value is null > - > > Key: SPARK-37461 > URL: https://issues.apache.org/jira/browse/SPARK-37461 > Project: Spark > Issue Type: Task > Components: YARN >Affects Versions: 3.2.0 >Reporter: angerszhu >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37461) yarn-client mode client's appid value is null
[ https://issues.apache.org/jira/browse/SPARK-37461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37461: Assignee: (was: Apache Spark) > yarn-client mode client's appid value is null > - > > Key: SPARK-37461 > URL: https://issues.apache.org/jira/browse/SPARK-37461 > Project: Spark > Issue Type: Task > Components: YARN >Affects Versions: 3.2.0 >Reporter: angerszhu >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37461) yarn-client mode client's appid value is null
[ https://issues.apache.org/jira/browse/SPARK-37461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37461: Assignee: Apache Spark > yarn-client mode client's appid value is null > - > > Key: SPARK-37461 > URL: https://issues.apache.org/jira/browse/SPARK-37461 > Project: Spark > Issue Type: Task > Components: YARN >Affects Versions: 3.2.0 >Reporter: angerszhu >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37461) yarn-client mode client's appid value is null
[ https://issues.apache.org/jira/browse/SPARK-37461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17449133#comment-17449133 ] Apache Spark commented on SPARK-37461: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/34710 > yarn-client mode client's appid value is null > - > > Key: SPARK-37461 > URL: https://issues.apache.org/jira/browse/SPARK-37461 > Project: Spark > Issue Type: Task > Components: YARN >Affects Versions: 3.2.0 >Reporter: angerszhu >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-37461) yarn-client mode client's appid value is null
angerszhu created SPARK-37461: - Summary: yarn-client mode client's appid value is null Key: SPARK-37461 URL: https://issues.apache.org/jira/browse/SPARK-37461 Project: Spark Issue Type: Task Components: YARN Affects Versions: 3.2.0 Reporter: angerszhu -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37259) JDBC read is always going to wrap the query in a select statement
[ https://issues.apache.org/jira/browse/SPARK-37259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17449072#comment-17449072 ] Apache Spark commented on SPARK-37259: -- User 'akhalymon-cv' has created a pull request for this issue: https://github.com/apache/spark/pull/34709 > JDBC read is always going to wrap the query in a select statement > - > > Key: SPARK-37259 > URL: https://issues.apache.org/jira/browse/SPARK-37259 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2 >Reporter: Kevin Appel >Priority: Major > > The read jdbc is wrapping the query it sends to the database server inside a > select statement and there is no way to override this currently. > Initially I ran into this issue when trying to run a CTE query against SQL > server and it fails, the details of the failure is in these cases: > [https://github.com/microsoft/mssql-jdbc/issues/1340] > [https://github.com/microsoft/mssql-jdbc/issues/1657] > [https://github.com/microsoft/sql-spark-connector/issues/147] > https://issues.apache.org/jira/browse/SPARK-32825 > https://issues.apache.org/jira/browse/SPARK-34928 > I started to patch the code to get the query to run and ran into a few > different items, if there is a way to add these features to allow this code > path to run, this would be extremely helpful to running these type of edge > case queries. These are basic examples here the actual queries are much more > complex and would require significant time to rewrite. > Inside JDBCOptions.scala the query is being set to either, using the dbtable > this allows the query to be passed without modification > > {code:java} > name.trim > or > s"(${subquery}) SPARK_GEN_SUBQ_${curId.getAndIncrement()}" > {code} > > Inside JDBCRelation.scala this is going to try to get the schema for this > query, and this ends up running dialect.getSchemaQuery which is doing: > {code:java} > s"SELECT * FROM $table WHERE 1=0"{code} > Overriding the dialect here and initially just passing back the $table gets > passed here and to the next issue which is in the compute function in > JDBCRDD.scala > > {code:java} > val sqlText = s"SELECT $columnList FROM ${options.tableOrQuery} > $myTableSampleClause" + s" $myWhereClause $getGroupByClause $myLimitClause" > > {code} > > For these two queries, about a CTE query and using temp tables, finding out > the schema is difficult without actually running the query and for the temp > table if you run it in the schema check that will have the table now exist > and fail when it runs the actual query. 
> > The way I patched these is by doing these two items: > JDBCRDD.scala (compute) > > {code:java} > val runQueryAsIs = options.parameters.getOrElse("runQueryAsIs", > "false").toBoolean > val sqlText = if (runQueryAsIs) { > s"${options.tableOrQuery}" > } else { > s"SELECT $columnList FROM ${options.tableOrQuery} $myWhereClause" > } > {code} > JDBCRelation.scala (getSchema) > {code:java} > val useCustomSchema = jdbcOptions.parameters.getOrElse("useCustomSchema", > "false").toBoolean > if (useCustomSchema) { > val myCustomSchema = jdbcOptions.parameters.getOrElse("customSchema", > "").toString > val newSchema = CatalystSqlParser.parseTableSchema(myCustomSchema) > logInfo(s"Going to return the new $newSchema because useCustomSchema is > $useCustomSchema and passed in $myCustomSchema") > newSchema > } else { > val tableSchema = JDBCRDD.resolveTable(jdbcOptions) > jdbcOptions.customSchema match { > case Some(customSchema) => JdbcUtils.getCustomSchema( > tableSchema, customSchema, resolver) > case None => tableSchema > } > }{code} > > This is allowing the query to run as is, by using the dbtable option and then > provide a custom schema that will bypass the dialect schema check > > Test queries > > {code:java} > query1 = """ > SELECT 1 as DummyCOL > """ > query2 = """ > WITH DummyCTE AS > ( > SELECT 1 as DummyCOL > ) > SELECT * > FROM DummyCTE > """ > query3 = """ > (SELECT * > INTO #Temp1a > FROM > (SELECT @@VERSION as version) data > ) > (SELECT * > FROM > #Temp1a) > """ > {code} > > Test schema > > {code:java} > schema1 = """ > DummyXCOL INT > """ > schema2 = """ > DummyXCOL STRING > """ > {code} > > Test code > > {code:java} > jdbcDFWorking = ( > spark.read.format("jdbc") > .option("url", > f"jdbc:sqlserver://{server}:{port};databaseName={database};") > .option("user", user) > .option("password", password) > .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") > .option("dbtable", queryx) > .option("customSchema", schemax) > .option("useCustomSchema", "true") > .option("runQueryAsIs", "true") >
[jira] [Commented] (SPARK-37259) JDBC read is always going to wrap the query in a select statement
[ https://issues.apache.org/jira/browse/SPARK-37259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17449067#comment-17449067 ] Andrew commented on SPARK-37259: I've created another PR following [~KevinAppelBofa] idea to unwrap the query and pass schema manually when the user chooses to do so. If let's say an option 'useRawQuery' passed, the query will run as-is. The downside of this user has to provide schema manually. However, there are also advantages, as we are not running the query twice, and the user doesn't have to modify the query and split the 'with' clause. PR link is https://github.com/apache/spark/pull/34709 [~KevinAppelBofa] [~petertoth] what are your thoughts on this? > JDBC read is always going to wrap the query in a select statement > - > > Key: SPARK-37259 > URL: https://issues.apache.org/jira/browse/SPARK-37259 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2 >Reporter: Kevin Appel >Priority: Major > > The read jdbc is wrapping the query it sends to the database server inside a > select statement and there is no way to override this currently. > Initially I ran into this issue when trying to run a CTE query against SQL > server and it fails, the details of the failure is in these cases: > [https://github.com/microsoft/mssql-jdbc/issues/1340] > [https://github.com/microsoft/mssql-jdbc/issues/1657] > [https://github.com/microsoft/sql-spark-connector/issues/147] > https://issues.apache.org/jira/browse/SPARK-32825 > https://issues.apache.org/jira/browse/SPARK-34928 > I started to patch the code to get the query to run and ran into a few > different items, if there is a way to add these features to allow this code > path to run, this would be extremely helpful to running these type of edge > case queries. These are basic examples here the actual queries are much more > complex and would require significant time to rewrite. > Inside JDBCOptions.scala the query is being set to either, using the dbtable > this allows the query to be passed without modification > > {code:java} > name.trim > or > s"(${subquery}) SPARK_GEN_SUBQ_${curId.getAndIncrement()}" > {code} > > Inside JDBCRelation.scala this is going to try to get the schema for this > query, and this ends up running dialect.getSchemaQuery which is doing: > {code:java} > s"SELECT * FROM $table WHERE 1=0"{code} > Overriding the dialect here and initially just passing back the $table gets > passed here and to the next issue which is in the compute function in > JDBCRDD.scala > > {code:java} > val sqlText = s"SELECT $columnList FROM ${options.tableOrQuery} > $myTableSampleClause" + s" $myWhereClause $getGroupByClause $myLimitClause" > > {code} > > For these two queries, about a CTE query and using temp tables, finding out > the schema is difficult without actually running the query and for the temp > table if you run it in the schema check that will have the table now exist > and fail when it runs the actual query. 
> > The way I patched these is by doing these two items: > JDBCRDD.scala (compute) > > {code:java} > val runQueryAsIs = options.parameters.getOrElse("runQueryAsIs", > "false").toBoolean > val sqlText = if (runQueryAsIs) { > s"${options.tableOrQuery}" > } else { > s"SELECT $columnList FROM ${options.tableOrQuery} $myWhereClause" > } > {code} > JDBCRelation.scala (getSchema) > {code:java} > val useCustomSchema = jdbcOptions.parameters.getOrElse("useCustomSchema", > "false").toBoolean > if (useCustomSchema) { > val myCustomSchema = jdbcOptions.parameters.getOrElse("customSchema", > "").toString > val newSchema = CatalystSqlParser.parseTableSchema(myCustomSchema) > logInfo(s"Going to return the new $newSchema because useCustomSchema is > $useCustomSchema and passed in $myCustomSchema") > newSchema > } else { > val tableSchema = JDBCRDD.resolveTable(jdbcOptions) > jdbcOptions.customSchema match { > case Some(customSchema) => JdbcUtils.getCustomSchema( > tableSchema, customSchema, resolver) > case None => tableSchema > } > }{code} > > This is allowing the query to run as is, by using the dbtable option and then > provide a custom schema that will bypass the dialect schema check > > Test queries > > {code:java} > query1 = """ > SELECT 1 as DummyCOL > """ > query2 = """ > WITH DummyCTE AS > ( > SELECT 1 as DummyCOL > ) > SELECT * > FROM DummyCTE > """ > query3 = """ > (SELECT * > INTO #Temp1a > FROM > (SELECT @@VERSION as version) data > ) > (SELECT * > FROM > #Temp1a) > """ > {code} > > Test schema > > {code:java} > schema1 = """ > DummyXCOL INT > """ > schema2 = """ > DummyXCOL STRING > """ > {code} > > Test code > > {code:java} >
[jira] [Commented] (SPARK-37460) ALTER (DATABASE|SCHEMA|NAMESPACE) ... SET LOCATION command not documented
[ https://issues.apache.org/jira/browse/SPARK-37460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17449049#comment-17449049 ] Yuto Akutsu commented on SPARK-37460: - I'm working on this. > ALTER (DATABASE|SCHEMA|NAMESPACE) ... SET LOCATION command not documented > - > > Key: SPARK-37460 > URL: https://issues.apache.org/jira/browse/SPARK-37460 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 3.2.0 >Reporter: Yuto Akutsu >Priority: Minor > > The instruction of {color:#ff}ALTER DATABASE ... SET LOCATION ... > {color:#172b4d}command{color}{color} should be documented in a > sql-ref-syntax-ddl-create-table page. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-37357) Add small partition factor for rebalance partitions
[ https://issues.apache.org/jira/browse/SPARK-37357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-37357. - Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 34634 [https://github.com/apache/spark/pull/34634] > Add small partition factor for rebalance partitions > --- > > Key: SPARK-37357 > URL: https://issues.apache.org/jira/browse/SPARK-37357 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: XiDuo You >Priority: Major > Fix For: 3.3.0 > > > For example `Rebalance` provide a functionality that split the large reduce > partition into smalls. However we have seen many SQL produce small files due > to the last partition. > Let's say we have one reduce partition and six map partitions and the blocks > are: > [10, 10, 10, 10, 10, 10] > If the target size is 50, we will get two files with 50 and 10. And it will > get worse if there are thousands of reduce partitions. > It should be helpful if we can control the min partition size. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
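For illustration (not Spark's actual implementation): a minimal sketch of the packing rule the description implies. A trailing partition smaller than smallFactor * targetSize is merged into its predecessor instead of becoming a tiny file; the function and parameter names are made up for the sketch.
{code:scala}
def pack(blocks: Seq[Long], targetSize: Long, smallFactor: Double): Seq[Long] = {
  val packed = scala.collection.mutable.ArrayBuffer.empty[Long]
  var current = 0L
  for (b <- blocks) {
    if (current >= targetSize) { packed += current; current = 0L }
    current += b
  }
  if (current > 0) {
    if (packed.nonEmpty && current < smallFactor * targetSize) {
      packed(packed.length - 1) += current // merge the small tail upstream
    } else {
      packed += current
    }
  }
  packed.toSeq
}

// The example from the description: six 10-byte blocks, target size 50.
pack(Seq(10, 10, 10, 10, 10, 10), 50, 0.0) // List(50, 10) -- a 10-byte tail file
pack(Seq(10, 10, 10, 10, 10, 10), 50, 0.5) // List(60)     -- the tail is merged
{code}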
[jira] [Assigned] (SPARK-37357) Add small partition factor for rebalance partitions
[ https://issues.apache.org/jira/browse/SPARK-37357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-37357: --- Assignee: XiDuo You > Add small partition factor for rebalance partitions > --- > > Key: SPARK-37357 > URL: https://issues.apache.org/jira/browse/SPARK-37357 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Major > Fix For: 3.3.0 > > > For example `Rebalance` provide a functionality that split the large reduce > partition into smalls. However we have seen many SQL produce small files due > to the last partition. > Let's say we have one reduce partition and six map partitions and the blocks > are: > [10, 10, 10, 10, 10, 10] > If the target size is 50, we will get two files with 50 and 10. And it will > get worse if there are thousands of reduce partitions. > It should be helpful if we can control the min partition size. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37311) Migrate ALTER NAMESPACE ... SET LOCATION to use v2 command by default
[ https://issues.apache.org/jira/browse/SPARK-37311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17449025#comment-17449025 ] Apache Spark commented on SPARK-37311: -- User 'imback82' has created a pull request for this issue: https://github.com/apache/spark/pull/34708 > Migrate ALTER NAMESPACE ... SET LOCATION to use v2 command by default > - > > Key: SPARK-37311 > URL: https://issues.apache.org/jira/browse/SPARK-37311 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Terry Kim >Priority: Major > > Migrate ALTER NAMESPACE ... SET LOCATION to use v2 command by default -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37311) Migrate ALTER NAMESPACE ... SET LOCATION to use v2 command by default
[ https://issues.apache.org/jira/browse/SPARK-37311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37311: Assignee: Apache Spark > Migrate ALTER NAMESPACE ... SET LOCATION to use v2 command by default > - > > Key: SPARK-37311 > URL: https://issues.apache.org/jira/browse/SPARK-37311 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Terry Kim >Assignee: Apache Spark >Priority: Major > > Migrate ALTER NAMESPACE ... SET LOCATION to use v2 command by default -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-37311) Migrate ALTER NAMESPACE ... SET LOCATION to use v2 command by default
[ https://issues.apache.org/jira/browse/SPARK-37311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-37311: Assignee: (was: Apache Spark) > Migrate ALTER NAMESPACE ... SET LOCATION to use v2 command by default > - > > Key: SPARK-37311 > URL: https://issues.apache.org/jira/browse/SPARK-37311 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: Terry Kim >Priority: Major > > Migrate ALTER NAMESPACE ... SET LOCATION to use v2 command by default -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37460) ALTER (DATABASE|SCHEMA|NAMESPACE) ... SET LOCATION command not documented
[ https://issues.apache.org/jira/browse/SPARK-37460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuto Akutsu updated SPARK-37460: Description: The instruction of {color:#ff}ALTER DATABASE ... SET LOCATION ... {color:#172b4d}command{color}{color} should be documented in a sql-ref-syntax-ddl-create-table page. (was: The instruction of {color:#ff}ALTER DATABASE ... SET LOCATION ...{color:#172b4d} command{color}{color}{color:#172b4d} {color}should be documented in a sql-ref-syntax-ddl-create-table page.) > ALTER (DATABASE|SCHEMA|NAMESPACE) ... SET LOCATION command not documented > - > > Key: SPARK-37460 > URL: https://issues.apache.org/jira/browse/SPARK-37460 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 3.2.0 >Reporter: Yuto Akutsu >Priority: Minor > > The instruction of {color:#ff}ALTER DATABASE ... SET LOCATION ... > {color:#172b4d}command{color}{color} should be documented in a > sql-ref-syntax-ddl-create-table page. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37460) ALTER (DATABASE|SCHEMA|NAMESPACE) ... SET LOCATION command not documented
[ https://issues.apache.org/jira/browse/SPARK-37460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuto Akutsu updated SPARK-37460: Description: The instruction of {color:#FF}ALTER DATABASE ... SET LOCATION ... {color:#172b4d}command{color}{color} should be documented in a sql-ref-syntax-ddl-create-table page. (was: The instruction of `ALTER DATABASE ... SET LOCATION ...` should be documented in a `sql-ref-syntax-ddl-create-table` page.) > ALTER (DATABASE|SCHEMA|NAMESPACE) ... SET LOCATION command not documented > - > > Key: SPARK-37460 > URL: https://issues.apache.org/jira/browse/SPARK-37460 > Project: Spark > Issue Type: Improvement > Components: Documentation, SQL >Affects Versions: 3.2.0 >Reporter: Yuto Akutsu >Priority: Minor > > The instruction of {color:#FF}ALTER DATABASE ... SET LOCATION ... > {color:#172b4d}command{color}{color} should be documented in a > sql-ref-syntax-ddl-create-table page. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-37460) ALTER (DATABASE|SCHEMA|NAMESPACE) ... SET LOCATION command not documented
[ https://issues.apache.org/jira/browse/SPARK-37460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuto Akutsu updated SPARK-37460:
--------------------------------

    Description: The instruction of {color:#ff}ALTER DATABASE ... SET LOCATION ...{color:#172b4d} command{color}{color}{color:#172b4d} {color}should be documented in a sql-ref-syntax-ddl-create-table page.
          (was: The instruction of {color:#FF}ALTER DATABASE ... SET LOCATION ... {color:#172b4d}command{color}{color} should be documented in a sql-ref-syntax-ddl-create-table page.)

> ALTER (DATABASE|SCHEMA|NAMESPACE) ... SET LOCATION command not documented
> --------------------------------------------------------------------------
>
>                 Key: SPARK-37460
>                 URL: https://issues.apache.org/jira/browse/SPARK-37460
>             Project: Spark
>          Issue Type: Improvement
>          Components: Documentation, SQL
>    Affects Versions: 3.2.0
>            Reporter: Yuto Akutsu
>            Priority: Minor
>
> The instruction of {color:#ff}ALTER DATABASE ... SET LOCATION ...{color:#172b4d} command{color}{color}{color:#172b4d} {color}should be documented in a sql-ref-syntax-ddl-create-table page.
[jira] [Created] (SPARK-37460) ALTER (DATABASE|SCHEMA|NAMESPACE) ... SET LOCATION command not documented
Yuto Akutsu created SPARK-37460:
-----------------------------------

             Summary: ALTER (DATABASE|SCHEMA|NAMESPACE) ... SET LOCATION command not documented
                 Key: SPARK-37460
                 URL: https://issues.apache.org/jira/browse/SPARK-37460
             Project: Spark
          Issue Type: Improvement
          Components: Documentation, SQL
    Affects Versions: 3.2.0
            Reporter: Yuto Akutsu

The instruction of `ALTER DATABASE ... SET LOCATION ...` should be documented in a `sql-ref-syntax-ddl-create-table` page.
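Per the ticket title, the three keywords are synonyms in Spark SQL. A minimal sketch of the undocumented command follows; the database name `inventory` and the paths are made-up examples, not from the ticket.

{code}
-- DATABASE, SCHEMA and NAMESPACE are interchangeable here; names and
-- paths are illustrative only.
ALTER DATABASE inventory SET LOCATION 'hdfs://namenode:8020/warehouse/inventory.db';
ALTER SCHEMA inventory SET LOCATION '/tmp/warehouse/inventory.db';
ALTER NAMESPACE inventory SET LOCATION '/tmp/warehouse/inventory.db';
{code}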
[jira] [Resolved] (SPARK-37454) support expressions in time travel timestamp
[ https://issues.apache.org/jira/browse/SPARK-37454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-37454.
---------------------------------

    Fix Version/s: 3.3.0
       Resolution: Fixed

Issue resolved by pull request 34699
[https://github.com/apache/spark/pull/34699]

> support expressions in time travel timestamp
> ---------------------------------------------
>
>                 Key: SPARK-37454
>                 URL: https://issues.apache.org/jira/browse/SPARK-37454
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Wenchen Fan
>            Assignee: Wenchen Fan
>            Priority: Major
>             Fix For: 3.3.0
>
[jira] [Assigned] (SPARK-37454) support expressions in time travel timestamp
[ https://issues.apache.org/jira/browse/SPARK-37454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan reassigned SPARK-37454:
-----------------------------------

    Assignee: Wenchen Fan

> support expressions in time travel timestamp
> ---------------------------------------------
>
>                 Key: SPARK-37454
>                 URL: https://issues.apache.org/jira/browse/SPARK-37454
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Wenchen Fan
>            Assignee: Wenchen Fan
>            Priority: Major
>
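A minimal sketch of what this change enables, assuming a table `t` backed by a data source that implements time travel; the table name and timestamp values are placeholders.

{code}
-- Literal timestamp: the form already accepted before this change.
SELECT * FROM t TIMESTAMP AS OF '2021-11-25 10:00:00';
-- An expression that evaluates to a timestamp: what SPARK-37454 adds.
SELECT * FROM t TIMESTAMP AS OF current_timestamp() - INTERVAL 12 HOURS;
{code}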