[jira] [Updated] (SPARK-35332) Not Coalesce shuffle partitions when cache table

2021-05-06 Thread Xianghao Lu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xianghao Lu updated SPARK-35332:

Description: 
*How to reproduce the problem*

_linux shell command to prepare data:_
 for i in $(seq 20);do echo "$(($i+10)),name$i,$(($i*10))";done > 
data.text

_sql to reproduce the problem:_
 * create table data_table(id int, str string, num int) row format delimited 
fields terminated by ',';
 * load data local inpath '/path/to/data.text' into table data_table;
 * CACHE TABLE test_cache_table AS
 SELECT str
 FROM
 (SELECT id, str FROM data_table)
 GROUP BY str;

Finally, you will see a stage with 200 tasks because the shuffle partitions 
are not coalesced; this wastes resources when the data size is small.
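
A minimal Scala sketch of the same reproduction, assuming a Spark 3.x session 
with Hive support and with AQE coalescing explicitly enabled 
(`spark.sql.adaptive.enabled` and `spark.sql.adaptive.coalescePartitions.enabled`); 
the table comes from the steps above:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SPARK-35332-repro")
  .enableHiveSupport()
  // AQE normally coalesces small shuffle partitions after a shuffle.
  .config("spark.sql.adaptive.enabled", "true")
  .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
  .getOrCreate()

// CACHE TABLE ... AS SELECT materializes the aggregation eagerly; despite
// AQE being on, the shuffle stage still runs with the default
// spark.sql.shuffle.partitions (200) tasks.
spark.sql(
  """CACHE TABLE test_cache_table AS
    |SELECT str FROM (SELECT id, str FROM data_table) GROUP BY str
    |""".stripMargin)

// Partition count of the cached result: 200 instead of a coalesced value.
println(spark.table("test_cache_table").rdd.getNumPartitions)
{code}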

  was:
*How to reproduce the problem*

_linux shell command to prepare data:_
 for i in $(seq 20);do echo "$(($i+10)),name$i,$(($i*10))";done > 
data.text

_sql to reproduce the problem:_
 * create table data_table(id int, str string, num int) row format delimited 
fields terminated by ',';
 * load data local inpath '/path/to/data.text' into table data_table;
 * CACHE TABLE test_cache_table AS
 SELECT str
 FROM
 (SELECT id, str FROM data_table)
 GROUP BY str;

Finally, you will see a stage with 200 tasks because the shuffle partitions 
are not coalesced; this wastes resources when the data size is small.


> Not Coalesce shuffle partitions when cache table
> 
>
> Key: SPARK-35332
> URL: https://issues.apache.org/jira/browse/SPARK-35332
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 3.0.1, 3.1.0, 3.1.1
> Environment: latest spark version
>Reporter: Xianghao Lu
>Priority: Major
> Attachments: cacheTable.png
>
>
> *How to reproduce the problem*
> _linux shell command to prepare data:_
>  for i in $(seq 20);do echo "$(($i+10)),name$i,$(($i*10))";done > 
> data.text
> _sql to reproduce the problem:_
>  * create table data_table(id int, str string, num int) row format delimited 
> fields terminated by ',';
>  * load data local inpath '/path/to/data.text' into table data_table;
>  * CACHE TABLE test_cache_table AS
>  SELECT str
>  FROM
>  (SELECT id, str FROM data_table)
>  GROUP BY str;
> Finally, you will see a stage with 200 tasks because the shuffle partitions 
> are not coalesced; this wastes resources when the data size is small.






[jira] [Updated] (SPARK-35332) Not Coalesce shuffle partitions when cache table

2021-05-06 Thread Xianghao Lu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xianghao Lu updated SPARK-35332:

Description: 
*How to reproduce the problem*

_linux shell command to prepare data_
 for i in $(seq 20);do echo "$(($i+10)),name$i,$(($i*10))";done > 
data.text

_sql to reproduce the problem_
 * create table data_table(id int, str string, num int) row format delimited 
fields terminated by ',';
 * load data local inpath '/path/to/data.text' into table data_table;
 * CACHE TABLE test_cache_table AS
 SELECT str
 FROM
 (SELECT id, str FROM data_table)
 GROUP BY str;

Finally, you will see a stage with 200 tasks because the shuffle partitions 
are not coalesced; this wastes resources when the data size is small.

  was:
How to reproduce the problem

prepare data
for i in $(seq 20);do echo "$(($i+10)),name$i,$(($i*10))";done > 
data.text

sql to reproduce the problem
* create table data_table(id int, str string, num int) row format delimited 
fields terminated by ',';
* load data local inpath '/path/to/data.text' into table data_table;
* CACHE TABLE test_cache_table AS
SELECT str
FROM
  (SELECT id, str FROM data_table)
 GROUP BY str;

Finally, you will see a stage with 200 tasks because the shuffle partitions 
are not coalesced; this wastes resources when the data size is small.



> Not Coalesce shuffle partitions when cache table
> 
>
> Key: SPARK-35332
> URL: https://issues.apache.org/jira/browse/SPARK-35332
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 3.0.1, 3.1.0, 3.1.1
> Environment: latest spark version
>Reporter: Xianghao Lu
>Priority: Major
> Attachments: cacheTable.png
>
>
> *How to reproduce the problem*
> _linux shell command to prepare data_
>  for i in $(seq 20);do echo "$(($i+10)),name$i,$(($i*10))";done > 
> data.text
> _sql to reproduce the problem_
>  * create table data_table(id int, str string, num int) row format delimited 
> fields terminated by ',';
>  * load data local inpath '/path/to/data.text' into table data_table;
>  * CACHE TABLE test_cache_table AS
>  SELECT str
>  FROM
>  (SELECT id, str FROM data_table)
>  GROUP BY str;
> Finally, you will see a stage with 200 tasks because the shuffle partitions 
> are not coalesced; this wastes resources when the data size is small.






[jira] [Updated] (SPARK-35332) Not Coalesce shuffle partitions when cache table

2021-05-06 Thread Xianghao Lu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xianghao Lu updated SPARK-35332:

Description: 
*How to reproduce the problem*

_linux shell command to prepare data:_
 for i in $(seq 20);do echo "$(($i+10)),name$i,$(($i*10))";done > 
data.text

_sql to reproduce the problem:_
 * create table data_table(id int, str string, num int) row format delimited 
fields terminated by ',';
 * load data local inpath '/path/to/data.text' into table data_table;
 * CACHE TABLE test_cache_table AS
 SELECT str
 FROM
 (SELECT id, str FROM data_table)
 GROUP BY str;

Finally, you will see a stage with 200 tasks because the shuffle partitions 
are not coalesced; this wastes resources when the data size is small.

  was:
*How to reproduce the problem*

_linux shell command to prepare data_
 for i in $(seq 20);do echo "$(($i+10)),name$i,$(($i*10))";done > 
data.text

_sql to reproduce the problem_
 * create table data_table(id int, str string, num int) row format delimited 
fields terminated by ',';
 * load data local inpath '/path/to/data.text' into table data_table;
 * CACHE TABLE test_cache_table AS
 SELECT str
 FROM
 (SELECT id, str FROM data_table)
 GROUP BY str;

Finally, you will see a stage with 200 tasks because the shuffle partitions 
are not coalesced; this wastes resources when the data size is small.


> Not Coalesce shuffle partitions when cache table
> 
>
> Key: SPARK-35332
> URL: https://issues.apache.org/jira/browse/SPARK-35332
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 3.0.1, 3.1.0, 3.1.1
> Environment: latest spark version
>Reporter: Xianghao Lu
>Priority: Major
> Attachments: cacheTable.png
>
>
> *How to reproduce the problem*
> _linux shell command to prepare data:_
>  for i in $(seq 20);do echo "$(($i+10)),name$i,$(($i*10))";done > 
> data.text
> _sql to reproduce the problem:_
>  * create table data_table(id int, str string, num int) row format delimited 
> fields terminated by ',';
>  * load data local inpath '/path/to/data.text' into table data_table;
>  * CACHE TABLE test_cache_table AS
>  SELECT str
>  FROM
>  (SELECT id, str FROM data_table)
>  GROUP BY str;
> Finally, you will see a stage with 200 tasks because the shuffle partitions 
> are not coalesced; this wastes resources when the data size is small.






[jira] [Updated] (SPARK-35332) Not Coalesce shuffle partitions when cache table

2021-05-06 Thread Xianghao Lu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xianghao Lu updated SPARK-35332:

Attachment: cacheTable.png

> Not Coalesce shuffle partitions when cache table
> 
>
> Key: SPARK-35332
> URL: https://issues.apache.org/jira/browse/SPARK-35332
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 3.0.1, 3.1.0, 3.1.1
> Environment: latest spark version
>Reporter: Xianghao Lu
>Priority: Major
> Attachments: cacheTable.png
>
>
> How to reproduce the problem
> prepare data
> for i in $(seq 20);do echo "$(($i+10)),name$i,$(($i*10))";done > 
> data.text
> sql to reproduce the problem
> * create table data_table(id int, str string, num int) row format delimited 
> fields terminated by ',';
> * load data local inpath '/path/to/data.text' into table data_table;
> * CACHE TABLE test_cache_table AS
> SELECT str
> FROM
>   (SELECT id, str FROM data_table)
>  GROUP BY str;
> Finally, you will see a stage with 200 tasks because the shuffle partitions 
> are not coalesced; this wastes resources when the data size is small.






[jira] [Created] (SPARK-35332) Not Coalesce shuffle partitions when cache table

2021-05-06 Thread Xianghao Lu (Jira)
Xianghao Lu created SPARK-35332:
---

 Summary: Not Coalesce shuffle partitions when cache table
 Key: SPARK-35332
 URL: https://issues.apache.org/jira/browse/SPARK-35332
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle
Affects Versions: 3.1.1, 3.1.0, 3.0.1
 Environment: latest spark version
Reporter: Xianghao Lu


How to reproduce the problem

prepare data
for i in $(seq 20);do echo "$(($i+10)),name$i,$(($i*10))";done > 
data.text

sql to reproduce the problem
* create table data_table(id int, str string, num int) row format delimited 
fields terminated by ',';
* load data local inpath '/path/to/data.text' into table data_table;
* CACHE TABLE test_cache_table AS
SELECT str
FROM
  (SELECT id, str FROM data_table)
 GROUP BY str;

Finally, you will see a stage with 200 tasks because the shuffle partitions 
are not coalesced; this wastes resources when the data size is small.







[jira] [Assigned] (SPARK-35331) Attributes become unknown in RepartitionByExpression after aliased

2021-05-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35331:


Assignee: Apache Spark

> Attributes become unknown in RepartitionByExpression after aliased
> --
>
> Key: SPARK-35331
> URL: https://issues.apache.org/jira/browse/SPARK-35331
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.8, 3.0.2, 3.1.1, 3.2.0
>Reporter: Kent Yao
>Assignee: Apache Spark
>Priority: Major
>
> {code:java}
> explain extended select a b from values(1) t(a) distribute by a;
> == Parsed Logical Plan ==
> 'RepartitionByExpression ['a]
> +- 'Project ['a AS b#42]
>+- 'SubqueryAlias t
>   +- 'UnresolvedInlineTable [a], [List(1)]
> == Analyzed Logical Plan ==
> org.apache.spark.sql.AnalysisException: cannot resolve 'a' given input 
> columns: [b]; line 1 pos 62;
> 'RepartitionByExpression ['a]
> +- Project [a#48 AS b#42]
>+- SubqueryAlias t
>   +- LocalRelation [a#48]
> {code}
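
A minimal sketch of hitting the same failure from Scala, assuming a Spark 
shell on one of the affected versions; `spark.sql` analyzes the query eagerly, 
so the exception from the analyzed plan above surfaces immediately:

{code:scala}
import org.apache.spark.sql.AnalysisException

try {
  // Aliasing `a` to `b` in the SELECT list makes the DISTRIBUTE BY
  // column unresolvable, as the analyzed plan above shows.
  spark.sql("select a b from values(1) t(a) distribute by a")
} catch {
  case e: AnalysisException =>
    // Expected: cannot resolve 'a' given input columns: [b]
    println(e.getMessage)
}
{code}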






[jira] [Commented] (SPARK-35331) Attributes become unknown in RepartitionByExpression after aliased

2021-05-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17340547#comment-17340547
 ] 

Apache Spark commented on SPARK-35331:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/32465

> Attributes become unknown in RepartitionByExpression after aliased
> --
>
> Key: SPARK-35331
> URL: https://issues.apache.org/jira/browse/SPARK-35331
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.8, 3.0.2, 3.1.1, 3.2.0
>Reporter: Kent Yao
>Priority: Major
>
> {code:java}
> explain extended select a b from values(1) t(a) distribute by a;
> == Parsed Logical Plan ==
> 'RepartitionByExpression ['a]
> +- 'Project ['a AS b#42]
>+- 'SubqueryAlias t
>   +- 'UnresolvedInlineTable [a], [List(1)]
> == Analyzed Logical Plan ==
> org.apache.spark.sql.AnalysisException: cannot resolve 'a' given input 
> columns: [b]; line 1 pos 62;
> 'RepartitionByExpression ['a]
> +- Project [a#48 AS b#42]
>+- SubqueryAlias t
>   +- LocalRelation [a#48]
> {code}






[jira] [Assigned] (SPARK-35331) Attributes become unknown in RepartitionByExpression after aliased

2021-05-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35331:


Assignee: (was: Apache Spark)

> Attributes become unknown in RepartitionByExpression after aliased
> --
>
> Key: SPARK-35331
> URL: https://issues.apache.org/jira/browse/SPARK-35331
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.8, 3.0.2, 3.1.1, 3.2.0
>Reporter: Kent Yao
>Priority: Major
>
> {code:java}
> explain extended select a b from values(1) t(a) distribute by a;
> == Parsed Logical Plan ==
> 'RepartitionByExpression ['a]
> +- 'Project ['a AS b#42]
>+- 'SubqueryAlias t
>   +- 'UnresolvedInlineTable [a], [List(1)]
> == Analyzed Logical Plan ==
> org.apache.spark.sql.AnalysisException: cannot resolve 'a' given input 
> columns: [b]; line 1 pos 62;
> 'RepartitionByExpression ['a]
> +- Project [a#48 AS b#42]
>+- SubqueryAlias t
>   +- LocalRelation [a#48]
> {code}






[jira] [Issue Comment Deleted] (SPARK-35062) Group exception messages in sql/streaming

2021-05-06 Thread jiaan.geng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiaan.geng updated SPARK-35062:
---
Comment: was deleted

(was: I'm working on.)

> Group exception messages in sql/streaming
> -
>
> Key: SPARK-35062
> URL: https://issues.apache.org/jira/browse/SPARK-35062
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Allison Wang
>Priority: Major
>
> 'sql/core/src/main/scala/org/apache/spark/sql/streaming'
> || Filename||   Count ||
> | DataStreamReader.scala  |   2 |
> | DataStreamWriter.scala  |   9 |
> | StreamingQueryManager.scala |   1 |
> 'sql/core/src/main/scala/org/apache/spark/sql/streaming/ui'
> || Filename   ||   Count ||
> | StreamingQueryPage.scala   |   1 |
> | StreamingQueryStatisticsPage.scala |   1 |
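
For context, a hedged sketch of what "grouping" means in this umbrella 
effort, assuming the centralized error-object style used by sibling 
sub-tasks; the object name, helper name, and message below are illustrative, 
not the actual patch:

{code:scala}
// Illustrative only: in the real effort these helpers live inside Spark's
// own source tree (e.g. org.apache.spark.sql.errors), where the
// AnalysisException constructor is accessible.
package org.apache.spark.sql.errors

import org.apache.spark.sql.AnalysisException

object StreamingQueryErrors {
  // Before: DataStreamReader/DataStreamWriter threw ad-hoc inline
  // exceptions. After: one named helper per message, so the wording
  // stays consistent and is easy to audit.
  def sourceNotSupportedError(source: String): Throwable =
    new AnalysisException(
      s"Data source $source does not support streamed reading")
}
{code}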






[jira] [Updated] (SPARK-35331) Attributes become unknown in RepartitionByExpression after aliased

2021-05-06 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-35331:
-
Summary: Attributes become unknown in RepartitionByExpression after aliased 
 (was: Attributes become unknown for  RepartitionByExpression after aliased)

> Attributes become unknown in RepartitionByExpression after aliased
> --
>
> Key: SPARK-35331
> URL: https://issues.apache.org/jira/browse/SPARK-35331
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.8, 3.0.2, 3.1.1, 3.2.0
>Reporter: Kent Yao
>Priority: Major
>
> {code:java}
> explain extended select a b from values(1) t(a) distribute by a;
> == Parsed Logical Plan ==
> 'RepartitionByExpression ['a]
> +- 'Project ['a AS b#42]
>+- 'SubqueryAlias t
>   +- 'UnresolvedInlineTable [a], [List(1)]
> == Analyzed Logical Plan ==
> org.apache.spark.sql.AnalysisException: cannot resolve 'a' given input 
> columns: [b]; line 1 pos 62;
> 'RepartitionByExpression ['a]
> +- Project [a#48 AS b#42]
>+- SubqueryAlias t
>   +- LocalRelation [a#48]
> {code}






[jira] [Created] (SPARK-35331) Attributes become unknown for RepartitionByExpression after aliased

2021-05-06 Thread Kent Yao (Jira)
Kent Yao created SPARK-35331:


 Summary: Attributes become unknown for  RepartitionByExpression 
after aliased
 Key: SPARK-35331
 URL: https://issues.apache.org/jira/browse/SPARK-35331
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.1, 3.0.2, 2.4.8, 3.2.0
Reporter: Kent Yao



{code:java}
explain extended select a b from values(1) t(a) distribute by a;
== Parsed Logical Plan ==
'RepartitionByExpression ['a]
+- 'Project ['a AS b#42]
   +- 'SubqueryAlias t
  +- 'UnresolvedInlineTable [a], [List(1)]

== Analyzed Logical Plan ==
org.apache.spark.sql.AnalysisException: cannot resolve 'a' given input columns: 
[b]; line 1 pos 62;
'RepartitionByExpression ['a]
+- Project [a#48 AS b#42]
   +- SubqueryAlias t
  +- LocalRelation [a#48]
{code}







[jira] [Assigned] (SPARK-35062) Group exception messages in sql/streaming

2021-05-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35062:


Assignee: Apache Spark

> Group exception messages in sql/streaming
> -
>
> Key: SPARK-35062
> URL: https://issues.apache.org/jira/browse/SPARK-35062
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Allison Wang
>Assignee: Apache Spark
>Priority: Major
>
> 'sql/core/src/main/scala/org/apache/spark/sql/streaming'
> || Filename||   Count ||
> | DataStreamReader.scala  |   2 |
> | DataStreamWriter.scala  |   9 |
> | StreamingQueryManager.scala |   1 |
> 'sql/core/src/main/scala/org/apache/spark/sql/streaming/ui'
> || Filename   ||   Count ||
> | StreamingQueryPage.scala   |   1 |
> | StreamingQueryStatisticsPage.scala |   1 |






[jira] [Commented] (SPARK-35062) Group exception messages in sql/streaming

2021-05-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17340545#comment-17340545
 ] 

Apache Spark commented on SPARK-35062:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/32464

> Group exception messages in sql/streaming
> -
>
> Key: SPARK-35062
> URL: https://issues.apache.org/jira/browse/SPARK-35062
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Allison Wang
>Priority: Major
>
> 'sql/core/src/main/scala/org/apache/spark/sql/streaming'
> || Filename||   Count ||
> | DataStreamReader.scala  |   2 |
> | DataStreamWriter.scala  |   9 |
> | StreamingQueryManager.scala |   1 |
> 'sql/core/src/main/scala/org/apache/spark/sql/streaming/ui'
> || Filename   ||   Count ||
> | StreamingQueryPage.scala   |   1 |
> | StreamingQueryStatisticsPage.scala |   1 |






[jira] [Commented] (SPARK-35062) Group exception messages in sql/streaming

2021-05-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17340546#comment-17340546
 ] 

Apache Spark commented on SPARK-35062:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/32464

> Group exception messages in sql/streaming
> -
>
> Key: SPARK-35062
> URL: https://issues.apache.org/jira/browse/SPARK-35062
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Allison Wang
>Assignee: Apache Spark
>Priority: Major
>
> 'sql/core/src/main/scala/org/apache/spark/sql/streaming'
> || Filename||   Count ||
> | DataStreamReader.scala  |   2 |
> | DataStreamWriter.scala  |   9 |
> | StreamingQueryManager.scala |   1 |
> 'sql/core/src/main/scala/org/apache/spark/sql/streaming/ui'
> || Filename   ||   Count ||
> | StreamingQueryPage.scala   |   1 |
> | StreamingQueryStatisticsPage.scala |   1 |






[jira] [Assigned] (SPARK-35062) Group exception messages in sql/streaming

2021-05-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35062:


Assignee: (was: Apache Spark)

> Group exception messages in sql/streaming
> -
>
> Key: SPARK-35062
> URL: https://issues.apache.org/jira/browse/SPARK-35062
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Allison Wang
>Priority: Major
>
> 'sql/core/src/main/scala/org/apache/spark/sql/streaming'
> || Filename||   Count ||
> | DataStreamReader.scala  |   2 |
> | DataStreamWriter.scala  |   9 |
> | StreamingQueryManager.scala |   1 |
> 'sql/core/src/main/scala/org/apache/spark/sql/streaming/ui'
> || Filename   ||   Count ||
> | StreamingQueryPage.scala   |   1 |
> | StreamingQueryStatisticsPage.scala |   1 |






[jira] [Resolved] (SPARK-35133) EXPLAIN CODEGEN does not work with AQE

2021-05-06 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-35133.
---
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32430
[https://github.com/apache/spark/pull/32430]

> EXPLAIN CODEGEN does not work with AQE
> --
>
> Key: SPARK-35133
> URL: https://issues.apache.org/jira/browse/SPARK-35133
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Cheng Su
>Assignee: Cheng Su
>Priority: Major
> Fix For: 3.2.0
>
>
> `EXPLAIN CODEGEN <query>` (and Dataset.explain("codegen")) prints out the 
> generated code for each stage of the plan. The current implementation matches 
> the `WholeStageCodegenExec` operator in the query plan and prints out the 
> generated code 
> ([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/debug/package.scala#L111-L118]). 
> This does not work with AQE because we wrap the whole query plan inside 
> `AdaptiveSparkPlanExec` and do not run the whole-stage codegen physical plan 
> rule (`CollapseCodegenStages`) eagerly. This introduces an unexpected behavior 
> change for EXPLAIN queries (and Dataset.explain), as we now enable AQE by 
> default.
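
A small sketch of the call path the description refers to, assuming a Spark 
shell with AQE enabled; `debugCodegen()` is the debug-package helper backed by 
the matching logic linked above:

{code:scala}
import org.apache.spark.sql.execution.debug._

val df = spark.range(100).selectExpr("id % 3 AS k").groupBy("k").count()

// With AQE enabled the root physical node is AdaptiveSparkPlanExec, so
// matching only WholeStageCodegenExec finds no subtrees and, before this
// fix, no generated code is printed.
df.debugCodegen()
{code}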






[jira] [Assigned] (SPARK-35133) EXPLAIN CODEGEN does not work with AQE

2021-05-06 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-35133:
-

Assignee: Cheng Su

> EXPLAIN CODEGEN does not work with AQE
> --
>
> Key: SPARK-35133
> URL: https://issues.apache.org/jira/browse/SPARK-35133
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Cheng Su
>Assignee: Cheng Su
>Priority: Major
>
> `EXPLAIN CODEGEN <query>` (and Dataset.explain("codegen")) prints out the 
> generated code for each stage of the plan. The current implementation matches 
> the `WholeStageCodegenExec` operator in the query plan and prints out the 
> generated code 
> ([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/debug/package.scala#L111-L118]). 
> This does not work with AQE because we wrap the whole query plan inside 
> `AdaptiveSparkPlanExec` and do not run the whole-stage codegen physical plan 
> rule (`CollapseCodegenStages`) eagerly. This introduces an unexpected behavior 
> change for EXPLAIN queries (and Dataset.explain), as we now enable AQE by 
> default.






[jira] [Commented] (SPARK-35062) Group exception messages in sql/streaming

2021-05-06 Thread jiaan.geng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17340528#comment-17340528
 ] 

jiaan.geng commented on SPARK-35062:


I'm working on it.

> Group exception messages in sql/streaming
> -
>
> Key: SPARK-35062
> URL: https://issues.apache.org/jira/browse/SPARK-35062
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Allison Wang
>Priority: Major
>
> 'sql/core/src/main/scala/org/apache/spark/sql/streaming'
> || Filename||   Count ||
> | DataStreamReader.scala  |   2 |
> | DataStreamWriter.scala  |   9 |
> | StreamingQueryManager.scala |   1 |
> 'sql/core/src/main/scala/org/apache/spark/sql/streaming/ui'
> || Filename   ||   Count ||
> | StreamingQueryPage.scala   |   1 |
> | StreamingQueryStatisticsPage.scala |   1 |






[jira] [Resolved] (SPARK-35306) Add benchmark results for BLASBenchmark created by Github Actions machines

2021-05-06 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-35306.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32435
[https://github.com/apache/spark/pull/32435]

> Add benchmark results for BLASBenchmark created by Github Actions machines
> --
>
> Key: SPARK-35306
> URL: https://issues.apache.org/jira/browse/SPARK-35306
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.2.0
>Reporter: Byungsoo Oh
>Assignee: Byungsoo Oh
>Priority: Minor
> Fix For: 3.2.0
>
>
> In SPARK-34950, benchmark results were updated to the ones created by Github 
> Actions machines.
>  
> The goal of this Jira issue is to add benchmark results for BLASBenchmark 
> (added in SPARK-33882 and SPARK-35150).






[jira] [Assigned] (SPARK-35306) Add benchmark results for BLASBenchmark created by Github Actions machines

2021-05-06 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-35306:


Assignee: Byungsoo Oh

> Add benchmark results for BLASBenchmark created by Github Actions machines
> --
>
> Key: SPARK-35306
> URL: https://issues.apache.org/jira/browse/SPARK-35306
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.2.0
>Reporter: Byungsoo Oh
>Assignee: Byungsoo Oh
>Priority: Minor
>
> In SPARK-34950, benchmark results were updated to the ones created by Github 
> Actions machines.
>  
> The goal of this Jira issue is to add benchmark results for BLASBenchmark 
> (added in SPARK-33882 and SPARK-35150).






[jira] [Commented] (SPARK-35147) Migrate to resolveWithPruning for two command rules

2021-05-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17340512#comment-17340512
 ] 

Apache Spark commented on SPARK-35147:
--

User 'sigmod' has created a pull request for this issue:
https://github.com/apache/spark/pull/32463

> Migrate to resolveWithPruning for two command rules
> ---
>
> Key: SPARK-35147
> URL: https://issues.apache.org/jira/browse/SPARK-35147
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 3.1.0
>Reporter: Yingyi Bu
>Priority: Major
>
> We can add one TreePattern called "COMMAND". Then, two rules to be migrated:
>  * ResolvePartitionSpec
>  * ResolveCommandsWithIfExists 
>  
> Commit 
> [https://github.com/apache/spark/commit/3db8ec258c4a8438bda73c26fc7b1eb6f9d51631]
>  contains the framework level change and a few example rule changes.
>  
> Example patterns:
> [https://github.com/apache/spark/blob/3db8ec258c4a8438bda73c26fc7b1eb6f9d51631/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreePatterns.scala#L24-L32]
>  
> Example rule:
> [https://github.com/apache/spark/blob/3db8ec258c4a8438bda73c26fc7b1eb6f9d51631/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala]
>  
> [https://github.com/apache/spark/pull/32247] is another example
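
A hedged sketch of the migration shape, assuming the tree-pattern pruning 
framework from the commit linked above; `COMMAND` is the pattern this ticket 
proposes to add, and the rule body is illustrative, not the actual patch:

{code:scala}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.catalyst.trees.TreePattern.COMMAND

object ExampleCommandRule extends Rule[LogicalPlan] {
  // The pruning predicate lets the rule skip entire plan trees that
  // contain no COMMAND node, instead of traversing every operator on
  // every run of the rule.
  def apply(plan: LogicalPlan): LogicalPlan =
    plan.resolveOperatorsWithPruning(_.containsPattern(COMMAND)) {
      case c => c // a real rule rewrites specific command nodes here
    }
}
{code}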






[jira] [Assigned] (SPARK-35147) Migrate to resolveWithPruning for two command rules

2021-05-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35147:


Assignee: Apache Spark

> Migrate to resolveWithPruning for two command rules
> ---
>
> Key: SPARK-35147
> URL: https://issues.apache.org/jira/browse/SPARK-35147
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 3.1.0
>Reporter: Yingyi Bu
>Assignee: Apache Spark
>Priority: Major
>
> We can add one TreePattern called "COMMAND". Then, two rules to be migrated:
>  * ResolvePartitionSpec
>  * ResolveCommandsWithIfExists 
>  
> Commit 
> [https://github.com/apache/spark/commit/3db8ec258c4a8438bda73c26fc7b1eb6f9d51631]
>  contains the framework level change and a few example rule changes.
>  
> Example patterns:
> [https://github.com/apache/spark/blob/3db8ec258c4a8438bda73c26fc7b1eb6f9d51631/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreePatterns.scala#L24-L32]
>  
> Example rule:
> [https://github.com/apache/spark/blob/3db8ec258c4a8438bda73c26fc7b1eb6f9d51631/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala]
>  
> [https://github.com/apache/spark/pull/32247] is another example






[jira] [Commented] (SPARK-35147) Migrate to resolveWithPruning for two command rules

2021-05-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17340511#comment-17340511
 ] 

Apache Spark commented on SPARK-35147:
--

User 'sigmod' has created a pull request for this issue:
https://github.com/apache/spark/pull/32463

> Migrate to resolveWithPruning for two command rules
> ---
>
> Key: SPARK-35147
> URL: https://issues.apache.org/jira/browse/SPARK-35147
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 3.1.0
>Reporter: Yingyi Bu
>Priority: Major
>
> We can add one TreePattern called "COMMAND". Then, two rules to be migrated:
>  * ResolvePartitionSpec
>  * ResolveCommandsWithIfExists 
>  
> Commit 
> [https://github.com/apache/spark/commit/3db8ec258c4a8438bda73c26fc7b1eb6f9d51631]
>  contains the framework level change and a few example rule changes.
>  
> Example patterns:
> [https://github.com/apache/spark/blob/3db8ec258c4a8438bda73c26fc7b1eb6f9d51631/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreePatterns.scala#L24-L32]
>  
> Example rule:
> [https://github.com/apache/spark/blob/3db8ec258c4a8438bda73c26fc7b1eb6f9d51631/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala]
>  
> [https://github.com/apache/spark/pull/32247] is another example






[jira] [Assigned] (SPARK-35147) Migrate to resolveWithPruning for two command rules

2021-05-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35147:


Assignee: (was: Apache Spark)

> Migrate to resolveWithPruning for two command rules
> ---
>
> Key: SPARK-35147
> URL: https://issues.apache.org/jira/browse/SPARK-35147
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 3.1.0
>Reporter: Yingyi Bu
>Priority: Major
>
> We can add one TreePattern called "COMMAND". Then, two rules to be migrated:
>  * ResolvePartitionSpec
>  * ResolveCommandsWithIfExists 
>  
> Commit 
> [https://github.com/apache/spark/commit/3db8ec258c4a8438bda73c26fc7b1eb6f9d51631]
>  contains the framework level change and a few example rule changes.
>  
> Example patterns:
> [https://github.com/apache/spark/blob/3db8ec258c4a8438bda73c26fc7b1eb6f9d51631/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreePatterns.scala#L24-L32]
>  
> Example rule:
> [https://github.com/apache/spark/blob/3db8ec258c4a8438bda73c26fc7b1eb6f9d51631/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala]
>  
> [https://github.com/apache/spark/pull/32247] is another example






[jira] [Commented] (SPARK-35293) Use the newer dsdgen for TPCDSQueryTestSuite

2021-05-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17340504#comment-17340504
 ] 

Apache Spark commented on SPARK-35293:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/32462

> Use the newer dsdgen for TPCDSQueryTestSuite
> 
>
> Key: SPARK-35293
> URL: https://issues.apache.org/jira/browse/SPARK-35293
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.2.0
>Reporter: Takeshi Yamamuro
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 3.2.0
>
>
> This PR intends to replace `maropu/spark-tpcds-datagen` with 
> `databricks/tpcds-kit` in order to use a newer dsdgen, and to update the 
> golden files in `tpcds-query-results`.






[jira] [Commented] (SPARK-35293) Use the newer dsdgen for TPCDSQueryTestSuite

2021-05-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17340502#comment-17340502
 ] 

Apache Spark commented on SPARK-35293:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/32462

> Use the newer dsdgen for TPCDSQueryTestSuite
> 
>
> Key: SPARK-35293
> URL: https://issues.apache.org/jira/browse/SPARK-35293
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.2.0
>Reporter: Takeshi Yamamuro
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 3.2.0
>
>
> This PR intends to replace `maropu/spark-tpcds-datagen` with 
> `databricks/tpcds-kit` in order to use a newer dsdgen, and to update the 
> golden files in `tpcds-query-results`.






[jira] [Commented] (SPARK-35293) Use the newer dsdgen for TPCDSQueryTestSuite

2021-05-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17340503#comment-17340503
 ] 

Apache Spark commented on SPARK-35293:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/32462

> Use the newer dsdgen for TPCDSQueryTestSuite
> 
>
> Key: SPARK-35293
> URL: https://issues.apache.org/jira/browse/SPARK-35293
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.2.0
>Reporter: Takeshi Yamamuro
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 3.2.0
>
>
> This PR intends to replace `maropu/spark-tpcds-datagen` with 
> `databricks/tpcds-kit` in order to use a newer dsdgen, and to update the 
> golden files in `tpcds-query-results`.






[jira] [Commented] (SPARK-35192) Port minimal TPC-DS datagen code from databricks/spark-sql-perf

2021-05-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17340501#comment-17340501
 ] 

Apache Spark commented on SPARK-35192:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/32462

> Port minimal TPC-DS datagen code from databricks/spark-sql-perf
> ---
>
> Key: SPARK-35192
> URL: https://issues.apache.org/jira/browse/SPARK-35192
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.2.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Minor
> Fix For: 3.2.0
>
>
> This PR aims at porting minimal code to generate TPC-DS data from 
> databricks/spark-sql-perf. The classes in a new class file tpcdsDatagen.scala 
> are basically copied from the databricks/spark-sql-perf codebase.
> We frequently use TPCDS data now for benchmarks/tests, but the classes for 
> the TPCDS schemas of datagen and benchmarks/tests are managed separately, 
> e.g.,
> https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/TPCDSBase.scala
> https://github.com/databricks/spark-sql-perf/blob/master/src/main/scala/com/databricks/spark/sql/perf/tpcds/TPCDSTables.scala
> I think this causes some inconveniences, e.g., we need to update both files 
> in separate repositories whenever we update the TPCDS schema (#32037). So, it 
> would be useful for the Spark codebase to generate the data by referring to 
> the same schema definition.






[jira] [Commented] (SPARK-35192) Port minimal TPC-DS datagen code from databricks/spark-sql-perf

2021-05-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17340500#comment-17340500
 ] 

Apache Spark commented on SPARK-35192:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/32462

> Port minimal TPC-DS datagen code from databricks/spark-sql-perf
> ---
>
> Key: SPARK-35192
> URL: https://issues.apache.org/jira/browse/SPARK-35192
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.2.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Minor
> Fix For: 3.2.0
>
>
> This PR aims at porting minimal code to generate TPC-DS data from 
> databricks/spark-sql-perf. The classes in a new class file tpcdsDatagen.scala 
> are basically copied from the databricks/spark-sql-perf codebase.
> We frequently use TPCDS data now for benchmarks/tests, but the classes for 
> the TPCDS schemas of datagen and benchmarks/tests are managed separately, 
> e.g.,
> https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/TPCDSBase.scala
> https://github.com/databricks/spark-sql-perf/blob/master/src/main/scala/com/databricks/spark/sql/perf/tpcds/TPCDSTables.scala
> I think this causes some inconveniences, e.g., we need to update both files 
> in separate repositories whenever we update the TPCDS schema (#32037). So, it 
> would be useful for the Spark codebase to generate the data by referring to 
> the same schema definition.






[jira] [Commented] (SPARK-34795) Adds a new job in GitHub Actions to check the output of TPC-DS queries

2021-05-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17340498#comment-17340498
 ] 

Apache Spark commented on SPARK-34795:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/32462

> Adds a new job in GitHub Actions to check the output of TPC-DS queries
> --
>
> Key: SPARK-34795
> URL: https://issues.apache.org/jira/browse/SPARK-34795
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.2.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Major
> Fix For: 3.2.0
>
>
> This ticket aims at adding a new job in GitHub Actions to check the output of 
> TPC-DS queries. There are some cases where we noticed runtime-related bugs 
> after merging commits (e.g. SPARK-33822). Therefore, I think it is worth 
> adding a new job in GitHub Actions to check the query output of TPC-DS (sf=1).






[jira] [Resolved] (SPARK-34226) Reduce RepartitionOperation num partitions to its child max row

2021-05-06 Thread ulysses you (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ulysses you resolved SPARK-34226.
-
Resolution: Won't Fix

> Reduce RepartitionOperation num partitions to its child max row
> ---
>
> Key: SPARK-34226
> URL: https://issues.apache.org/jira/browse/SPARK-34226
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: ulysses you
>Priority: Minor
>
> There is no point in repartitioning data when the partition number is larger 
> than the row count; it only wastes resources on redundant tasks.
> In ETL cases, we often inject `repartition` or `distribute by` to reduce the 
> number of output partitions, but the requested partition number may still be 
> bigger than the row count. It is better to try our best to eliminate such 
> redundant partitions; see the sketch below.
>  
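
A hedged sketch of the idea (illustrative only, not an actual patch; the 
ticket was later resolved as Won't Fix): when the child's known maximum row 
count is below the requested partition count, shrink the repartition.

{code:scala}
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Repartition}
import org.apache.spark.sql.catalyst.rules.Rule

object ReducePartitionsToMaxRows extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan.transform {
    // More partitions than rows only schedules empty tasks, so cap
    // numPartitions at the child's maxRows when that bound is known.
    case Repartition(numPartitions, shuffle, child)
        if child.maxRows.exists(_ < numPartitions) =>
      Repartition(child.maxRows.get.toInt, shuffle, child)
  }
}
{code}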






[jira] [Updated] (SPARK-35327) Filters out the TPC-DS queries that can cause flaky test results

2021-05-06 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-35327:
-
Summary: Filters out the TPC-DS queries that can cause flaky test results  
(was: Merge similar v1.4/v2.7 TPCDS queries)

> Filters out the TPC-DS queries that can cause flaky test results
> 
>
> Key: SPARK-35327
> URL: https://issues.apache.org/jira/browse/SPARK-35327
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.0.2, 3.1.1, 3.2.0
>Reporter: Takeshi Yamamuro
>Priority: Major
>
> This ticket aims at merging similar 
> v1.4 (`resources/tpcds`) / v2.7 (`resources/tpcds-v2.7.0`) TPCDS queries; it 
> copies 13 query files (q6,q11,q12,q20,q24,q34,q47,q57,q64,q74,q75,q78,q98) 
> from `resources/tpcds-v2.7.0` to `resources/tpcds`, and then removes the files 
> in `resources/tpcds-v2.7.0`.
> I saw `TPCDSQueryTestSuite` fail nondeterministically because output row 
> orders were different from those in the golden files. For example, the 
> failure in the GA job, 
> https://github.com/linhongliu-db/spark/runs/2507928605?check_suite_focus=true,
>  happened because the `tpcds/q6.sql` query output rows were only sorted by 
> `cnt`:
> https://github.com/apache/spark/blob/a0c76a8755a148e2bd774edcda12fe20f2f38c75/sql/core/src/test/resources/tpcds/q6.sql#L20
> Actually, `tpcds/q6.sql` and `tpcds-v2.7.0/q6.sql` are almost the same; the 
> only difference is that `tpcds-v2.7.0/q6.sql` sorts by both `cnt` and 
> `a.ca_state`:
> https://github.com/apache/spark/blob/a0c76a8755a148e2bd774edcda12fe20f2f38c75/sql/core/src/test/resources/tpcds-v2.7.0/q6.sql#L22
> So, I think it's okay just to use `tpcds-v2.7.0/q6.sql` for stable testing in 
> this case.






[jira] [Updated] (SPARK-35327) Filters out the TPC-DS queries that can cause flaky test results

2021-05-06 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-35327:
-
Description: 
This ticket aims at filtering out TPCDS v1.4 q6 and q75 in 
`TPCDSQueryTestSuite`.

I saw `TPCDSQueryTestSuite` fail nondeterministically because output row 
orders were different from those in the golden files. For example, the failure 
in the GA job, 
https://github.com/linhongliu-db/spark/runs/2507928605?check_suite_focus=true, 
happened because the `tpcds/q6.sql` query output rows were only sorted by `cnt`:

https://github.com/apache/spark/blob/a0c76a8755a148e2bd774edcda12fe20f2f38c75/sql/core/src/test/resources/tpcds/q6.sql#L20
Actually, `tpcds/q6.sql` and `tpcds-v2.7.0/q6.sql` are almost the same; the 
only difference is that `tpcds-v2.7.0/q6.sql` sorts by both `cnt` and `a.ca_state`:
https://github.com/apache/spark/blob/a0c76a8755a148e2bd774edcda12fe20f2f38c75/sql/core/src/test/resources/tpcds-v2.7.0/q6.sql#L22
So, I think it's okay just to test `tpcds-v2.7.0/q6.sql` in this case (q75 has 
the same issue).
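
A tiny sketch of the tie-breaking problem, assuming a spark-shell session and 
illustrative data (not the TPC-DS tables): rows that tie on the single sort 
key can come back in different orders across runs, which is what breaks the 
golden-file diff.

{code:scala}
import spark.implicits._

val df = Seq(("CA", 5), ("TX", 5), ("NY", 5)).toDF("ca_state", "cnt")

// Sorting only by cnt leaves the relative order of tied states
// unspecified, so repeated runs can produce different row orders.
df.orderBy("cnt").show()

// Adding the second sort key, as tpcds-v2.7.0/q6.sql does, pins the order.
df.orderBy("cnt", "ca_state").show()
{code}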

  was:
This ticket aims at merging similar 
v1.4 (`resources/tpcds`) / v2.7 (`resources/tpcds-v2.7.0`) TPCDS queries; it copies 
13 query files (q6,q11,q12,q20,q24,q34,q47,q57,q64,q74,q75,q78,q98) 
from `resources/tpcds-v2.7.0` to `resources/tpcds`, and then removes the files in 
`resources/tpcds-v2.7.0`.

I saw `TPCDSQueryTestSuite` fail nondeterministically because output row 
orders were different from those in the golden files. For example, the failure 
in the GA job, 
https://github.com/linhongliu-db/spark/runs/2507928605?check_suite_focus=true, 
happened because the `tpcds/q6.sql` query output rows were only sorted by `cnt`:

https://github.com/apache/spark/blob/a0c76a8755a148e2bd774edcda12fe20f2f38c75/sql/core/src/test/resources/tpcds/q6.sql#L20
Actually, `tpcds/q6.sql` and `tpcds-v2.7.0/q6.sql` are almost the same; the 
only difference is that `tpcds-v2.7.0/q6.sql` sorts by both `cnt` and `a.ca_state`:
https://github.com/apache/spark/blob/a0c76a8755a148e2bd774edcda12fe20f2f38c75/sql/core/src/test/resources/tpcds-v2.7.0/q6.sql#L22
So, I think it's okay just to use `tpcds-v2.7.0/q6.sql` for stable testing in 
this case.



> Filters out the TPC-DS queries that can cause flaky test results
> 
>
> Key: SPARK-35327
> URL: https://issues.apache.org/jira/browse/SPARK-35327
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.0.2, 3.1.1, 3.2.0
>Reporter: Takeshi Yamamuro
>Priority: Major
>
> This ticket aims at filtering out TPCDS v1.4 q6 and q75 in 
> `TPCDSQueryTestSuite`.
> I saw `TPCDSQueryTestSuite` fail nondeterministically because output row 
> orders were different from those in the golden files. For example, the 
> failure in the GA job, 
> https://github.com/linhongliu-db/spark/runs/2507928605?check_suite_focus=true,
>  happened because the `tpcds/q6.sql` query output rows were only sorted by 
> `cnt`:
> https://github.com/apache/spark/blob/a0c76a8755a148e2bd774edcda12fe20f2f38c75/sql/core/src/test/resources/tpcds/q6.sql#L20
> Actually, `tpcds/q6.sql` and `tpcds-v2.7.0/q6.sql` are almost the same; the 
> only difference is that `tpcds-v2.7.0/q6.sql` sorts by both `cnt` and 
> `a.ca_state`:
> https://github.com/apache/spark/blob/a0c76a8755a148e2bd774edcda12fe20f2f38c75/sql/core/src/test/resources/tpcds-v2.7.0/q6.sql#L22
> So, I think it's okay just to test `tpcds-v2.7.0/q6.sql` in this case (q75 
> has the same issue).






[jira] [Commented] (SPARK-35147) Migrate to resolveWithPruning for two command rules

2021-05-06 Thread Yingyi Bu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17340491#comment-17340491
 ] 

Yingyi Bu commented on SPARK-35147:
---

I'm working on this issue.

> Migrate to resolveWithPruning for two command rules
> ---
>
> Key: SPARK-35147
> URL: https://issues.apache.org/jira/browse/SPARK-35147
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 3.1.0
>Reporter: Yingyi Bu
>Priority: Major
>
> We can add one TreePattern called "COMMAND". Then, two rules to be migrated:
>  * ResolvePartitionSpec
>  * ResolveCommandsWithIfExists 
>  
> Commit 
> [https://github.com/apache/spark/commit/3db8ec258c4a8438bda73c26fc7b1eb6f9d51631]
>  contains the framework level change and a few example rule changes.
>  
> Example patterns:
> [https://github.com/apache/spark/blob/3db8ec258c4a8438bda73c26fc7b1eb6f9d51631/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreePatterns.scala#L24-L32]
>  
> Example rule:
> [https://github.com/apache/spark/blob/3db8ec258c4a8438bda73c26fc7b1eb6f9d51631/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala]
>  
> [https://github.com/apache/spark/pull/32247] is another example






[jira] [Assigned] (SPARK-35146) Migrate to transformWithPruning or resolveWithPruning for rules in finishAnalysis

2021-05-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35146:


Assignee: (was: Apache Spark)

> Migrate to transformWithPruning or resolveWithPruning for rules in 
> finishAnalysis
> -
>
> Key: SPARK-35146
> URL: https://issues.apache.org/jira/browse/SPARK-35146
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 3.1.0
>Reporter: Yingyi Bu
>Priority: Major
>
> Rules in org/apache/spark/sql/catalyst/optimizer/finishAnalysis.scala, except 
> RewriteNonCorrelatedExists, which was done in SPARK-35075.
>  
> Commit 
> [https://github.com/apache/spark/commit/3db8ec258c4a8438bda73c26fc7b1eb6f9d51631]
>  contains the framework level change and a few example rule changes.
>  
> Example patterns:
> [https://github.com/apache/spark/blob/3db8ec258c4a8438bda73c26fc7b1eb6f9d51631/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreePatterns.scala#L24-L32]
>  
> Example rule:
> [https://github.com/apache/spark/blob/3db8ec258c4a8438bda73c26fc7b1eb6f9d51631/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala]
>  
> [https://github.com/apache/spark/pull/32247] is another example






[jira] [Commented] (SPARK-35146) Migrate to transformWithPruning or resolveWithPruning for rules in finishAnalysis

2021-05-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17340490#comment-17340490
 ] 

Apache Spark commented on SPARK-35146:
--

User 'sigmod' has created a pull request for this issue:
https://github.com/apache/spark/pull/32461

> Migrate to transformWithPruning or resolveWithPruning for rules in 
> finishAnalysis
> -
>
> Key: SPARK-35146
> URL: https://issues.apache.org/jira/browse/SPARK-35146
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 3.1.0
>Reporter: Yingyi Bu
>Priority: Major
>
> Rules in org/apache/spark/sql/catalyst/optimizer/finishAnalysis.scala, except 
> RewriteNonCorrelatedExists, which was done in SPARK-35075.
>  
> Commit 
> [https://github.com/apache/spark/commit/3db8ec258c4a8438bda73c26fc7b1eb6f9d51631]
>  contains the framework level change and a few example rule changes.
>  
> Example patterns:
> [https://github.com/apache/spark/blob/3db8ec258c4a8438bda73c26fc7b1eb6f9d51631/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreePatterns.scala#L24-L32]
>  
> Example rule:
> [https://github.com/apache/spark/blob/3db8ec258c4a8438bda73c26fc7b1eb6f9d51631/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala]
>  
> [https://github.com/apache/spark/pull/32247] is another example






[jira] [Assigned] (SPARK-35146) Migrate to transformWithPruning or resolveWithPruning for rules in finishAnalysis

2021-05-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35146:


Assignee: Apache Spark

> Migrate to transformWithPruning or resolveWithPruning for rules in 
> finishAnalysis
> -
>
> Key: SPARK-35146
> URL: https://issues.apache.org/jira/browse/SPARK-35146
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 3.1.0
>Reporter: Yingyi Bu
>Assignee: Apache Spark
>Priority: Major
>
> Rules in org/apache/spark/sql/catalyst/optimizer/finishAnalysis.scala, except 
> RewriteNonCorrelatedExists, which was done in SPARK-35075.
>  
> Commit 
> [https://github.com/apache/spark/commit/3db8ec258c4a8438bda73c26fc7b1eb6f9d51631]
>  contains the framework level change and a few example rule changes.
>  
> Example patterns:
> [https://github.com/apache/spark/blob/3db8ec258c4a8438bda73c26fc7b1eb6f9d51631/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreePatterns.scala#L24-L32]
>  
> Example rule:
> [https://github.com/apache/spark/blob/3db8ec258c4a8438bda73c26fc7b1eb6f9d51631/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala]
>  
> [https://github.com/apache/spark/pull/32247] is another example






[jira] [Commented] (SPARK-35146) Migrate to transformWithPruning or resolveWithPruning for rules in finishAnalysis

2021-05-06 Thread Yingyi Bu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17340489#comment-17340489
 ] 

Yingyi Bu commented on SPARK-35146:
---

I'm working on this.

> Migrate to transformWithPruning or resolveWithPruning for rules in 
> finishAnalysis
> -
>
> Key: SPARK-35146
> URL: https://issues.apache.org/jira/browse/SPARK-35146
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 3.1.0
>Reporter: Yingyi Bu
>Priority: Major
>
> Rules in org/apache/spark/sql/catalyst/optimizer/finishAnalysis.scala, except 
> RewriteNonCorrelatedExists, which was already handled in SPARK-35075.
>  
> Commit 
> [https://github.com/apache/spark/commit/3db8ec258c4a8438bda73c26fc7b1eb6f9d51631]
>  contains the framework-level change and a few example rule changes.
>  
> Example patterns:
> [https://github.com/apache/spark/blob/3db8ec258c4a8438bda73c26fc7b1eb6f9d51631/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreePatterns.scala#L24-L32]
>  
> Example rule:
> [https://github.com/apache/spark/blob/3db8ec258c4a8438bda73c26fc7b1eb6f9d51631/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala]
>  
> [https://github.com/apache/spark/pull/32247] is another example






[jira] [Commented] (SPARK-35253) Upgrade Janino from 3.0.16 to 3.1.4

2021-05-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17340486#comment-17340486
 ] 

Apache Spark commented on SPARK-35253:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/32455

> Upgrade Janino from 3.0.16 to 3.1.4
> ---
>
> Key: SPARK-35253
> URL: https://issues.apache.org/jira/browse/SPARK-35253
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, SQL
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Priority: Minor
>
> From the [change log|http://janino-compiler.github.io/janino/changelog.html], 
> the Janino 3.0.x line has been deprecated, so we can use the 3.1.x line 
> instead.






[jira] [Updated] (SPARK-35253) Upgrade Janino from 3.0.16 to 3.1.4

2021-05-06 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-35253:
-
Summary: Upgrade Janino from 3.0.16 to 3.1.4  (was: Upgrade Janino from 
3.0.x to 3.1.x)

> Upgrade Janino from 3.0.16 to 3.1.4
> ---
>
> Key: SPARK-35253
> URL: https://issues.apache.org/jira/browse/SPARK-35253
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, SQL
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Priority: Minor
>
> From the [change log|http://janino-compiler.github.io/janino/changelog.html], 
> the Janino 3.0.x line has been deprecated, so we can use the 3.1.x line 
> instead.






[jira] [Commented] (SPARK-35133) EXPLAIN CODEGEN does not work with AQE

2021-05-06 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17340476#comment-17340476
 ] 

Dongjoon Hyun commented on SPARK-35133:
---

Thank you, [~chengsu] and all.
I collected this issue as a subtask of SPARK-33828 to give it more visibility.

> EXPLAIN CODEGEN does not work with AQE
> --
>
> Key: SPARK-35133
> URL: https://issues.apache.org/jira/browse/SPARK-35133
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Cheng Su
>Priority: Major
>
> `EXPLAIN CODEGEN` (and Dataset.explain("codegen")) prints out the generated 
> code for each stage of the plan. The current implementation matches the 
> `WholeStageCodegenExec` operator in the query plan and prints out its 
> generated code 
> ([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/debug/package.scala#L111-L118]).
> This does not work with AQE, because we wrap the whole query plan inside 
> `AdaptiveSparkPlanExec` and do not run the whole-stage code-gen physical plan 
> rule (`CollapseCodegenStages`) eagerly. This introduces an unexpected 
> behavior change for EXPLAIN queries (and Dataset.explain), as AQE is now 
> enabled by default.
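A quick way to see the behavior difference from the shell (a sketch; any simple query works):

{code:scala}
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.range(10).selectExpr("id * 2 AS x").explain("codegen")
// With AQE on, the plan root is AdaptiveSparkPlanExec, so no
// WholeStageCodegenExec subtree is matched and no generated code is printed.

spark.conf.set("spark.sql.adaptive.enabled", "false")
spark.range(10).selectExpr("id * 2 AS x").explain("codegen")
// With AQE off, the WholeStageCodegen subtree and its generated code appear.
{code}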






[jira] [Updated] (SPARK-35133) EXPLAIN CODEGEN does not work with AQE

2021-05-06 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-35133:
--
Parent: SPARK-33828
Issue Type: Sub-task  (was: Bug)

> EXPLAIN CODEGEN does not work with AQE
> --
>
> Key: SPARK-35133
> URL: https://issues.apache.org/jira/browse/SPARK-35133
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Cheng Su
>Priority: Major
>
> `EXPLAIN CODEGEN` (and Dataset.explain("codegen")) prints out the generated 
> code for each stage of the plan. The current implementation matches the 
> `WholeStageCodegenExec` operator in the query plan and prints out its 
> generated code 
> ([https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/debug/package.scala#L111-L118]).
> This does not work with AQE, because we wrap the whole query plan inside 
> `AdaptiveSparkPlanExec` and do not run the whole-stage code-gen physical plan 
> rule (`CollapseCodegenStages`) eagerly. This introduces an unexpected 
> behavior change for EXPLAIN queries (and Dataset.explain), as AQE is now 
> enabled by default.






[jira] [Commented] (SPARK-35293) Use the newer dsdgen for TPCDSQueryTestSuite

2021-05-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17340455#comment-17340455
 ] 

Apache Spark commented on SPARK-35293:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/32460

> Use the newer dsdgen for TPCDSQueryTestSuite
> 
>
> Key: SPARK-35293
> URL: https://issues.apache.org/jira/browse/SPARK-35293
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.2.0
>Reporter: Takeshi Yamamuro
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 3.2.0
>
>
> This ticket intends to replace `maropu/spark-tpcds-datagen` with 
> `databricks/tpcds-kit` in order to use a newer dsdgen, and to update the 
> golden files in `tpcds-query-results`.






[jira] [Commented] (SPARK-30466) remove dependency on jackson-mapper-asl-1.9.13 and jackson-core-asl-1.9.13

2021-05-06 Thread Steven (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17340428#comment-17340428
 ] 

Steven commented on SPARK-30466:


Is it possible this could be addressed in Spark 3.2? Like the original reporter 
([~mburgener]), we have these two libraries flagged by our vulnerability 
scanners for the security vulnerabilities listed below. This causes extra 
overhead at each GA release, for something which is deprecated and replaced 
anyway. Thank you.

> remove dependency on jackson-mapper-asl-1.9.13 and jackson-core-asl-1.9.13
> --
>
> Key: SPARK-30466
> URL: https://issues.apache.org/jira/browse/SPARK-30466
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.4.4, 3.0.0
>Reporter: Michael Burgener
>Priority: Major
>  Labels: security
>
> These 2 libraries are deprecated and replaced by the jackson-databind 
> libraries, which are already included. These two libraries are flagged by our 
> vulnerability scanners as having the following security vulnerabilities. 
> I've set the priority to Major due to the critical nature of the CVEs, and 
> hopefully they can be addressed quickly. Please note, I'm not a developer but 
> work in InfoSec, and this was flagged when we incorporated Spark into our 
> product. If you feel the priority is not set correctly, please change it 
> accordingly. I'll watch the issue and flag our dev team to update once 
> resolved.
> jackson-mapper-asl-1.9.13
> CVE-2018-7489 (CVSS 3.0 Score 9.8 CRITICAL)
> [https://nvd.nist.gov/vuln/detail/CVE-2018-7489] 
>  
> CVE-2017-7525 (CVSS 3.0 Score 9.8 CRITICAL)
> [https://nvd.nist.gov/vuln/detail/CVE-2017-7525]
>  
> CVE-2017-17485 (CVSS 3.0 Score 9.8 CRITICAL)
> [https://nvd.nist.gov/vuln/detail/CVE-2017-17485]
>  
> CVE-2017-15095 (CVSS 3.0 Score 9.8 CRITICAL)
> [https://nvd.nist.gov/vuln/detail/CVE-2017-15095]
>  
> CVE-2018-5968 (CVSS 3.0 Score 8.1 High)
> [https://nvd.nist.gov/vuln/detail/CVE-2018-5968]
>  
> jackson-core-asl-1.9.13
> CVE-2016-7051 (CVSS 3.0 Score 8.6 High)
> https://nvd.nist.gov/vuln/detail/CVE-2016-7051






[jira] [Commented] (SPARK-26164) [SQL] Allow FileFormatWriter to write multiple partitions/buckets without sort

2021-05-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17340296#comment-17340296
 ] 

Apache Spark commented on SPARK-26164:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/32459

> [SQL] Allow FileFormatWriter to write multiple partitions/buckets without sort
> --
>
> Key: SPARK-26164
> URL: https://issues.apache.org/jira/browse/SPARK-26164
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0, 3.1.0
>Reporter: Cheng Su
>Assignee: Cheng Su
>Priority: Minor
> Fix For: 3.2.0
>
>
> Problem:
> Currently Spark always requires a local sort on partition/bucket columns 
> before writing to the output table [1]. The disadvantage is that the sort may 
> waste reserved CPU time on the executor due to spill. Hive does not require 
> a local sort before writing the output table [2], and we saw performance 
> regressions when migrating Hive workloads to Spark.
>  
> Proposal:
> We can avoid the local sort by keeping a mapping between file paths and 
> output writers. When writing a row to a new file path, we create a new output 
> writer; otherwise we re-use the existing writer for that path (the main 
> change would be in FileFormatDataWriter.scala). This is very similar to what 
> Hive does in [2].
> Since the new behavior (avoiding the sort by keeping multiple output writers) 
> consumes more memory on the executor than the current behavior (only one 
> output writer open at a time), we can add a config to switch between the two 
> behaviors.
>  
> [1]: spark FileFormatWriter.scala - 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L123]
> [2]: hive FileSinkOperator.java - 
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L510]
>  
>  
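A minimal sketch of the keep-multiple-writers idea; the `OutputWriter` trait and `newWriter` factory below are simplified stand-ins for illustration, not the actual FileFormatDataWriter code:

{code:scala}
import scala.collection.mutable
import org.apache.spark.sql.catalyst.InternalRow

// Simplified stand-in for the datasource OutputWriter interface.
trait OutputWriter {
  def write(row: InternalRow): Unit
  def close(): Unit
}

// Keeps one open writer per file path, so rows can arrive in any
// partition/bucket order without a preceding local sort.
class ConcurrentOutputWriters(newWriter: String => OutputWriter) {
  private val writers = mutable.HashMap.empty[String, OutputWriter]

  def write(path: String, row: InternalRow): Unit =
    writers.getOrElseUpdate(path, newWriter(path)).write(row)

  // All writers stay open until the task finishes; this is the extra memory
  // cost that the proposed config switch would guard against.
  def closeAll(): Unit = {
    writers.values.foreach(_.close())
    writers.clear()
  }
}
{code}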






[jira] [Commented] (SPARK-35326) Upgrade Jersey to 2.34

2021-05-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17340270#comment-17340270
 ] 

Apache Spark commented on SPARK-35326:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/32458

> Upgrade Jersey to 2.34
> --
>
> Key: SPARK-35326
> URL: https://issues.apache.org/jira/browse/SPARK-35326
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.0.2, 3.1.1, 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 3.2.0
>
>
> CVE-2021-28168, a local information disclosure vulnerability, has been reported.
> Spark 3.1.1, 3.0.2, and 3.2.0 use the affected version 2.30.
> https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-28168






[jira] [Resolved] (SPARK-35326) Upgrade Jersey to 2.34

2021-05-06 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-35326.
---
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32453
[https://github.com/apache/spark/pull/32453]

> Upgrade Jersey to 2.34
> --
>
> Key: SPARK-35326
> URL: https://issues.apache.org/jira/browse/SPARK-35326
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.0.2, 3.1.1, 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 3.2.0
>
>
> CVE-2021-28168, a local information disclosure vulnerability, has been reported.
> Spark 3.1.1, 3.0.2, and 3.2.0 use the affected version 2.30.
> https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-28168






[jira] [Created] (SPARK-35330) Tune shuffle requests frequency dynamically in the case of Netty OOM

2021-05-06 Thread wuyi (Jira)
wuyi created SPARK-35330:


 Summary: Tune shuffle requests frequency dynamically in the case 
of Netty OOM
 Key: SPARK-35330
 URL: https://issues.apache.org/jira/browse/SPARK-35330
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.2.0
Reporter: wuyi


In [https://github.com/apache/spark/pull/32287], the PR proposes to use a flag 
to indicate the Netty OOM status and to defer fetch requests when OOM happens. 
However, it doesn't change the request frequency after OOM occurs, so it's 
possible to hit OOM again later if the deferred requests are still too many to 
fetch concurrently. Therefore, tuning the fetch frequency dynamically might be 
a good way to solve the issue. Please see the detailed discussion at 
https://github.com/apache/spark/pull/32287#discussion_r625287419
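One possible shape of such dynamic tuning, as an illustrative sketch only; the names and the halve-on-OOM policy are assumptions, not the actual ShuffleBlockFetcherIterator logic:

{code:scala}
// Halve the allowed in-flight fetch requests on a Netty OOM and recover
// slowly on success, so repeated OOMs quickly drive the request rate down.
class FetchRateLimiter(initialMax: Int) {
  private var maxReqsInFlight = initialMax

  def current: Int = maxReqsInFlight

  def onNettyOom(): Unit =
    maxReqsInFlight = math.max(1, maxReqsInFlight / 2)

  def onFetchSuccess(): Unit =
    maxReqsInFlight = math.min(initialMax, maxReqsInFlight + 1)
}
{code}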






[jira] [Assigned] (SPARK-35329) Split generated switch code into pieces in ExpandExec

2021-05-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35329:


Assignee: (was: Apache Spark)

> Split generated switch code into pieces in ExpandExec
> -
>
> Key: SPARK-35329
> URL: https://issues.apache.org/jira/browse/SPARK-35329
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> This ticket aims at splitting the generated switch code into smaller pieces in 
> `ExpandExec`. In the current master, even a simple query like the one below 
> generates a large method whose size (`maxMethodCodeSize:7448`) is close to 
> `8000` (`CodeGenerator.DEFAULT_JVM_HUGE_METHOD_LIMIT`):
> {code:scala}
> scala> val df = Seq(("2016-03-27 19:39:34", 1, "a"), ("2016-03-27 19:39:56", 2, "a"), ("2016-03-27 19:39:27", 4, "b")).toDF("time", "value", "id")
> scala> val rdf = df.select(window($"time", "10 seconds", "3 seconds", "0 second"), $"value").orderBy($"window.start".asc, $"value".desc).select("value")
> scala> sql("SET spark.sql.adaptive.enabled=false")
> scala> import org.apache.spark.sql.execution.debug._
> scala> rdf.debugCodegen
> Found 2 WholeStageCodegen subtrees.
> == Subtree 1 / 2 (maxMethodCodeSize:7448; maxConstantPoolSize:189(0.29% used); numInnerClasses:0) ==
> 
> *(1) Project [window#34.start AS _gen_alias_39#39, value#11]
> +- *(1) Filter ((isnotnull(window#34) AND (cast(time#10 as timestamp) >= window#34.start)) AND (cast(time#10 as timestamp) < window#34.end))
>    +- *(1) Expand [List(named_struct(start, precisetimestampcon...
> {code}






[jira] [Assigned] (SPARK-35329) Split generated switch code into pieces in ExpandExec

2021-05-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35329:


Assignee: Apache Spark

> Split generated switch code into pieces in ExpandExec
> -
>
> Key: SPARK-35329
> URL: https://issues.apache.org/jira/browse/SPARK-35329
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Takeshi Yamamuro
>Assignee: Apache Spark
>Priority: Minor
>
> This ticket aims at splitting the generated switch code into smaller pieces in 
> `ExpandExec`. In the current master, even a simple query like the one below 
> generates a large method whose size (`maxMethodCodeSize:7448`) is close to 
> `8000` (`CodeGenerator.DEFAULT_JVM_HUGE_METHOD_LIMIT`):
> {code:scala}
> scala> val df = Seq(("2016-03-27 19:39:34", 1, "a"), ("2016-03-27 19:39:56", 2, "a"), ("2016-03-27 19:39:27", 4, "b")).toDF("time", "value", "id")
> scala> val rdf = df.select(window($"time", "10 seconds", "3 seconds", "0 second"), $"value").orderBy($"window.start".asc, $"value".desc).select("value")
> scala> sql("SET spark.sql.adaptive.enabled=false")
> scala> import org.apache.spark.sql.execution.debug._
> scala> rdf.debugCodegen
> Found 2 WholeStageCodegen subtrees.
> == Subtree 1 / 2 (maxMethodCodeSize:7448; maxConstantPoolSize:189(0.29% used); numInnerClasses:0) ==
> 
> *(1) Project [window#34.start AS _gen_alias_39#39, value#11]
> +- *(1) Filter ((isnotnull(window#34) AND (cast(time#10 as timestamp) >= window#34.start)) AND (cast(time#10 as timestamp) < window#34.end))
>    +- *(1) Expand [List(named_struct(start, precisetimestampcon...
> {code}






[jira] [Commented] (SPARK-35329) Split generated switch code into pieces in ExpandExec

2021-05-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17340256#comment-17340256
 ] 

Apache Spark commented on SPARK-35329:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/32457

> Split generated switch code into pieces in ExpandExec
> -
>
> Key: SPARK-35329
> URL: https://issues.apache.org/jira/browse/SPARK-35329
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> This ticket aims at splitting the generated switch code into smaller pieces in 
> `ExpandExec`. In the current master, even a simple query like the one below 
> generates a large method whose size (`maxMethodCodeSize:7448`) is close to 
> `8000` (`CodeGenerator.DEFAULT_JVM_HUGE_METHOD_LIMIT`):
> {code:scala}
> scala> val df = Seq(("2016-03-27 19:39:34", 1, "a"), ("2016-03-27 19:39:56", 2, "a"), ("2016-03-27 19:39:27", 4, "b")).toDF("time", "value", "id")
> scala> val rdf = df.select(window($"time", "10 seconds", "3 seconds", "0 second"), $"value").orderBy($"window.start".asc, $"value".desc).select("value")
> scala> sql("SET spark.sql.adaptive.enabled=false")
> scala> import org.apache.spark.sql.execution.debug._
> scala> rdf.debugCodegen
> Found 2 WholeStageCodegen subtrees.
> == Subtree 1 / 2 (maxMethodCodeSize:7448; maxConstantPoolSize:189(0.29% used); numInnerClasses:0) ==
> 
> *(1) Project [window#34.start AS _gen_alias_39#39, value#11]
> +- *(1) Filter ((isnotnull(window#34) AND (cast(time#10 as timestamp) >= window#34.start)) AND (cast(time#10 as timestamp) < window#34.end))
>    +- *(1) Expand [List(named_struct(start, precisetimestampcon...
> {code}






[jira] [Created] (SPARK-35329) Split generated switch code into pieces in ExpandExec

2021-05-06 Thread Takeshi Yamamuro (Jira)
Takeshi Yamamuro created SPARK-35329:


 Summary: Split generated switch code into pieces in ExpandExec
 Key: SPARK-35329
 URL: https://issues.apache.org/jira/browse/SPARK-35329
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Takeshi Yamamuro


This ticket aims at splitting the generated switch code into smaller pieces in 
`ExpandExec`. In the current master, even a simple query like the one below 
generates a large method whose size (`maxMethodCodeSize:7448`) is close to 
`8000` (`CodeGenerator.DEFAULT_JVM_HUGE_METHOD_LIMIT`):
{code:scala}
scala> val df = Seq(("2016-03-27 19:39:34", 1, "a"), ("2016-03-27 19:39:56", 2, "a"), ("2016-03-27 19:39:27", 4, "b")).toDF("time", "value", "id")
scala> val rdf = df.select(window($"time", "10 seconds", "3 seconds", "0 second"), $"value").orderBy($"window.start".asc, $"value".desc).select("value")
scala> sql("SET spark.sql.adaptive.enabled=false")
scala> import org.apache.spark.sql.execution.debug._
scala> rdf.debugCodegen

Found 2 WholeStageCodegen subtrees.
== Subtree 1 / 2 (maxMethodCodeSize:7448; maxConstantPoolSize:189(0.29% used); numInnerClasses:0) ==

*(1) Project [window#34.start AS _gen_alias_39#39, value#11]
+- *(1) Filter ((isnotnull(window#34) AND (cast(time#10 as timestamp) >= window#34.start)) AND (cast(time#10 as timestamp) < window#34.end))
   +- *(1) Expand [List(named_struct(start, precisetimestampcon...
{code}
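The shape of the proposed split, illustrated in plain Scala rather than the Java source that CodeGenerator actually emits (all names below are hypothetical): instead of one huge method whose match arms inline all projection code, the match only dispatches to small per-case methods.

{code:scala}
import org.apache.spark.sql.Row

object SplitSwitchSketch {
  // After the split, each arm is its own method, so no single method
  // approaches the 8000-byte huge-method limit that stops JIT compilation.
  def expand(i: Int, input: Row): Row = i match {
    case 0 => project0(input)
    case 1 => project1(input)
    case _ => input
  }

  private def project0(input: Row): Row = input // stands in for case-0 projection code
  private def project1(input: Row): Row = input // stands in for case-1 projection code
}
{code}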






[jira] [Commented] (SPARK-35328) Use 'SPARK_LOG_URL_' as env prefix for getting driver log urls by default

2021-05-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17340236#comment-17340236
 ] 

Apache Spark commented on SPARK-35328:
--

User 'sharkdtu' has created a pull request for this issue:
https://github.com/apache/spark/pull/32456

> Use 'SPARK_LOG_URL_' as env prefix for getting driver log urls by default
> -
>
> Key: SPARK-35328
> URL: https://issues.apache.org/jira/browse/SPARK-35328
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.1.1
>Reporter: sharkd tu
>Priority: Major
>
> Currently, Spark on Kubernetes can't show log urls on the UI. To check 
> historical logs, we usually collect pod logs into third-party logging 
> services, which can be accessed via urls. To show log urls, we can set envs 
> prefixed with 'SPARK_LOG_URL_' for executors. But for the driver, there is no 
> way to show log urls by setting envs.
>  
> I will create a new PR that uses 'SPARK_LOG_URL_' as the env prefix for 
> getting driver log urls by default.
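For reference, the executor-side mechanism this proposal extends to the driver looks roughly like the following (a hedged sketch of the prefix handling in CoarseGrainedExecutorBackend, not a verbatim copy):

{code:scala}
// Env vars such as SPARK_LOG_URL_STDOUT / SPARK_LOG_URL_STDERR become UI log
// links: strip the prefix and lower-case the remainder to get the link label.
def extractLogUrls(env: Map[String, String]): Map[String, String] = {
  val prefix = "SPARK_LOG_URL_"
  env.filter { case (k, _) => k.toUpperCase.startsWith(prefix) }
    .map { case (k, v) => (k.substring(prefix.length).toLowerCase, v) }
}
{code}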






[jira] [Commented] (SPARK-35328) Use 'SPARK_LOG_URL_' as env prefix for getting driver log urls by default

2021-05-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17340238#comment-17340238
 ] 

Apache Spark commented on SPARK-35328:
--

User 'sharkdtu' has created a pull request for this issue:
https://github.com/apache/spark/pull/32456

> Use 'SPARK_LOG_URL_' as env prefix for getting driver log urls by default
> -
>
> Key: SPARK-35328
> URL: https://issues.apache.org/jira/browse/SPARK-35328
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.1.1
>Reporter: sharkd tu
>Priority: Major
>
> Currently, Spark on Kubernetes can't show log urls on the UI. To check 
> historical logs, we usually collect pod logs into third-party logging 
> services, which can be accessed via urls. To show log urls, we can set envs 
> prefixed with 'SPARK_LOG_URL_' for executors. But for the driver, there is no 
> way to show log urls by setting envs.
>  
> I will create a new PR that uses 'SPARK_LOG_URL_' as the env prefix for 
> getting driver log urls by default.






[jira] [Assigned] (SPARK-35328) Use 'SPARK_LOG_URL_' as env prefix for getting driver log urls by default

2021-05-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35328:


Assignee: (was: Apache Spark)

> Use 'SPARK_LOG_URL_' as env prefix for getting driver log urls by default
> -
>
> Key: SPARK-35328
> URL: https://issues.apache.org/jira/browse/SPARK-35328
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.1.1
>Reporter: sharkd tu
>Priority: Major
>
> Currently, Spark on Kubernetes can't show log urls on the UI. To check 
> historical logs, we usually collect pod logs into third-party logging 
> services, which can be accessed via urls. To show log urls, we can set envs 
> prefixed with 'SPARK_LOG_URL_' for executors. But for the driver, there is no 
> way to show log urls by setting envs.
>  
> I will create a new PR that uses 'SPARK_LOG_URL_' as the env prefix for 
> getting driver log urls by default.






[jira] [Assigned] (SPARK-35328) Use 'SPARK_LOG_URL_' as env prefix for getting driver log urls by default

2021-05-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35328:


Assignee: Apache Spark

> Use 'SPARK_LOG_URL_' as env prefix for getting driver log urls by default
> -
>
> Key: SPARK-35328
> URL: https://issues.apache.org/jira/browse/SPARK-35328
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.1.1
>Reporter: sharkd tu
>Assignee: Apache Spark
>Priority: Major
>
> Currently, Spark on Kubernetes can't show log urls on the UI. To check 
> historical logs, we usually collect pod logs into third-party logging 
> services, which can be accessed via urls. To show log urls, we can set envs 
> prefixed with 'SPARK_LOG_URL_' for executors. But for the driver, there is no 
> way to show log urls by setting envs.
>  
> I will create a new PR that uses 'SPARK_LOG_URL_' as the env prefix for 
> getting driver log urls by default.






[jira] [Created] (SPARK-35328) Use 'SPARK_LOG_URL_' as env prefix for getting driver log urls by default

2021-05-06 Thread sharkd tu (Jira)
sharkd tu created SPARK-35328:
-

 Summary: Use 'SPARK_LOG_URL_' as env prefix for getting driver log 
urls by default
 Key: SPARK-35328
 URL: https://issues.apache.org/jira/browse/SPARK-35328
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 3.1.1
Reporter: sharkd tu


Currently, Spark on Kubernetes can't show log urls on the UI. To check 
historical logs, we usually collect pod logs into third-party logging services, 
which can be accessed via urls. To show log urls, we can set envs prefixed with 
'SPARK_LOG_URL_' for executors. But for the driver, there is no way to show log 
urls by setting envs.

I will create a new PR that uses 'SPARK_LOG_URL_' as the env prefix for getting 
driver log urls by default.






[jira] [Assigned] (SPARK-35327) Merge similar v1.4/v2.7 TPCDS queries

2021-05-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35327:


Assignee: (was: Apache Spark)

> Merge similar v1.4/v2.7 TPCDS queries
> -
>
> Key: SPARK-35327
> URL: https://issues.apache.org/jira/browse/SPARK-35327
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.0.2, 3.1.1, 3.2.0
>Reporter: Takeshi Yamamuro
>Priority: Major
>
> This ticket aims at merging similar 
> v1.4 (`resources/tpcds`) / v2.7 (`resources/tpcds-v2.7.0`) TPCDS queries; it 
> copies 13 query files (q6,q11,q12,q20,q24,q34,q47,q57,q64,q74,q75,q78,q98) 
> from `resources/tpcds-v2.7.0` to `resources/tpcds`, and then removes the 
> files in `resources/tpcds-v2.7.0`.
> I saw `TPCDSQueryTestSuite` fail nondeterministically because output row 
> orders differed from those in the golden files. For example, the failure in 
> the GA job, 
> https://github.com/linhongliu-db/spark/runs/2507928605?check_suite_focus=true,
> happened because the `tpcds/q6.sql` query output rows were only sorted by 
> `cnt`:
> https://github.com/apache/spark/blob/a0c76a8755a148e2bd774edcda12fe20f2f38c75/sql/core/src/test/resources/tpcds/q6.sql#L20
> Actually, `tpcds/q6.sql` and `tpcds-v2.7.0/q6.sql` are almost the same; the 
> only difference is that `tpcds-v2.7.0/q6.sql` sorts by both `cnt` and 
> `a.ca_state`:
> https://github.com/apache/spark/blob/a0c76a8755a148e2bd774edcda12fe20f2f38c75/sql/core/src/test/resources/tpcds-v2.7.0/q6.sql#L22
> So, I think it's okay just to use `tpcds-v2.7.0/q6.sql` for stable testing in 
> this case.
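Why sorting by `cnt` alone is flaky, in a tiny self-contained illustration (made-up data; `ca_state` plays the tiebreaker role):

{code:scala}
// Rows tied on cnt may legally come back in any order from ORDER BY cnt,
// so golden-file comparisons can fail nondeterministically. Adding a
// tiebreaker column makes the ordering total and the output stable.
val rows = Seq(("TX", 10L), ("CA", 10L), ("NY", 7L)) // (ca_state, cnt)
val flaky  = rows.sortBy(_._2)                       // SQL guarantees nothing for the tie
val stable = rows.sortBy { case (state, cnt) => (cnt, state) } // (cnt, ca_state)
{code}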






[jira] [Assigned] (SPARK-35327) Merge similar v1.4/v2.7 TPCDS queries

2021-05-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35327:


Assignee: Apache Spark

> Merge similar v1.4/v2.7 TPCDS queries
> -
>
> Key: SPARK-35327
> URL: https://issues.apache.org/jira/browse/SPARK-35327
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.0.2, 3.1.1, 3.2.0
>Reporter: Takeshi Yamamuro
>Assignee: Apache Spark
>Priority: Major
>
> This ticket aims at merging similar 
> v1.4 (`resources/tpcds`) / v2.7 (`resources/tpcds-v2.7.0`) TPCDS queries; it 
> copies 13 query files (q6,q11,q12,q20,q24,q34,q47,q57,q64,q74,q75,q78,q98) 
> from `resources/tpcds-v2.7.0` to `resources/tpcds`, and then removes the 
> files in `resources/tpcds-v2.7.0`.
> I saw `TPCDSQueryTestSuite` fail nondeterministically because output row 
> orders differed from those in the golden files. For example, the failure in 
> the GA job, 
> https://github.com/linhongliu-db/spark/runs/2507928605?check_suite_focus=true,
> happened because the `tpcds/q6.sql` query output rows were only sorted by 
> `cnt`:
> https://github.com/apache/spark/blob/a0c76a8755a148e2bd774edcda12fe20f2f38c75/sql/core/src/test/resources/tpcds/q6.sql#L20
> Actually, `tpcds/q6.sql` and `tpcds-v2.7.0/q6.sql` are almost the same; the 
> only difference is that `tpcds-v2.7.0/q6.sql` sorts by both `cnt` and 
> `a.ca_state`:
> https://github.com/apache/spark/blob/a0c76a8755a148e2bd774edcda12fe20f2f38c75/sql/core/src/test/resources/tpcds-v2.7.0/q6.sql#L22
> So, I think it's okay just to use `tpcds-v2.7.0/q6.sql` for stable testing in 
> this case.






[jira] [Commented] (SPARK-35327) Merge similar v1.4/v2.7 TPCDS queries

2021-05-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17340227#comment-17340227
 ] 

Apache Spark commented on SPARK-35327:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/32454

> Merge similar v1.4/v2.7 TPCDS queries
> -
>
> Key: SPARK-35327
> URL: https://issues.apache.org/jira/browse/SPARK-35327
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.0.2, 3.1.1, 3.2.0
>Reporter: Takeshi Yamamuro
>Priority: Major
>
> This ticket aims at merging similar 
> v1.4 (`resources/tpcds`) / v2.7 (`resources/tpcds-v2.7.0`) TPCDS queries; it 
> copies 13 query files (q6,q11,q12,q20,q24,q34,q47,q57,q64,q74,q75,q78,q98) 
> from `resources/tpcds-v2.7.0` to `resources/tpcds`, and then removes the 
> files in `resources/tpcds-v2.7.0`.
> I saw `TPCDSQueryTestSuite` fail nondeterministically because output row 
> orders differed from those in the golden files. For example, the failure in 
> the GA job, 
> https://github.com/linhongliu-db/spark/runs/2507928605?check_suite_focus=true,
> happened because the `tpcds/q6.sql` query output rows were only sorted by 
> `cnt`:
> https://github.com/apache/spark/blob/a0c76a8755a148e2bd774edcda12fe20f2f38c75/sql/core/src/test/resources/tpcds/q6.sql#L20
> Actually, `tpcds/q6.sql` and `tpcds-v2.7.0/q6.sql` are almost the same; the 
> only difference is that `tpcds-v2.7.0/q6.sql` sorts by both `cnt` and 
> `a.ca_state`:
> https://github.com/apache/spark/blob/a0c76a8755a148e2bd774edcda12fe20f2f38c75/sql/core/src/test/resources/tpcds-v2.7.0/q6.sql#L22
> So, I think it's okay just to use `tpcds-v2.7.0/q6.sql` for stable testing in 
> this case.






[jira] [Commented] (SPARK-35327) Merge similar v1.4/v2.7 TPCDS queries

2021-05-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35327?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17340226#comment-17340226
 ] 

Apache Spark commented on SPARK-35327:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/32454

> Merge similar v1.4/v2.7 TPCDS queries
> -
>
> Key: SPARK-35327
> URL: https://issues.apache.org/jira/browse/SPARK-35327
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.0.2, 3.1.1, 3.2.0
>Reporter: Takeshi Yamamuro
>Priority: Major
>
> This ticket aims at merging similar 
> v1.4 (`resources/tpcds`) / v2.7 (`resources/tpcds-v2.7.0`) TPCDS queries; it 
> copies 13 query files (q6,q11,q12,q20,q24,q34,q47,q57,q64,q74,q75,q78,q98) 
> from `resources/tpcds-v2.7.0` to `resources/tpcds`, and then removes the 
> files in `resources/tpcds-v2.7.0`.
> I saw `TPCDSQueryTestSuite` fail nondeterministically because output row 
> orders differed from those in the golden files. For example, the failure in 
> the GA job, 
> https://github.com/linhongliu-db/spark/runs/2507928605?check_suite_focus=true,
> happened because the `tpcds/q6.sql` query output rows were only sorted by 
> `cnt`:
> https://github.com/apache/spark/blob/a0c76a8755a148e2bd774edcda12fe20f2f38c75/sql/core/src/test/resources/tpcds/q6.sql#L20
> Actually, `tpcds/q6.sql` and `tpcds-v2.7.0/q6.sql` are almost the same; the 
> only difference is that `tpcds-v2.7.0/q6.sql` sorts by both `cnt` and 
> `a.ca_state`:
> https://github.com/apache/spark/blob/a0c76a8755a148e2bd774edcda12fe20f2f38c75/sql/core/src/test/resources/tpcds-v2.7.0/q6.sql#L22
> So, I think it's okay just to use `tpcds-v2.7.0/q6.sql` for stable testing in 
> this case.






[jira] [Created] (SPARK-35327) Merge similar v1.4/v2.7 TPCDS queries

2021-05-06 Thread Takeshi Yamamuro (Jira)
Takeshi Yamamuro created SPARK-35327:


 Summary: Merge similar v1.4/v2.7 TPCDS queries
 Key: SPARK-35327
 URL: https://issues.apache.org/jira/browse/SPARK-35327
 Project: Spark
  Issue Type: Test
  Components: SQL, Tests
Affects Versions: 3.1.1, 3.0.2, 3.2.0
Reporter: Takeshi Yamamuro


This ticket aims at merging similar 
v1.4 (`resources/tpcds`) / v2.7 (`resources/tpcds-v2.7.0`) TPCDS queries; it 
copies 13 query files (q6,q11,q12,q20,q24,q34,q47,q57,q64,q74,q75,q78,q98) 
from `resources/tpcds-v2.7.0` to `resources/tpcds`, and then removes the files 
in `resources/tpcds-v2.7.0`.

I saw `TPCDSQueryTestSuite` fail nondeterministically because output row 
orders differed from those in the golden files. For example, the failure in 
the GA job, 
https://github.com/linhongliu-db/spark/runs/2507928605?check_suite_focus=true, 
happened because the `tpcds/q6.sql` query output rows were only sorted by `cnt`:

https://github.com/apache/spark/blob/a0c76a8755a148e2bd774edcda12fe20f2f38c75/sql/core/src/test/resources/tpcds/q6.sql#L20
Actually, `tpcds/q6.sql` and `tpcds-v2.7.0/q6.sql` are almost the same; the 
only difference is that `tpcds-v2.7.0/q6.sql` sorts by both `cnt` and `a.ca_state`:
https://github.com/apache/spark/blob/a0c76a8755a148e2bd774edcda12fe20f2f38c75/sql/core/src/test/resources/tpcds-v2.7.0/q6.sql#L22
So, I think it's okay just to use `tpcds-v2.7.0/q6.sql` for stable testing in 
this case.







[jira] [Resolved] (SPARK-34526) Skip checking glob path in FileStreamSink.hasMetadata

2021-05-06 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim resolved SPARK-34526.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31638
[https://github.com/apache/spark/pull/31638]

> Skip checking glob path in FileStreamSink.hasMetadata
> -
>
> Key: SPARK-34526
> URL: https://issues.apache.org/jira/browse/SPARK-34526
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Yuanjian Li
>Assignee: Yuanjian Li
>Priority: Major
> Fix For: 3.2.0
>
>
> When checking a glob path in {{FileStreamSink.hasMetadata}}, we should ignore 
> the error and assume the user wants to read a batch output, which keeps the 
> original behavior of ignoring such errors.






[jira] [Assigned] (SPARK-34526) Skip checking glob path in FileStreamSink.hasMetadata

2021-05-06 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim reassigned SPARK-34526:


Assignee: Yuanjian Li

> Skip checking glob path in FileStreamSink.hasMetadata
> -
>
> Key: SPARK-34526
> URL: https://issues.apache.org/jira/browse/SPARK-34526
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Yuanjian Li
>Assignee: Yuanjian Li
>Priority: Major
>
> When checking a glob path in {{FileStreamSink.hasMetadata}}, we should ignore 
> the error and assume the user wants to read a batch output, which keeps the 
> original behavior of ignoring such errors.






[jira] [Resolved] (SPARK-35215) Update custom metrics per certain rows

2021-05-06 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-35215.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32330
[https://github.com/apache/spark/pull/32330]

> Update custom metrics per certain rows
> --
>
> Key: SPARK-35215
> URL: https://issues.apache.org/jira/browse/SPARK-35215
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.2.0
>
>
> We should only update custom metrics once every certain number of rows (e.g. 
> every 100 rows). This helps reduce the performance impact on scans.
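A minimal sketch of the batching idea (the names are illustrative, not the actual data source v2 metric API):

{code:scala}
// Accumulate metric updates locally and flush every `flushEvery` rows, so the
// per-row hot path is just a counter increment and a comparison.
class BatchedMetricUpdater(flushEvery: Int = 100)(flush: Long => Unit) {
  private var pending = 0L

  def recordRow(): Unit = {
    pending += 1
    if (pending >= flushEvery) { flush(pending); pending = 0 }
  }

  // Call once at the end of the scan so the tail below flushEvery is not lost.
  def finish(): Unit = if (pending > 0) { flush(pending); pending = 0 }
}
{code}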






[jira] [Commented] (SPARK-35326) Upgrade Jersey to 2.34

2021-05-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17340197#comment-17340197
 ] 

Apache Spark commented on SPARK-35326:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/32453

> Upgrade Jersey to 2.34
> --
>
> Key: SPARK-35326
> URL: https://issues.apache.org/jira/browse/SPARK-35326
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.0.2, 3.1.1, 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> CVE-2021-28168, a local information disclosure vulnerability, has been reported.
> Spark 3.1.1, 3.0.2, and 3.2.0 use the affected version 2.30.
> https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-28168






[jira] [Commented] (SPARK-35326) Upgrade Jersey to 2.34

2021-05-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17340196#comment-17340196
 ] 

Apache Spark commented on SPARK-35326:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/32453

> Upgrade Jersey to 2.34
> --
>
> Key: SPARK-35326
> URL: https://issues.apache.org/jira/browse/SPARK-35326
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.0.2, 3.1.1, 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> CVE-2021-28168, a local information disclosure vulnerability, has been reported.
> Spark 3.1.1, 3.0.2, and 3.2.0 use the affected version 2.30.
> https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-28168






[jira] [Assigned] (SPARK-35326) Upgrade Jersey to 2.34

2021-05-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35326:


Assignee: Apache Spark  (was: Kousuke Saruta)

> Upgrade Jersey to 2.34
> --
>
> Key: SPARK-35326
> URL: https://issues.apache.org/jira/browse/SPARK-35326
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.0.2, 3.1.1, 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Apache Spark
>Priority: Minor
>
> CVE-2021-28168, a local information disclosure vulnerability, has been reported.
> Spark 3.1.1, 3.0.2, and 3.2.0 use the affected version 2.30.
> https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-28168






[jira] [Assigned] (SPARK-35326) Upgrade Jersey to 2.34

2021-05-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35326:


Assignee: Kousuke Saruta  (was: Apache Spark)

> Upgrade Jersey to 2.34
> --
>
> Key: SPARK-35326
> URL: https://issues.apache.org/jira/browse/SPARK-35326
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.0.2, 3.1.1, 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> CVE-2021-28168, a local information disclosure vulnerability, has been reported.
> Spark 3.1.1, 3.0.2, and 3.2.0 use the affected version 2.30.
> https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-28168






[jira] [Created] (SPARK-35326) Upgrade Jersey to 2.34

2021-05-06 Thread Kousuke Saruta (Jira)
Kousuke Saruta created SPARK-35326:
--

 Summary: Upgrade Jersey to 2.34
 Key: SPARK-35326
 URL: https://issues.apache.org/jira/browse/SPARK-35326
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 3.1.1, 3.0.2, 3.2.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta


CVE-2021-28168, a local information disclosure vulnerability, has been reported.
Spark 3.1.1, 3.0.2, and 3.2.0 use the affected version 2.30.
https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-28168






[jira] [Updated] (SPARK-35160) Spark application submitted despite failing to get Hive delegation token

2021-05-06 Thread Manu Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Manu Zhang updated SPARK-35160:
---
Description: 
Currently, when running on YARN and failing to get a Hive delegation token, a 
Spark SQL application will still be submitted. Eventually, the application will 
fail when connecting to the Hive metastore without a valid delegation token. 

Is there any reason for this design?

cc [~jerryshao], who originally implemented this in 
https://issues.apache.org/jira/browse/SPARK-14743

I'd propose to fail immediately, like HadoopFSDelegationTokenProvider used to.

 

Update:

After [https://github.com/apache/spark/pull/23418], 
HadoopFSDelegationTokenProvider no longer fails on non-fatal exceptions. 
However, the author changed the behavior just to keep it consistent with the 
other providers. 

  was:
Currently, when running on YARN and failing to get Hive delegation token, a 
Spark SQL application will still be submitted. Eventually, the application will 
fail on connecting to Hive metastore without a valid delegation token. 

Is there any reason for this design ?

cc [~jerryshao] who originally implemented this in 
https://issues.apache.org/jira/browse/SPARK-14743

I'd propose to fail immediately like HadoopFSDelegationTokenProvider.


> Spark application submitted despite failing to get Hive delegation token
> 
>
> Key: SPARK-35160
> URL: https://issues.apache.org/jira/browse/SPARK-35160
> Project: Spark
>  Issue Type: Improvement
>  Components: Security
>Affects Versions: 3.1.1
>Reporter: Manu Zhang
>Priority: Major
>
> Currently, when running on YARN and failing to get a Hive delegation token, a 
> Spark SQL application will still be submitted. Eventually, the application 
> will fail when connecting to the Hive metastore without a valid delegation 
> token. 
> Is there any reason for this design?
> cc [~jerryshao], who originally implemented this in 
> https://issues.apache.org/jira/browse/SPARK-14743
> I'd propose to fail immediately, like HadoopFSDelegationTokenProvider used to.
>  
> Update:
> After [https://github.com/apache/spark/pull/23418], 
> HadoopFSDelegationTokenProvider no longer fails on non-fatal exceptions. 
> However, the author changed the behavior just to keep it consistent with the 
> other providers. 
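A sketch of the proposed fail-fast shape (hypothetical names; `obtainHiveToken` stands in for the provider's token-fetch call, which today only logs a warning on failure):

{code:scala}
import scala.util.control.NonFatal
import org.apache.spark.SparkException

// Surface the failure at submission time instead of swallowing it and letting
// the application die later when it first talks to the metastore.
def obtainTokenOrFail(obtainHiveToken: () => Unit): Unit =
  try obtainHiveToken()
  catch {
    case NonFatal(e) =>
      throw new SparkException("Failed to get Hive delegation token; failing fast", e)
  }
{code}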






[jira] [Updated] (SPARK-35273) CombineFilters support non-deterministic expressions

2021-05-06 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-35273:

Description: 
For example:
{code:scala}
spark.sql("create table t1(id int) using parquet")
spark.sql("select * from (select * from t1 where id not in (1, 3, 6)) t where 
id = 7 and rand() <= 0.01").explain("cost")
{code}

Current:
{noformat}
== Optimized Logical Plan ==
Filter (isnotnull(id#0) AND ((id#0 = 7) AND (rand(-639771619343876662) <= 
0.01))), Statistics(sizeInBytes=1.0 B)
+- Filter NOT id#0 IN (1,3,6), Statistics(sizeInBytes=1.0 B)
   +- Relation default.t1[id#0] parquet, Statistics(sizeInBytes=0.0 B)
{noformat}

Expected:
{noformat}
== Optimized Logical Plan ==
Filter (rand(-639771619343876662) <= 0.01), Statistics(sizeInBytes=1.0 B)
+- Filter ((NOT id#0 IN (1,3,6) AND isnotnull(id#0)) AND (id#0 = 7)), 
Statistics(sizeInBytes=1.0 B)
   +- Relation default.t1[id#0] parquet, Statistics(sizeInBytes=0.0 B)
{noformat}


Another example:
{code:scala}
spark.sql("create table t1(id int) using parquet")
spark.sql("create view v1 as select * from t1 where id not in (1, 3, 6)")
spark.sql("select * from v1 where id = 7 and rand() <= 0.01").explain("cost")
{code}


  was:
For example:
{code:scala}
spark.sql("create table t1(id int) using parquet")
spark.sql("select * from (select * from t1 where id not in (1, 3, 6)) t where 
id = 7 and rand() <= 0.01").explain("cost")
{code}

Current:
{noformat}
== Optimized Logical Plan ==
Filter (isnotnull(id#0) AND ((id#0 = 7) AND (rand(-639771619343876662) <= 
0.01))), Statistics(sizeInBytes=1.0 B)
+- Filter NOT id#0 IN (1,3,6), Statistics(sizeInBytes=1.0 B)
   +- Relation default.t1[id#0] parquet, Statistics(sizeInBytes=0.0 B)
{noformat}

Expected:
{noformat}
== Optimized Logical Plan ==
Filter (isnotnull(id#0) AND (NOT id#0 IN (1,3,6) AND ((id#0 = 7) AND 
(rand(-1485510186481201685) <= 0.01, Statistics(sizeInBytes=1.0 B)
+- Relation default.t1[id#0] parquet, Statistics(sizeInBytes=0.0 B)
{noformat}


Another example:
{code:scala}
spark.sql("create table t1(id int) using parquet")
spark.sql("create view v1 as select * from t1 where id not in (1, 3, 6)")
spark.sql("select * from v1 where id = 7 and rand() <= 0.01").explain("cost")
{code}



> CombineFilters support non-deterministic expressions
> 
>
> Key: SPARK-35273
> URL: https://issues.apache.org/jira/browse/SPARK-35273
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.2.0
>
>
> For example:
> {code:scala}
> spark.sql("create table t1(id int) using parquet")
> spark.sql("select * from (select * from t1 where id not in (1, 3, 6)) t where 
> id = 7 and rand() <= 0.01").explain("cost")
> {code}
> Current:
> {noformat}
> == Optimized Logical Plan ==
> Filter (isnotnull(id#0) AND ((id#0 = 7) AND (rand(-639771619343876662) <= 
> 0.01))), Statistics(sizeInBytes=1.0 B)
> +- Filter NOT id#0 IN (1,3,6), Statistics(sizeInBytes=1.0 B)
>+- Relation default.t1[id#0] parquet, Statistics(sizeInBytes=0.0 B)
> {noformat}
> Expected:
> {noformat}
> == Optimized Logical Plan ==
> Filter (rand(-639771619343876662) <= 0.01), Statistics(sizeInBytes=1.0 B)
> +- Filter ((NOT id#0 IN (1,3,6) AND isnotnull(id#0)) AND (id#0 = 7)), 
> Statistics(sizeInBytes=1.0 B)
>    +- Relation default.t1[id#0] parquet, Statistics(sizeInBytes=0.0 B)
> {noformat}
> Another example:
> {code:scala}
> spark.sql("create table t1(id int) using parquet")
> spark.sql("create view v1 as select * from t1 where id not in (1, 3, 6)")
> spark.sql("select * from v1 where id = 7 and rand() <= 0.01").explain("cost")
> {code}






[jira] [Commented] (SPARK-35243) Support columnar execution on ANSI interval types

2021-05-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17340119#comment-17340119
 ] 

Apache Spark commented on SPARK-35243:
--

User 'Peng-Lei' has created a pull request for this issue:
https://github.com/apache/spark/pull/32452

> Support columnar execution on ANSI interval types
> -
>
> Key: SPARK-35243
> URL: https://issues.apache.org/jira/browse/SPARK-35243
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Priority: Major
>
> See SPARK-30066 as a reference implementation for CalendarIntervalType.






[jira] [Assigned] (SPARK-35243) Support columnar execution on ANSI interval types

2021-05-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35243:


Assignee: (was: Apache Spark)

> Support columnar execution on ANSI interval types
> -
>
> Key: SPARK-35243
> URL: https://issues.apache.org/jira/browse/SPARK-35243
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Priority: Major
>
> See SPARK-30066 as a reference implementation for CalendarIntervalType.






[jira] [Assigned] (SPARK-35243) Support columnar execution on ANSI interval types

2021-05-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35243:


Assignee: Apache Spark

> Support columnar execution on ANSI interval types
> -
>
> Key: SPARK-35243
> URL: https://issues.apache.org/jira/browse/SPARK-35243
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>
> See SPARK-30066 as a reference implementation for CalendarIntervalType.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-32007) Spark Driver Supervise does not work reliably

2021-05-06 Thread Alexandre CLEMENT (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17340103#comment-17340103
 ] 

Alexandre CLEMENT edited comment on SPARK-32007 at 5/6/21, 9:07 AM:


{{[~EveLiao] Here are the logs after I manually killed an EC2 instance of the 
cluster.}}

2021-05-06 08:52:23,420 Driver submitted 
org.apache.spark.deploy.worker.DriverWrapper
 2021-05-06 08:52:23,420 Launching driver driver-20210506085223-0195 on worker 
worker-20210506084657-10.0.16.127-42291
 2021-05-06 08:52:26,891 Registering app ml4ra-sopra2019 
 2021-05-06 08:52:26,891 Registered app ml4ra-sopra2019 with ID 
app-20210506085226-0192
 2021-05-06 08:52:27,813 Application app-20210506085226-0192 requested to set 
total executors to 0.
 2021-05-06 08:52:36,716 Application app-20210506085226-0192 requested to set 
total executors to 1.
 2021-05-06 08:52:36,716 Launching executor app-20210506085226-0192/0 on worker 
worker-20210506084657-10.0.16.125-45495
 2021-05-06 08:52:43,541 Application app-20210506085226-0192 requested to set 
total executors to 0.
 2021-05-06 08:52:44,648 Application app-20210506085226-0192 requested to set 
total executors to 2.
 2021-05-06 08:52:44,648 Launching executor app-20210506085226-0192/1 on worker 
worker-20210506084657-10.0.16.127-42291
 2021-05-06 08:52:45,654 Application app-20210506085226-0192 requested to set 
total executors to 3.
 2021-05-06 08:52:45,654 Launching executor app-20210506085226-0192/2 on worker 
worker-20210506084657-10.0.16.125-45495
 2021-05-06 08:52:46,660 Application app-20210506085226-0192 requested to set 
total executors to 4.
 2021-05-06 08:52:51,975 Application app-20210506085226-0192 requested to set 
total executors to 3.
 2021-05-06 08:52:53,783 Application app-20210506085226-0192 requested to set 
total executors to 2.
 2021-05-06 08:52:57,796 Application app-20210506085226-0192 requested to set 
total executors to 1.
 2021-05-06 08:52:59,503 Application app-20210506085226-0192 requested to set 
total executors to 0.
 2021-05-06 08:53:24,922 Received unregister request from application 
app-20210506085226-0192
 2021-05-06 08:53:24,922 Removing app app-20210506085226-0192
 2021-05-06 08:53:25,026 10.0.16.127:56846 got disassociated, removing it.
 2021-05-06 08:53:25,026 0b1cb263dc91:37659 got disassociated, removing it.
 2021-05-06 08:53:25,479 Removing driver: driver-20210506085223-0195
 2021-05-06 08:53:25,737 10.0.16.130:43694 got disassociated, removing it.
 2021-05-06 08:53:25,737 10.0.16.125:45495 got disassociated, removing it.
 2021-05-06 08:53:25,737 Removing worker 
worker-20210506084657-10.0.16.125-45495 on 10.0.16.125:45495
 2021-05-06 08:53:25,737 Telling app of lost worker: 
worker-20210506084657-10.0.16.125-45495
 2021-05-06 08:54:23,305 WARN master.Master: Removing 
worker-20210506084657-10.0.16.127-42291 because we got no heartbeat in 60 
seconds
 2021-05-06 08:54:23,305 Removing worker 
worker-20210506084657-10.0.16.127-42291 on 10.0.16.127:42291
 2021-05-06 08:54:23,305 Telling app of lost worker: 
worker-20210506084657-10.0.16.127-42291


was (Author: apclement):
{{[~EveLiao] Here are the logs after I manually killed an EC2 instance of the 
cluster.}}

{{ 1.uf5h6wl5y4b0@spark-dev-master | 2021-05-06 08:52:23,420 INFO 
master.Master: Driver submitted org.apache.spark.deploy.worker.DriverWrapper}}
{{spark_master.1.uf5h6wl5y4b0@spark-dev-master | 2021-05-06 08:52:23,420 INFO 
master.Master: Launching driver driver-20210506085223-0195 on worker 
worker-20210506084657-10.0.16.127-42291}}
{{spark_master.1.uf5h6wl5y4b0@spark-dev-master | 2021-05-06 08:52:26,891 INFO 
master.Master: Registering app ml4ra-sopra2019 }}
{{spark_master.1.uf5h6wl5y4b0@spark-dev-master | 2021-05-06 08:52:26,891 INFO 
master.Master: Registered app ml4ra-sopra2019 with ID app-20210506085226-0192}}
{{spark_master.1.uf5h6wl5y4b0@spark-dev-master | 2021-05-06 08:52:27,813 INFO 
master.Master: Application app-20210506085226-0192 requested to set total 
executors to 0.}}
{{spark_master.1.uf5h6wl5y4b0@spark-dev-master | 2021-05-06 08:52:36,716 INFO 
master.Master: Application app-20210506085226-0192 requested to set total 
executors to 1.}}
{{spark_master.1.uf5h6wl5y4b0@spark-dev-master | 2021-05-06 08:52:36,716 INFO 
master.Master: Launching executor app-20210506085226-0192/0 on worker 
worker-20210506084657-10.0.16.125-45495}}
{{spark_master.1.uf5h6wl5y4b0@spark-dev-master | 2021-05-06 08:52:43,541 INFO 
master.Master: Application app-20210506085226-0192 requested to set total 
executors to 0.}}
{{spark_master.1.uf5h6wl5y4b0@spark-dev-master | 2021-05-06 08:52:44,648 INFO 
master.Master: Application app-20210506085226-0192 requested to set total 
executors to 2.}}
{{spark_master.1.uf5h6wl5y4b0@spark-dev-master | 2021-05-06 08:52:44,648 INFO 
master.Master: Launching executor app-20210506085226-0192/1 on worker 
worker-20210506084657-10.0.16.127-42291}}

[jira] [Commented] (SPARK-32007) Spark Driver Supervise does not work reliably

2021-05-06 Thread Alexandre CLEMENT (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17340103#comment-17340103
 ] 

Alexandre CLEMENT commented on SPARK-32007:
---

{{[~EveLiao] Here are the logs after I manually killed an EC2 instance of the 
cluster.}}

{{ 1.uf5h6wl5y4b0@spark-dev-master | 2021-05-06 08:52:23,420 INFO 
master.Master: Driver submitted org.apache.spark.deploy.worker.DriverWrapper}}
{{spark_master.1.uf5h6wl5y4b0@spark-dev-master | 2021-05-06 08:52:23,420 INFO 
master.Master: Launching driver driver-20210506085223-0195 on worker 
worker-20210506084657-10.0.16.127-42291}}
{{spark_master.1.uf5h6wl5y4b0@spark-dev-master | 2021-05-06 08:52:26,891 INFO 
master.Master: Registering app ml4ra-sopra2019 }}
{{spark_master.1.uf5h6wl5y4b0@spark-dev-master | 2021-05-06 08:52:26,891 INFO 
master.Master: Registered app ml4ra-sopra2019 with ID app-20210506085226-0192}}
{{spark_master.1.uf5h6wl5y4b0@spark-dev-master | 2021-05-06 08:52:27,813 INFO 
master.Master: Application app-20210506085226-0192 requested to set total 
executors to 0.}}
{{spark_master.1.uf5h6wl5y4b0@spark-dev-master | 2021-05-06 08:52:36,716 INFO 
master.Master: Application app-20210506085226-0192 requested to set total 
executors to 1.}}
{{spark_master.1.uf5h6wl5y4b0@spark-dev-master | 2021-05-06 08:52:36,716 INFO 
master.Master: Launching executor app-20210506085226-0192/0 on worker 
worker-20210506084657-10.0.16.125-45495}}
{{spark_master.1.uf5h6wl5y4b0@spark-dev-master | 2021-05-06 08:52:43,541 INFO 
master.Master: Application app-20210506085226-0192 requested to set total 
executors to 0.}}
{{spark_master.1.uf5h6wl5y4b0@spark-dev-master | 2021-05-06 08:52:44,648 INFO 
master.Master: Application app-20210506085226-0192 requested to set total 
executors to 2.}}
{{spark_master.1.uf5h6wl5y4b0@spark-dev-master | 2021-05-06 08:52:44,648 INFO 
master.Master: Launching executor app-20210506085226-0192/1 on worker 
worker-20210506084657-10.0.16.127-42291}}
{{spark_master.1.uf5h6wl5y4b0@spark-dev-master | 2021-05-06 08:52:45,654 INFO 
master.Master: Application app-20210506085226-0192 requested to set total 
executors to 3.}}
{{spark_master.1.uf5h6wl5y4b0@spark-dev-master | 2021-05-06 08:52:45,654 INFO 
master.Master: Launching executor app-20210506085226-0192/2 on worker 
worker-20210506084657-10.0.16.125-45495}}
{{spark_master.1.uf5h6wl5y4b0@spark-dev-master | 2021-05-06 08:52:46,660 INFO 
master.Master: Application app-20210506085226-0192 requested to set total 
executors to 4.}}
{{spark_master.1.uf5h6wl5y4b0@spark-dev-master | 2021-05-06 08:52:51,975 INFO 
master.Master: Application app-20210506085226-0192 requested to set total 
executors to 3.}}
{{spark_master.1.uf5h6wl5y4b0@spark-dev-master | 2021-05-06 08:52:53,783 INFO 
master.Master: Application app-20210506085226-0192 requested to set total 
executors to 2.}}
{{spark_master.1.uf5h6wl5y4b0@spark-dev-master | 2021-05-06 08:52:57,796 INFO 
master.Master: Application app-20210506085226-0192 requested to set total 
executors to 1.}}
{{spark_master.1.uf5h6wl5y4b0@spark-dev-master | 2021-05-06 08:52:59,503 INFO 
master.Master: Application app-20210506085226-0192 requested to set total 
executors to 0.}}
{{spark_master.1.uf5h6wl5y4b0@spark-dev-master | 2021-05-06 08:53:24,922 INFO 
master.Master: Received unregister request from application 
app-20210506085226-0192}}
{{spark_master.1.uf5h6wl5y4b0@spark-dev-master | 2021-05-06 08:53:24,922 INFO 
master.Master: Removing app app-20210506085226-0192}}
{{spark_master.1.uf5h6wl5y4b0@spark-dev-master | 2021-05-06 08:53:25,026 INFO 
master.Master: 10.0.16.127:56846 got disassociated, removing it.}}
{{spark_master.1.uf5h6wl5y4b0@spark-dev-master | 2021-05-06 08:53:25,026 INFO 
master.Master: 0b1cb263dc91:37659 got disassociated, removing it.}}
{{spark_master.1.uf5h6wl5y4b0@spark-dev-master | 2021-05-06 08:53:25,479 INFO 
master.Master: Removing driver: driver-20210506085223-0195}}
{{spark_master.1.uf5h6wl5y4b0@spark-dev-master | 2021-05-06 08:53:25,737 INFO 
master.Master: 10.0.16.130:43694 got disassociated, removing it.}}
{{spark_master.1.uf5h6wl5y4b0@spark-dev-master | 2021-05-06 08:53:25,737 INFO 
master.Master: 10.0.16.125:45495 got disassociated, removing it.}}
{{spark_master.1.uf5h6wl5y4b0@spark-dev-master | 2021-05-06 08:53:25,737 INFO 
master.Master: Removing worker worker-20210506084657-10.0.16.125-45495 on 
10.0.16.125:45495}}
{{spark_master.1.uf5h6wl5y4b0@spark-dev-master | 2021-05-06 08:53:25,737 INFO 
master.Master: Telling app of lost worker: 
worker-20210506084657-10.0.16.125-45495}}
{{spark_master.1.uf5h6wl5y4b0@spark-dev-master | 2021-05-06 08:54:23,305 WARN 
master.Master: Removing worker-20210506084657-10.0.16.127-42291 because we got 
no heartbeat in 60 seconds}}
{{spark_master.1.uf5h6wl5y4b0@spark-dev-master | 2021-05-06 08:54:23,305 INFO 
master.Master: Removing worker worker-20210506084657-10.0.16.127-42291 on 
10.0.16.127:42291}}

[jira] [Resolved] (SPARK-35240) Use CheckpointFileManager for checkpoint manipulation

2021-05-06 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh resolved SPARK-35240.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32361
[https://github.com/apache/spark/pull/32361]

> Use CheckpointFileManager for checkpoint manipulation
> -
>
> Key: SPARK-35240
> URL: https://issues.apache.org/jira/browse/SPARK-35240
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.2.0
>
>
> `CheckpointFileManager` is designed to handle checkpoint file manipulation. 
> However, there are a few places that expose a FileSystem obtained from 
> checkpoint files/paths. We should use `CheckpointFileManager` for all 
> checkpoint file manipulation. For example, we may want to use a dedicated 
> storage system for checkpoint files. If all checkpoint file manipulation is 
> performed through `CheckpointFileManager`, we only need to implement 
> `CheckpointFileManager` for that storage system, without having to implement 
> the FileSystem API for it.
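
As a rough illustration of the benefit, consider a hypothetical, deliberately 
simplified manager (this is not the actual `CheckpointFileManager` trait, 
which has a richer API):

{code:scala}
import java.io.{InputStream, OutputStream}

// Hypothetical sketch: if all checkpoint I/O goes through an interface like
// this, a custom storage backend only needs to implement these few methods
// instead of the whole Hadoop FileSystem API.
trait SimpleCheckpointFileManager {
  def createAtomic(path: String): OutputStream // all-or-nothing write
  def open(path: String): InputStream
  def exists(path: String): Boolean
  def delete(path: String): Unit
}
{code}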



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35318) View internal properties should be hidden for describe table command

2021-05-06 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-35318.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32441
[https://github.com/apache/spark/pull/32441]

> View internal properties should be hidden for describe table command
> 
>
> Key: SPARK-35318
> URL: https://issues.apache.org/jira/browse/SPARK-35318
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Linhong Liu
>Assignee: Linhong Liu
>Priority: Major
> Fix For: 3.2.0
>
>
> When creating a view, Spark saves some internal properties as table 
> properties. These should not be displayed by the DESCRIBE TABLE command, 
> because they should be transparent to the end user.
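
A minimal sketch of the idea; the property key prefix below is an assumption 
made for illustration, not Spark's actual internal key set:

{code:scala}
// Hypothetical sketch: drop internal view properties before rendering the
// DESCRIBE TABLE output. The "view." prefix is invented for the example.
def visibleProperties(props: Map[String, String]): Map[String, String] =
  props.filterNot { case (key, _) => key.startsWith("view.") }
{code}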



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35318) View internal properties should be hidden for describe table command

2021-05-06 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-35318:
---

Assignee: Linhong Liu

> View internal properties should be hidden for describe table command
> 
>
> Key: SPARK-35318
> URL: https://issues.apache.org/jira/browse/SPARK-35318
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Linhong Liu
>Assignee: Linhong Liu
>Priority: Major
>
> When creating a view, Spark saves some internal properties as table 
> properties. These should not be displayed by the DESCRIBE TABLE command, 
> because they should be transparent to the end user.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35144) Migrate to transformWithPruning or resolveWithPruning for object rules

2021-05-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35144:


Assignee: (was: Apache Spark)

> Migrate to transformWithPruning or resolveWithPruning for object rules
> --
>
> Key: SPARK-35144
> URL: https://issues.apache.org/jira/browse/SPARK-35144
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 3.1.0
>Reporter: Yingyi Bu
>Priority: Major
>
> Rules in org/apache/spark/sql/catalyst/optimizer/objects.scala
>  
> Commit 
> [https://github.com/apache/spark/commit/3db8ec258c4a8438bda73c26fc7b1eb6f9d51631]
>  contains the framework level change and a few example rule changes.
>  
> Example patterns:
> [https://github.com/apache/spark/blob/3db8ec258c4a8438bda73c26fc7b1eb6f9d51631/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreePatterns.scala#L24-L32]
>  
> Example rule:
> [https://github.com/apache/spark/blob/3db8ec258c4a8438bda73c26fc7b1eb6f9d51631/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala]
>  
> [https://github.com/apache/spark/pull/32247] is another example
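
For readers unfamiliar with the pattern, here is a hypothetical rule, invented 
for illustration and not one of the object rules this ticket covers, before 
and after the migration:

{code:scala}
import org.apache.spark.sql.catalyst.expressions.Not
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.catalyst.trees.TreePattern.NOT

// Before the migration the rule would visit every node:
//   plan.transformAllExpressions { case Not(Not(e)) => e }
// After the migration the traversal is gated on a tree-pattern bit, so
// subtrees that cannot contain a Not are skipped without being visited.
object SimplifyDoubleNegation extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan =
    plan.transformAllExpressionsWithPruning(_.containsPattern(NOT)) {
      case Not(Not(e)) => e
    }
}
{code}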



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35144) Migrate to transformWithPruning or resolveWithPruning for object rules

2021-05-06 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17340042#comment-17340042
 ] 

Apache Spark commented on SPARK-35144:
--

User 'sigmod' has created a pull request for this issue:
https://github.com/apache/spark/pull/32451

> Migrate to transformWithPruning or resolveWithPruning for object rules
> --
>
> Key: SPARK-35144
> URL: https://issues.apache.org/jira/browse/SPARK-35144
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 3.1.0
>Reporter: Yingyi Bu
>Priority: Major
>
> Rules in org/apache/spark/sql/catalyst/optimizer/objects.scala
>  
> Commit 
> [https://github.com/apache/spark/commit/3db8ec258c4a8438bda73c26fc7b1eb6f9d51631]
>  contains the framework level change and a few example rule changes.
>  
> Example patterns:
> [https://github.com/apache/spark/blob/3db8ec258c4a8438bda73c26fc7b1eb6f9d51631/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreePatterns.scala#L24-L32]
>  
> Example rule:
> [https://github.com/apache/spark/blob/3db8ec258c4a8438bda73c26fc7b1eb6f9d51631/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala]
>  
> [https://github.com/apache/spark/pull/32247] is another example



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35144) Migrate to transformWithPruning or resolveWithPruning for object rules

2021-05-06 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35144:


Assignee: Apache Spark

> Migrate to transformWithPruning or resolveWithPruning for object rules
> --
>
> Key: SPARK-35144
> URL: https://issues.apache.org/jira/browse/SPARK-35144
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 3.1.0
>Reporter: Yingyi Bu
>Assignee: Apache Spark
>Priority: Major
>
> Rules in org/apache/spark/sql/catalyst/optimizer/objects.scala
>  
> Commit 
> [https://github.com/apache/spark/commit/3db8ec258c4a8438bda73c26fc7b1eb6f9d51631]
>  contains the framework level change and a few example rule changes.
>  
> Example patterns:
> [https://github.com/apache/spark/blob/3db8ec258c4a8438bda73c26fc7b1eb6f9d51631/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreePatterns.scala#L24-L32]
>  
> Example rule:
> [https://github.com/apache/spark/blob/3db8ec258c4a8438bda73c26fc7b1eb6f9d51631/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala]
>  
> [https://github.com/apache/spark/pull/32247] is another example



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34204) When use input_file_name() func all column from file appeared in physical plan of query, not only projection.

2021-05-06 Thread Nick Hryhoriev (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34204?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17340038#comment-17340038
 ] 

Nick Hryhoriev commented on SPARK-34204:


I wrote some simple code to avoid it; it's a little bit hacky, but it still 
works for me.


{code:java}
import org.apache.spark.TaskContext
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.execution.datasources.FilePartition
import org.apache.spark.sql.types.StringType

implicit class EnrichWithFilePathAndModificationTime(df: DataFrame) {

  def withFilePath(fileColumn: String)(implicit spark: SparkSession): DataFrame = {
    // Map each partition index to the single file that partition reads.
    val existingFilesByPartition = df.rdd
      .partitions
      .map {
        case partition: FilePartition =>
          assert(partition.files.length == 1) // Spark must be configured to read one file per partition.
          partition.index -> partition.files.head.filePath
      }.toMap

    // Encoder for the input schema extended with the new file-path column.
    val partIdRowEncoder = RowEncoder.apply(
      df.schema
        .add(fileColumn, StringType)
    )
    // Append each partition's file path to every row of that partition.
    df.mapPartitions { it =>
      val sparkPartitionId = TaskContext.get().partitionId()
      val file = existingFilesByPartition(sparkPartitionId)
      it.map(r => Row.fromSeq(r.toSeq ++ Seq(file)))
    }(partIdRowEncoder)
  }

}
{code}
This code snippet works only together with
{code:java}
// Do not change: our custom logic requires exactly one file per Spark partition.
.set("spark.sql.files.openCostInBytes", Int.MaxValue.toString){code}
which may not suit use cases with a very large number of small files.
Please advise if anyone knows a better way.
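
For reference, usage of the helper above would look like this (the path is 
invented for the example):

{code:scala}
import org.apache.spark.sql.SparkSession

// Hypothetical usage of withFilePath from the snippet above:
implicit val spark: SparkSession = SparkSession.builder().master("local").getOrCreate()
val withPath = spark.read.parquet("/path/to/data.parquet").withFilePath("source_file")
{code}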



 

> When use input_file_name() func all column from file appeared in physical 
> plan of query, not only projection.
> -
>
> Key: SPARK-34204
> URL: https://issues.apache.org/jira/browse/SPARK-34204
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.7
>Reporter: Nick Hryhoriev
>Priority: Major
>
> The input_file_name() function breaks projection pruning in the physical 
> plan of the query.
>  If this function is used to add a new column, column-oriented formats like 
> parquet and orc put all columns into the physical plan, while without it 
> only the selected columns are read.
>  In my case, the performance impact is 30x.
> {code:java}
> import org.apache.spark.sql.SparkSession
> import org.apache.spark.sql.functions._
> object TestSize {
>   def main(args: Array[String]): Unit = {
> implicit val spark: SparkSession = SparkSession.builder()
>   .master("local")
>   .config("spark.sql.shuffle.partitions", "5")
>   .getOrCreate()
> import spark.implicits._
> val query1 = spark.read.parquet(
>   "s3a://part-00040-a19f0d20-eab3-48ef-be5a-602c7f9a8e58.c000.gz.parquet"
> )
>   .select($"app_id", $"idfa", input_file_name().as("fileName"))
>   .distinct()
>   .count()
>val query2 = spark.read.parquet(
>  "s3a://part-00040-a19f0d20-eab3-48ef-be5a-602c7f9a8e58.c000.gz.parquet"
> )
>   .select($"app_id", $"idfa")
>   .distinct() 
>   .count()
> Thread.sleep(100L)
>   }
> }
> {code}
> `query1` has all columns in the physical plan, while `query2` only two.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35144) Migrate to transformWithPruning or resolveWithPruning for object rules

2021-05-06 Thread Yingyi Bu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17340035#comment-17340035
 ] 

Yingyi Bu commented on SPARK-35144:
---

I'm working on this issue.

> Migrate to transformWithPruning or resolveWithPruning for object rules
> --
>
> Key: SPARK-35144
> URL: https://issues.apache.org/jira/browse/SPARK-35144
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 3.1.0
>Reporter: Yingyi Bu
>Priority: Major
>
> Rules in org/apache/spark/sql/catalyst/optimizer/objects.scala
>  
> Commit 
> [https://github.com/apache/spark/commit/3db8ec258c4a8438bda73c26fc7b1eb6f9d51631]
>  contains the framework level change and a few example rule changes.
>  
> Example patterns:
> [https://github.com/apache/spark/blob/3db8ec258c4a8438bda73c26fc7b1eb6f9d51631/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreePatterns.scala#L24-L32]
>  
> Example rule:
> [https://github.com/apache/spark/blob/3db8ec258c4a8438bda73c26fc7b1eb6f9d51631/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/CostBasedJoinReorder.scala]
>  
> [https://github.com/apache/spark/pull/32247] is another example



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35264) Support AQE side broadcastJoin threshold

2021-05-06 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17340026#comment-17340026
 ] 

Dongjoon Hyun commented on SPARK-35264:
---

Thank YOU, [~ulysses]! :)

> Support AQE side broadcastJoin threshold
> 
>
> Key: SPARK-35264
> URL: https://issues.apache.org/jira/browse/SPARK-35264
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: ulysses you
>Assignee: ulysses you
>Priority: Major
> Fix For: 3.2.0
>
>
> The main idea here is to isolate the join configs between the normal planner 
> and the AQE planner, which share the same code path.
> In practice we cannot fully trust the static stats when deciding whether a 
> broadcast hash join can be built. In our experience it's very common for 
> Spark to throw a broadcast timeout or a driver-side OOM exception when 
> executing a fairly large plan. And since a broadcast join is not reversible, 
> i.e. once we convert a join to a broadcast hash join, AQE cannot optimize it 
> again, it makes sense to decide whether to broadcast on the AQE side using a 
> different SQL config.
> In order to achieve this, we insert a specific join hint in advance during 
> the AQE framework, and JoinSelection then takes and follows the inserted 
> hint.
> For now we only support selecting the strategy for equi joins, in this order:
>  1. mark the join as a broadcast hash join if possible
>  2. mark the join as a shuffled hash join if possible
> Note that we don't override the join strategy if the user specifies a join 
> hint.
>  
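
A usage sketch of the resulting configuration; the AQE-side config name 
follows what this change introduced, and the values are invented for the 
example:

{code:scala}
// Keep the static threshold conservative, but let AQE, which sees accurate
// runtime stats, decide to broadcast larger relations. Values illustrative.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "10MB")
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.autoBroadcastJoinThreshold", "64MB")
{code}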



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31325) Control a plan explain mode in the events of SQL listeners via SQLConf

2021-05-06 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-31325:
-
Fix Version/s: 3.1.0

> Control a plan explain mode in the events of SQL listeners via SQLConf
> --
>
> Key: SPARK-31325
> URL: https://issues.apache.org/jira/browse/SPARK-31325
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Web UI
>Affects Versions: 3.1.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Major
> Fix For: 3.1.0
>
>
> This proposes to add a new SQL config for controlling the plan explain mode 
> in the events of SQL listeners (e.g., `SparkListenerSQLExecutionStart` and 
> `SparkListenerSQLAdaptiveExecutionUpdate`).
> In the current master, the output of `QueryExecution.toString` (which is 
> equivalent to the "extended" explain mode) is stored in these events. I think 
> it is useful to control the content via SQLConf.
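
A usage sketch, assuming the config name this change added 
(`spark.sql.ui.explainMode`):

{code:scala}
// Store the compact "formatted" plan in SQL listener events instead of the
// verbose "extended" output of QueryExecution.toString.
spark.conf.set("spark.sql.ui.explainMode", "formatted")
{code}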



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35293) Use the newer dsdgen for TPCDSQueryTestSuite

2021-05-06 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-35293.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/32420

> Use the newer dsdgen for TPCDSQueryTestSuite
> 
>
> Key: SPARK-35293
> URL: https://issues.apache.org/jira/browse/SPARK-35293
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.2.0
>Reporter: Takeshi Yamamuro
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 3.2.0
>
>
> This PR intends to replace `maropu/spark-tpcds-datagen` with 
> `databricks/tpcds-kit` in order to use a newer dsdgen, and to update the 
> golden files in `tpcds-query-results`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org