[jira] [Assigned] (SPARK-35159) extract doc of hive format

2021-04-22 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-35159:
---

Assignee: angerszhu

> extract doc of hive format
> --
>
> Key: SPARK-35159
> URL: https://issues.apache.org/jira/browse/SPARK-35159
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
>
> extract doc of hive format



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35159) extract doc of hive format

2021-04-22 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-35159.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32264
[https://github.com/apache/spark/pull/32264]

> extract doc of hive format
> --
>
> Key: SPARK-35159
> URL: https://issues.apache.org/jira/browse/SPARK-35159
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.2.0
>
>
> extract doc of hive format



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35143) Add default log config for spark-sql

2021-04-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-35143:


Assignee: hong dongdong

> Add default log config for spark-sql
> 
>
> Key: SPARK-35143
> URL: https://issues.apache.org/jira/browse/SPARK-35143
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell, SQL
>Affects Versions: 3.1.1
>Reporter: hong dongdong
>Assignee: hong dongdong
>Priority: Minor
>
> The default log level for spark-sql is WARN. How to change the log level is 
> confusing, so we need a default config.
>  
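
A minimal sketch of how the log level can already be adjusted at runtime, assuming a running SparkSession named {{spark}}; the ticket itself is about shipping a default log4j configuration for the spark-sql CLI, which this call does not replace:

{code:python}
from pyspark.sql import SparkSession

# A SparkSession is assumed; the pyspark shell creates one named `spark`.
spark = SparkSession.builder.appName("log-level-demo").getOrCreate()

# Override the log level for this application at runtime.
# Valid levels include ALL, DEBUG, INFO, WARN, ERROR, FATAL and OFF.
spark.sparkContext.setLogLevel("INFO")
{code}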



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35143) Add default log config for spark-sql

2021-04-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-35143.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32248
[https://github.com/apache/spark/pull/32248]

> Add default log config for spark-sql
> 
>
> Key: SPARK-35143
> URL: https://issues.apache.org/jira/browse/SPARK-35143
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell, SQL
>Affects Versions: 3.1.1
>Reporter: hong dongdong
>Assignee: hong dongdong
>Priority: Minor
> Fix For: 3.2.0
>
>
> The default log level for spark-sql is WARN. How to change the log level is 
> confusing, so we need a default config.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35141) Support two level map for final hash aggregation

2021-04-22 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-35141:
---

Assignee: Cheng Su

> Support two level map for final hash aggregation
> 
>
> Key: SPARK-35141
> URL: https://issues.apache.org/jira/browse/SPARK-35141
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Cheng Su
>Assignee: Cheng Su
>Priority: Minor
>
> For partial hash aggregation (code-gen path), we have two levels of hash map 
> for aggregation. The first level comes from `RowBasedHashMapGenerator`, which is 
> computationally faster than the second level from 
> `UnsafeFixedWidthAggregationMap`. Introducing a two-level hash map helps 
> improve the CPU performance of a query, as the first-level hash map normally 
> fits in hardware cache and has a cheaper hash function for key lookup.
> For final hash aggregation, we can also support a two-level hash map, to 
> improve query performance further.
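
For context, a minimal PySpark sketch of the kind of query this affects (the column names and sizes are made up); the physical plan printed by explain() contains the partial and final HashAggregate stages inside which the two-level hash map lives:

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("hash-agg-demo").getOrCreate()

# A simple group-by aggregation that runs through partial and then final
# hash aggregation in the code-gen path.
df = spark.range(0, 1000000).withColumn("key", F.col("id") % 100)
agg = df.groupBy("key").agg(F.sum("id").alias("total"))

# The plan shows two HashAggregate nodes: one partial, one final.
agg.explain()
{code}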



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35141) Support two level map for final hash aggregation

2021-04-22 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-35141.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32242
[https://github.com/apache/spark/pull/32242]

> Support two level map for final hash aggregation
> 
>
> Key: SPARK-35141
> URL: https://issues.apache.org/jira/browse/SPARK-35141
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Cheng Su
>Assignee: Cheng Su
>Priority: Minor
> Fix For: 3.2.0
>
>
> For partial hash aggregation (code-gen path), we have two levels of hash map 
> for aggregation. The first level comes from `RowBasedHashMapGenerator`, which is 
> computationally faster than the second level from 
> `UnsafeFixedWidthAggregationMap`. Introducing a two-level hash map helps 
> improve the CPU performance of a query, as the first-level hash map normally 
> fits in hardware cache and has a cheaper hash function for key lookup.
> For final hash aggregation, we can also support a two-level hash map, to 
> improve query performance further.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35195) Move InMemoryTable etc to org.apache.spark.sql.connector.catalog

2021-04-22 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh reassigned SPARK-35195:
---

Assignee: Chao Sun

> Move InMemoryTable etc to org.apache.spark.sql.connector.catalog
> 
>
> Key: SPARK-35195
> URL: https://issues.apache.org/jira/browse/SPARK-35195
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Trivial
>
> Currently test classes such as {{InMemoryTable}} reside in 
> {{org.apache.spark.sql.connector}} rather than 
> {{org.apache.spark.sql.connector.catalog}}. We should move them to the latter to 
> match the interfaces they implement.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35195) Move InMemoryTable etc to org.apache.spark.sql.connector.catalog

2021-04-22 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh resolved SPARK-35195.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32302
[https://github.com/apache/spark/pull/32302]

> Move InMemoryTable etc to org.apache.spark.sql.connector.catalog
> 
>
> Key: SPARK-35195
> URL: https://issues.apache.org/jira/browse/SPARK-35195
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Trivial
> Fix For: 3.2.0
>
>
> Currently test classes such as {{InMemoryTable}} reside in 
> {{org.apache.spark.sql.connector}} rather than 
> {{org.apache.spark.sql.connector.catalog}}. We should move them to the latter to 
> match the interfaces they implement.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35200) Avoid to recompute the pending speculative tasks in the ExecutorAllocationManager and remove unnecessary code

2021-04-22 Thread weixiuli (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

weixiuli updated SPARK-35200:
-
Fix Version/s: 3.0.1
   3.0.2
   3.1.0
   3.1.1

> Avoid to recompute the pending speculative tasks in the 
> ExecutorAllocationManager and remove unnecessary code
> -
>
> Key: SPARK-35200
> URL: https://issues.apache.org/jira/browse/SPARK-35200
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Affects Versions: 3.0.2, 3.1.0, 3.1.1
>Reporter: weixiuli
>Priority: Major
> Fix For: 3.0.1, 3.0.2, 3.1.0, 3.1.1
>
>
> The number of the pending speculative tasks is recomputed in the 
> ExecutorAllocationManager to calculate the maximum number of executors 
> required. However, it only needs to be computed once, which improves performance.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35200) Avoid to recompute the pending speculative tasks in the ExecutorAllocationManager and remove unnecessary code

2021-04-22 Thread weixiuli (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

weixiuli updated SPARK-35200:
-
Description: 
The number of the pending speculative tasks is recomputed in the 
ExecutorAllocationManager to calculate the maximum number of executors 
required. While , it only needs to be computed once to improve  performance.


  was:
The number of the pending speculative tasks is recomputed in the 
ExecutorAllocationManager to calculate the maximum number of executors 
required. while , it only needs to be computed once to improve  performance.



> Avoid to recompute the pending speculative tasks in the 
> ExecutorAllocationManager and remove unnecessary code
> -
>
> Key: SPARK-35200
> URL: https://issues.apache.org/jira/browse/SPARK-35200
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Affects Versions: 3.0.2, 3.1.0, 3.1.1
>Reporter: weixiuli
>Priority: Major
>
> The number of the pending speculative tasks is recomputed in the 
> ExecutorAllocationManager to calculate the maximum number of executors 
> required. While , it only needs to be computed once to improve  performance.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35075) Migrate to transformWithPruning or resolveWithPruning for subquery related rules

2021-04-22 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang reassigned SPARK-35075:
--

Assignee: Yingyi Bu

> Migrate to transformWithPruning or resolveWithPruning for subquery related 
> rules
> 
>
> Key: SPARK-35075
> URL: https://issues.apache.org/jira/browse/SPARK-35075
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 3.1.0
>Reporter: Yingyi Bu
>Assignee: Yingyi Bu
>Priority: Major
>
> Migrate transform/resolve functions to transformXxxWithPruning or 
> resolveXxWithPruning, so that we can add tree traversal pruning conditions.
> RewriteCorrelatedScalarSubquery will be supported in a separate issue, 
> SPARK-35148.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35075) Migrate to transformWithPruning or resolveWithPruning for subquery related rules

2021-04-22 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-35075.

Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32247
[https://github.com/apache/spark/pull/32247]

> Migrate to transformWithPruning or resolveWithPruning for subquery related 
> rules
> 
>
> Key: SPARK-35075
> URL: https://issues.apache.org/jira/browse/SPARK-35075
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 3.1.0
>Reporter: Yingyi Bu
>Assignee: Yingyi Bu
>Priority: Major
> Fix For: 3.2.0
>
>
> Migrate transform/resolve functions to transformXxxWithPruning or 
> resolveXxWithPruning, so that we can add tree traversal pruning conditions.
> RewriteCorrelatedScalarSubquery will be supported in a separate issue, 
> SPARK-35148.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35176) Raise TypeError in inappropriate type case rather than ValueError

2021-04-22 Thread Yikun Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17329971#comment-17329971
 ] 

Yikun Jiang commented on SPARK-35176:
-

I wrote up a POC in [https://github.com/Yikun/annotation-type-checker/pull/4] 
to add a simple way to do runtime type checking.

>  Raise TypeError in inappropriate type case rather than ValueError
> --
>
> Key: SPARK-35176
> URL: https://issues.apache.org/jira/browse/SPARK-35176
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Yikun Jiang
>Priority: Minor
>
> There are many places that wrongly use ValueError.
> When an operation or function is applied to an object of an inappropriate type, 
> we should raise TypeError rather than ValueError, for example:
> [https://github.com/apache/spark/blob/355c39939d9e4c87ffc9538eb822a41cb2ff93fb/python/pyspark/sql/dataframe.py#L1137]
> [https://github.com/apache/spark/blob/355c39939d9e4c87ffc9538eb822a41cb2ff93fb/python/pyspark/sql/dataframe.py#L1228]
>  
> We should make these corrections at the right time; note that doing so 
> will break existing code that catches the original ValueError.
>  
> [1] https://docs.python.org/3/library/exceptions.html#TypeError
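
To make the distinction concrete, a small illustrative sketch (the function name is made up, not PySpark code): TypeError is for an argument of the wrong type, while ValueError is for a correctly typed argument with an unacceptable value.

{code:python}
def set_sampling_fraction(fraction):
    # Wrong type: raise TypeError, as the issue proposes.
    if not isinstance(fraction, float):
        raise TypeError(
            "fraction must be a float, got %s" % type(fraction).__name__)
    # Right type but unacceptable value: ValueError is appropriate here.
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0.0, 1.0], got %r" % fraction)
    return fraction

set_sampling_fraction(0.1)     # ok
# set_sampling_fraction("0.1") # raises TypeError
# set_sampling_fraction(2.0)   # raises ValueError
{code}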



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35200) Avoid to recompute the pending speculative tasks in the ExecutorAllocationManager and remove unnecessary code

2021-04-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17329949#comment-17329949
 ] 

Apache Spark commented on SPARK-35200:
--

User 'weixiuli' has created a pull request for this issue:
https://github.com/apache/spark/pull/32306

> Avoid to recompute the pending speculative tasks in the 
> ExecutorAllocationManager and remove unnecessary code
> -
>
> Key: SPARK-35200
> URL: https://issues.apache.org/jira/browse/SPARK-35200
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Affects Versions: 3.0.2, 3.1.0, 3.1.1
>Reporter: weixiuli
>Priority: Major
>
> The number of the pending speculative tasks is recomputed in the 
> ExecutorAllocationManager to calculate the maximum number of executors 
> required. However, it only needs to be computed once, which improves performance.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35200) Avoid to recompute the pending speculative tasks in the ExecutorAllocationManager and remove unnecessary code

2021-04-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35200:


Assignee: Apache Spark

> Avoid to recompute the pending speculative tasks in the 
> ExecutorAllocationManager and remove unnecessary code
> -
>
> Key: SPARK-35200
> URL: https://issues.apache.org/jira/browse/SPARK-35200
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Affects Versions: 3.0.2, 3.1.0, 3.1.1
>Reporter: weixiuli
>Assignee: Apache Spark
>Priority: Major
>
> The number of the pending speculative tasks is recomputed in the 
> ExecutorAllocationManager to calculate the maximum number of executors 
> required. However, it only needs to be computed once, which improves performance.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35200) Avoid to recompute the pending speculative tasks in the ExecutorAllocationManager and remove unnecessary code

2021-04-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35200:


Assignee: (was: Apache Spark)

> Avoid to recompute the pending speculative tasks in the 
> ExecutorAllocationManager and remove unnecessary code
> -
>
> Key: SPARK-35200
> URL: https://issues.apache.org/jira/browse/SPARK-35200
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Affects Versions: 3.0.2, 3.1.0, 3.1.1
>Reporter: weixiuli
>Priority: Major
>
> The number of the pending speculative tasks is recomputed in the 
> ExecutorAllocationManager to calculate the maximum number of executors 
> required. However, it only needs to be computed once, which improves performance.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35200) Avoid to recompute the pending speculative tasks in the ExecutorAllocationManager and remove unnecessary code

2021-04-22 Thread weixiuli (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

weixiuli updated SPARK-35200:
-
Summary: Avoid to recompute the pending speculative tasks in the 
ExecutorAllocationManager and remove unnecessary code  (was: Avoid to recompute 
the pending tasks in the ExecutorAllocationManager and remove unnecessary code)

> Avoid to recompute the pending speculative tasks in the 
> ExecutorAllocationManager and remove unnecessary code
> -
>
> Key: SPARK-35200
> URL: https://issues.apache.org/jira/browse/SPARK-35200
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core
>Affects Versions: 3.0.2, 3.1.0, 3.1.1
>Reporter: weixiuli
>Priority: Major
>
> The number of the pending speculative tasks is recomputed in the 
> ExecutorAllocationManager to calculate the maximum number of executors 
> required. However, it only needs to be computed once, which improves performance.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35200) Avoid to recompute the pending tasks in the ExecutorAllocationManager and remove unnecessary code

2021-04-22 Thread weixiuli (Jira)
weixiuli created SPARK-35200:


 Summary: Avoid to recompute the pending tasks in the 
ExecutorAllocationManager and remove unnecessary code
 Key: SPARK-35200
 URL: https://issues.apache.org/jira/browse/SPARK-35200
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler, Spark Core
Affects Versions: 3.1.1, 3.1.0, 3.0.2
Reporter: weixiuli


The number of the pending speculative tasks is recomputed in the 
ExecutorAllocationManager to calculate the maximum number of executors 
required. However, it only needs to be computed once, which improves performance.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21499) Support creating persistent function for Spark UDAF(UserDefinedAggregateFunction)

2021-04-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-21499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17329937#comment-17329937
 ] 

Apache Spark commented on SPARK-21499:
--

User 'turboFei' has created a pull request for this issue:
https://github.com/apache/spark/pull/32305

> Support creating persistent function for Spark 
> UDAF(UserDefinedAggregateFunction)
> -
>
> Key: SPARK-21499
> URL: https://issues.apache.org/jira/browse/SPARK-21499
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Major
> Fix For: 2.3.0
>
>
> Support creating persistent functions for Spark 
> UDAF(UserDefinedAggregateFunction)
> For example,
> {noformat}
> CREATE FUNCTION myDoubleAvg AS 'test.org.apache.spark.sql.MyDoubleAvg'
> {noformat}
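
A hedged sketch of issuing the statement from PySpark once such support exists (whether it works depends on the Spark version, since this issue is precisely about supporting UDAFs as persistent functions; the JAR path is a placeholder and {{MyDoubleAvg}} is the test UDAF named above):

{code:python}
from pyspark.sql import SparkSession

# Persistent (catalog-backed) functions require Hive support.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Register the UDAF as a persistent function; the JAR location is a placeholder.
spark.sql("""
  CREATE FUNCTION myDoubleAvg
  AS 'test.org.apache.spark.sql.MyDoubleAvg'
  USING JAR '/path/to/udaf-test.jar'
""")

# Once registered, the function can be used like any other aggregate.
spark.sql(
    "SELECT myDoubleAvg(v) FROM VALUES (1.0), (2.0), (3.0) AS t(v)").show()
{code}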



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21499) Support creating persistent function for Spark UDAF(UserDefinedAggregateFunction)

2021-04-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-21499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17329936#comment-17329936
 ] 

Apache Spark commented on SPARK-21499:
--

User 'turboFei' has created a pull request for this issue:
https://github.com/apache/spark/pull/32305

> Support creating persistent function for Spark 
> UDAF(UserDefinedAggregateFunction)
> -
>
> Key: SPARK-21499
> URL: https://issues.apache.org/jira/browse/SPARK-21499
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Major
> Fix For: 2.3.0
>
>
> Support creating persistent functions for Spark 
> UDAF(UserDefinedAggregateFunction)
> For example,
> {noformat}
> CREATE FUNCTION myDoubleAvg AS 'test.org.apache.spark.sql.MyDoubleAvg'
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35199) Tasks are failing with zstd default of spark.shuffle.mapStatus.compression.codec

2021-04-22 Thread Leonard Lausen (Jira)
Leonard Lausen created SPARK-35199:
--

 Summary: Tasks are failing with zstd default of 
spark.shuffle.mapStatus.compression.codec
 Key: SPARK-35199
 URL: https://issues.apache.org/jira/browse/SPARK-35199
 Project: Spark
  Issue Type: Task
  Components: PySpark
Affects Versions: 3.0.1
Reporter: Leonard Lausen


In Spark 3.0.1, tasks fail with the default value of 
{{spark.shuffle.mapStatus.compression.codec=zstd}}, but work without problems 
when the value is changed to {{spark.shuffle.mapStatus.compression.codec=lz4}}.

Example backtrace:
 
{code:java}
java.io.IOException: Decompression error: Version not supported at 
com.github.luben.zstd.ZstdInputStream.readInternal(ZstdInputStream.java:164) at 
com.github.luben.zstd.ZstdInputStream.read(ZstdInputStream.java:120) at 
java.io.BufferedInputStream.fill(BufferedInputStream.java:246) at 
java.io.BufferedInputStream.read1(BufferedInputStream.java:286) at 
java.io.BufferedInputStream.read(BufferedInputStream.java:345) at 
java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2781) at 
java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2797)
 at 
java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3274)
 at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:934) at 
java.io.ObjectInputStream.<init>(ObjectInputStream.java:396) at 
org.apache.spark.MapOutputTracker$.deserializeObject$1(MapOutputTracker.scala:954)
 at 
org.apache.spark.MapOutputTracker$.deserializeMapStatuses(MapOutputTracker.scala:964)
 at 
org.apache.spark.MapOutputTrackerWorker.$anonfun$getStatuses$2(MapOutputTracker.scala:856)
 at org.apache.spark.util.KeyLock.withLock(KeyLock.scala:64) at 
org.apache.spark.MapOutputTrackerWorker.getStatuses(MapOutputTracker.scala:851) 
at 
org.apache.spark.MapOutputTrackerWorker.getMapSizesByExecutorId(MapOutputTracker.scala:808)
 at 
org.apache.spark.shuffle.sort.SortShuffleManager.getReader(SortShuffleManager.scala:128)
 at 
org.apache.spark.sql.execution.ShuffledRowRDD.compute(ShuffledRowRDD.scala:185) 
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at 
org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89) 
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:313) at 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at 
org.apache.spark.scheduler.Task.run(Task.scala:127) at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
at java.lang.Thread.run(Thread.java:748)  {code}
Example code to reproduce the issue:
{code:java}
import pyspark.sql.functions as F
df = spark.read.text("s3://my-bucket-with-300GB-compressed-text-files")
df_rand = df.orderBy(F.rand(1))
df_rand.write.text('s3://shuffled-output'){code}
See 
[https://stackoverflow.com/questions/64876463/spark-3-0-1-tasks-are-failing-when-using-zstd-compression-codec]
 for another report of this issue and workaround.
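
A minimal sketch of the workaround described above; the configuration key and value come from the report, while the session setup and the sample job are illustrative:

{code:python}
from pyspark.sql import SparkSession

# Switch the map-status codec back to lz4 to avoid the zstd decompression error.
# This must be set before the SparkContext is created.
spark = (
    SparkSession.builder
    .appName("mapstatus-codec-workaround")
    .config("spark.shuffle.mapStatus.compression.codec", "lz4")
    .getOrCreate()
)

# Any shuffle-heavy job (such as the global sort in the report) exercises the codec.
df = spark.range(0, 1000000)
df.orderBy("id").count()
{code}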



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35184) Filtering a dataframe after groupBy and user-define-aggregate-function in Pyspark will cause java.lang.UnsupportedOperationException

2021-04-22 Thread Xiao Jin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Jin updated SPARK-35184:
-
Description: 
I found a strange error while writing a PySpark UDAF. After calling the groupBy 
and agg functions, I want to filter some data from the remaining dataframe, 
but it does not seem to work. My sample code is below.
{code:java}
>>> from pyspark.sql.functions import pandas_udf, PandasUDFType, col
>>> df = spark.createDataFrame(
... [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
... ("id", "v"))
>>> @pandas_udf("double", PandasUDFType.GROUPED_AGG)
... def mean_udf(v):
... return v.mean()
>>> df.groupby("id").agg(mean_udf(df['v']).alias("mean")).filter(col("mean") > 
>>> 5).show()
{code}
The code above will cause exception printed below
{code:java}
Traceback (most recent call last):
  File "", line 1, in 
  File "/opt/spark/python/pyspark/sql/dataframe.py", line 378, in show
print(self._jdf.showString(n, 20, vertical))
  File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 
1257, in __call__
  File "/opt/spark/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
  File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, 
in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o3717.showString.
: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
Exchange hashpartitioning(id#1726L, 200)
+- *(1) Filter (mean_udf(v#1727) > 5.0)
   +- Scan ExistingRDD[id#1726L,v#1727]

at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
at 
org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.doExecute(ShuffleExchangeExec.scala:119)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at 
org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:391)
at org.apache.spark.sql.execution.SortExec.inputRDDs(SortExec.scala:121)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:627)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at 
org.apache.spark.sql.execution.python.AggregateInPandasExec.doExecute(AggregateInPandasExec.scala:80)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at 
org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247)
at 
org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:339)
at 
org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3383)
at 
org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2544)
at 
org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2544)
at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3364)
at 
org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3363)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2544)
at 

[jira] [Commented] (SPARK-35191) all columns are read even if column pruning applies when spark3.0 read table written by spark2.2

2021-04-22 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17329932#comment-17329932
 ] 

Yuming Wang commented on SPARK-35191:
-

Could you check if it works after https://github.com/apache/spark/pull/31993?

> all columns are read even if column pruning applies when spark3.0 read table 
> written by spark2.2
> 
>
> Key: SPARK-35191
> URL: https://issues.apache.org/jira/browse/SPARK-35191
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 3.0.0
> Environment: spark3.0
> spark.sql.hive.convertMetastoreOrc=true(default value in spark3.0)
> spark.sql.orc.impl=native(default value in spark3.0)
>Reporter: xiaoli
>Priority: Major
>
> Before I address this issue, let me give some background: the current Spark 
> version we use is 2.2, and we plan to migrate to spark3.0 in the near future. 
> Before migrating, we tested some queries in both spark2.2 and spark3.0 to check 
> for potential issues. The data source tables of these queries are in ORC format, 
> written by spark2.2.
>  
> I find that even if column pruning is applied, spark3.0’s native reader will 
> read all columns.
>  
> Then I did some remote debugging. In OrcUtils.scala’s requestedColumnIds method, 
> it checks whether the field names start with “_col”. In my case, the field names 
> do start with “_col”, like “_col1”, “_col2”, so pruneCols is not done. The 
> code is below:
>  
> if (orcFieldNames.forall(_.startsWith("_col"))) {
>   // This is a ORC file written by Hive, no field names in the physical 
> schema, assume the
>   // physical schema maps to the data scheme by index.
>   assert(orcFieldNames.length <= dataSchema.length, "The given data schema " +
>     s"${dataSchema.catalogString} has less fields than the actual ORC 
> physical schema, " +
>     "no idea which columns were dropped, fail to read.")
>   // for ORC file written by Hive, no field names
>   // in the physical schema, there is a need to send the
>   // entire dataSchema instead of required schema.
>   // So pruneCols is not done in this case
>   Some(requiredSchema.fieldNames.map { name =>
>     val index = dataSchema.fieldIndex(name)
>     if (index < orcFieldNames.length) {
>       index
>     } else {
>       -1
>     }
>   }, false)
>  
> Although this code comment explains the reason, I still do not understand it. 
> This issue only happens in one case: spark3.0 uses the native reader to read a 
> table written by spark2.2.
>  
> In other cases, there is no such issue. I did another 2 tests:
> Test1: use spark3.0’s hive reader (running with 
> spark.sql.hive.convertMetastoreOrc=false and spark.sql.orc.impl=hive) to read 
> the same table; it only reads the pruned columns.
> Test2: use spark3.0 to write a table, then use spark3.0’s native reader to 
> read this new table; it only reads the pruned columns.
>  
> This issue is a blocker for us to use the native reader in spark3.0. Does 
> anyone know the underlying reason, or can anyone suggest a solution?
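
Based on Test1 above, a hedged PySpark sketch of the fallback to the Hive ORC reader (the table and column names are hypothetical, and these options may need to be set before the session is first created):

{code:python}
from pyspark.sql import SparkSession

# Force the Hive ORC reader, which the reporter observed does prune columns.
spark = (
    SparkSession.builder
    .enableHiveSupport()
    .config("spark.sql.hive.convertMetastoreOrc", "false")
    .config("spark.sql.orc.impl", "hive")
    .getOrCreate()
)

# Hypothetical table written by spark2.2; only the selected columns should be read.
spark.table("legacy_orc_table").select("col_a", "col_b").explain()
{code}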



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35040) Remove Spark-version related codes from test codes.

2021-04-22 Thread Takuya Ueshin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin resolved SPARK-35040.
---
Fix Version/s: 3.2.0
 Assignee: Xinrong Meng
   Resolution: Fixed

Issue resolved by pull request 32300
[https://github.com/apache/spark/pull/32300]

> Remove Spark-version related codes from test codes.
> ---
>
> Key: SPARK-35040
> URL: https://issues.apache.org/jira/browse/SPARK-35040
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.2.0
>
>
> There are several places to check the PySpark version and switch the tests, 
> but now those are not necessary.
> We should remove them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35182) Support driver-owned on-demand PVC

2021-04-22 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-35182:
-

Assignee: Dongjoon Hyun

> Support driver-owned on-demand PVC
> --
>
> Key: SPARK-35182
> URL: https://issues.apache.org/jira/browse/SPARK-35182
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35182) Support driver-owned on-demand PVC

2021-04-22 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-35182.
---
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32288
[https://github.com/apache/spark/pull/32288]

> Support driver-owned on-demand PVC
> --
>
> Key: SPARK-35182
> URL: https://issues.apache.org/jira/browse/SPARK-35182
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.2.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35198) Add support for calling debugCodegen from Python & Java

2021-04-22 Thread Holden Karau (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Holden Karau updated SPARK-35198:
-
Labels: starter  (was: )

> Add support for calling debugCodegen from Python & Java
> ---
>
> Key: SPARK-35198
> URL: https://issues.apache.org/jira/browse/SPARK-35198
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.0.1, 3.0.2, 3.1.0, 3.1.1, 3.2.0
>Reporter: Holden Karau
>Priority: Minor
>  Labels: starter
>
> Because it is implemented with an implicit conversion, it is a bit complicated 
> to call; we should add a direct method to get debug state for Java & Python 
> users of DataFrames.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35198) Add support for calling debugCodegen from Python & Java

2021-04-22 Thread Holden Karau (Jira)
Holden Karau created SPARK-35198:


 Summary: Add support for calling debugCodegen from Python & Java
 Key: SPARK-35198
 URL: https://issues.apache.org/jira/browse/SPARK-35198
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, SQL
Affects Versions: 3.1.1, 3.1.0, 3.0.2, 3.0.1, 3.2.0
Reporter: Holden Karau


Because it is implemented with an implicit conversion, it is a bit complicated to 
call; we should add a direct method to get debug state for Java & Python users 
of DataFrames.
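
As an aside, a minimal sketch of what Python users can already do in Spark 3.x: DataFrame.explain with the codegen mode prints the generated code for a plan, which is close to what a direct debugCodegen method would expose (the DataFrame below is illustrative):

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("codegen-demo").getOrCreate()

df = spark.range(0, 1000).withColumn("doubled", F.col("id") * 2)

# Prints the whole-stage-codegen subtrees together with their generated Java code.
df.explain(mode="codegen")
{code}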



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35197) Accumulators Explore Page on Spark UI in History Server

2021-04-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35197:


Assignee: (was: Apache Spark)

> Accumulators Explore Page on Spark UI in History Server
> ---
>
> Key: SPARK-35197
> URL: https://issues.apache.org/jira/browse/SPARK-35197
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, Web UI
>Affects Versions: 2.4.4
>Reporter: Frida Montserrat Pulido Padilla
>Priority: Minor
>  Labels: accumulators, ui
> Fix For: 2.4.4
>
>
> Proposal for an *Accumulators Explore Page* on the *Spark UI*: the accumulator 
> information will be located under a new tab with an overview page and links to 
> more detailed information about the accumulators for a particular name or stage.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35197) Accumulators Explore Page on Spark UI in History Server

2021-04-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17329867#comment-17329867
 ] 

Apache Spark commented on SPARK-35197:
--

User 'FridaPolished' has created a pull request for this issue:
https://github.com/apache/spark/pull/32304

> Accumulators Explore Page on Spark UI in History Server
> ---
>
> Key: SPARK-35197
> URL: https://issues.apache.org/jira/browse/SPARK-35197
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, Web UI
>Affects Versions: 2.4.4
>Reporter: Frida Montserrat Pulido Padilla
>Priority: Minor
>  Labels: accumulators, ui
> Fix For: 2.4.4
>
>
> Proposal for an *Accumulators Explore Page* on the *Spark UI*: the accumulator 
> information will be located under a new tab with an overview page and links to 
> more detailed information about the accumulators for a particular name or stage.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35197) Accumulators Explore Page on Spark UI in History Server

2021-04-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35197:


Assignee: Apache Spark

> Accumulators Explore Page on Spark UI in History Server
> ---
>
> Key: SPARK-35197
> URL: https://issues.apache.org/jira/browse/SPARK-35197
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, Web UI
>Affects Versions: 2.4.4
>Reporter: Frida Montserrat Pulido Padilla
>Assignee: Apache Spark
>Priority: Minor
>  Labels: accumulators, ui
> Fix For: 2.4.4
>
>
> Proposal for an *Accumulators Explore Page* on the *Spark UI*: the accumulator 
> information will be located under a new tab with an overview page and links to 
> more detailed information about the accumulators for a particular name or stage.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35197) Accumulators Explore Page on Spark UI in History Server

2021-04-22 Thread Frida Montserrat Pulido Padilla (Jira)
Frida Montserrat Pulido Padilla created SPARK-35197:
---

 Summary: Accumulators Explore Page on Spark UI in History Server
 Key: SPARK-35197
 URL: https://issues.apache.org/jira/browse/SPARK-35197
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core, Web UI
Affects Versions: 2.4.4
Reporter: Frida Montserrat Pulido Padilla
 Fix For: 2.4.4


Proposal for an *Accumulators Explore Page* on the *Spark UI*: the accumulator 
information will be located under a new tab with an overview page and links to 
more detailed information about the accumulators for a particular name or stage.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35169) Wrong result of min ANSI interval division by -1

2021-04-22 Thread Max Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17329405#comment-17329405
 ] 

Max Gekk commented on SPARK-35169:
--

Since https://github.com/apache/spark/pull/32260 has been merged, we could fix 
intervals too. cc [~Gengliang.Wang] [~angerszhuuu] [~beliefer] [~cloud_fan]

> Wrong result of min ANSI interval division by -1
> 
>
> Key: SPARK-35169
> URL: https://issues.apache.org/jira/browse/SPARK-35169
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Priority: Major
>
> The code below portraits the issue:
> {code:scala}
> scala> Seq(java.time.Period.ofMonths(Int.MinValue)).toDF("i").select($"i" / 
> -1).show(false)
> +-+
> |(i / -1) |
> +-+
> |INTERVAL '-178956970-8' YEAR TO MONTH|
> +-+
> scala> Seq(java.time.Duration.of(Long.MinValue, 
> java.time.temporal.ChronoUnit.MICROS)).toDF("i").select($"i" / -1).show(false)
> +---+
> |(i / -1)   |
> +---+
> |INTERVAL '-106751991 04:00:54.775808' DAY TO SECOND|
> +---+
> {code}
> The result cannot be a negative interval. Spark must throw an overflow 
> exception.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34999) Consolidate PySpark testing utils

2021-04-22 Thread Takuya Ueshin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takuya Ueshin resolved SPARK-34999.
---
Fix Version/s: 3.2.0
 Assignee: Xinrong Meng
   Resolution: Fixed

Issue resolved by pull request 32177
[https://github.com/apache/spark/pull/32177]

> Consolidate PySpark testing utils
> -
>
> Key: SPARK-34999
> URL: https://issues.apache.org/jira/browse/SPARK-34999
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.2.0
>
>
> `python/pyspark/pandas/testing` holds test utilities for pandas-on-spark, and 
> `python/pyspark/testing` contains test utilities for pyspark. Consolidating 
> them makes the code cleaner and easier to maintain.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34382) ANSI SQL: LATERAL derived table(T491)

2021-04-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34382:


Assignee: (was: Apache Spark)

> ANSI SQL: LATERAL derived table(T491)
> -
>
> Key: SPARK-34382
> URL: https://issues.apache.org/jira/browse/SPARK-34382
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> Subqueries appearing in {{FROM}} can be preceded by the key word {{LATERAL}}. 
> This allows them to reference columns provided by preceding {{FROM}} items. 
> (Without {{LATERAL}}, each subquery is evaluated independently and so cannot 
> cross-reference any other {{FROM}} item.)
> Table functions appearing in {{FROM}} can also be preceded by the key word 
> {{LATERAL}}, but for functions the key word is optional; the function's 
> arguments can contain references to columns provided by preceding {{FROM}} 
> items in any case.
> A {{LATERAL}} item can appear at top level in the {{FROM}} list, or within a 
> {{JOIN}} tree. In the latter case it can also refer to any items that are on 
> the left-hand side of a {{JOIN}} that it is on the right-hand side of.
> When a {{FROM}} item contains {{LATERAL}} cross-references, evaluation 
> proceeds as follows: for each row of the {{FROM}} item providing the 
> cross-referenced column(s), or set of rows of multiple {{FROM}} items 
> providing the columns, the {{LATERAL}} item is evaluated using that row or 
> row set's values of the columns. The resulting row(s) are joined as usual 
> with the rows they were computed from. This is repeated for each row or set 
> of rows from the column source table(s).
> A trivial example of {{LATERAL}} is
> {code:sql}
> SELECT * FROM foo, LATERAL (SELECT * FROM bar WHERE bar.id = foo.bar_id) ss;
> {code}
> *Feature ID*: T491
> [https://www.postgresql.org/docs/11/queries-table-expressions.html#QUERIES-FROM]
> [https://github.com/postgres/postgres/commit/5ebaaa49445eb1ba7b299bbea3a477d4e4c0430]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34382) ANSI SQL: LATERAL derived table(T491)

2021-04-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17329395#comment-17329395
 ] 

Apache Spark commented on SPARK-34382:
--

User 'allisonwang-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/32303

> ANSI SQL: LATERAL derived table(T491)
> -
>
> Key: SPARK-34382
> URL: https://issues.apache.org/jira/browse/SPARK-34382
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> Subqueries appearing in {{FROM}} can be preceded by the key word {{LATERAL}}. 
> This allows them to reference columns provided by preceding {{FROM}} items. 
> (Without {{LATERAL}}, each subquery is evaluated independently and so cannot 
> cross-reference any other {{FROM}} item.)
> Table functions appearing in {{FROM}} can also be preceded by the key word 
> {{LATERAL}}, but for functions the key word is optional; the function's 
> arguments can contain references to columns provided by preceding {{FROM}} 
> items in any case.
> A {{LATERAL}} item can appear at top level in the {{FROM}} list, or within a 
> {{JOIN}} tree. In the latter case it can also refer to any items that are on 
> the left-hand side of a {{JOIN}} that it is on the right-hand side of.
> When a {{FROM}} item contains {{LATERAL}} cross-references, evaluation 
> proceeds as follows: for each row of the {{FROM}} item providing the 
> cross-referenced column(s), or set of rows of multiple {{FROM}} items 
> providing the columns, the {{LATERAL}} item is evaluated using that row or 
> row set's values of the columns. The resulting row(s) are joined as usual 
> with the rows they were computed from. This is repeated for each row or set 
> of rows from the column source table(s).
> A trivial example of {{LATERAL}} is
> {code:sql}
> SELECT * FROM foo, LATERAL (SELECT * FROM bar WHERE bar.id = foo.bar_id) ss;
> {code}
> *Feature ID*: T491
> [https://www.postgresql.org/docs/11/queries-table-expressions.html#QUERIES-FROM]
> [https://github.com/postgres/postgres/commit/5ebaaa49445eb1ba7b299bbea3a477d4e4c0430]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34382) ANSI SQL: LATERAL derived table(T491)

2021-04-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34382:


Assignee: Apache Spark

> ANSI SQL: LATERAL derived table(T491)
> -
>
> Key: SPARK-34382
> URL: https://issues.apache.org/jira/browse/SPARK-34382
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>
> Subqueries appearing in {{FROM}} can be preceded by the key word {{LATERAL}}. 
> This allows them to reference columns provided by preceding {{FROM}} items. 
> (Without {{LATERAL}}, each subquery is evaluated independently and so cannot 
> cross-reference any other {{FROM}} item.)
> Table functions appearing in {{FROM}} can also be preceded by the key word 
> {{LATERAL}}, but for functions the key word is optional; the function's 
> arguments can contain references to columns provided by preceding {{FROM}} 
> items in any case.
> A {{LATERAL}} item can appear at top level in the {{FROM}} list, or within a 
> {{JOIN}} tree. In the latter case it can also refer to any items that are on 
> the left-hand side of a {{JOIN}} that it is on the right-hand side of.
> When a {{FROM}} item contains {{LATERAL}} cross-references, evaluation 
> proceeds as follows: for each row of the {{FROM}} item providing the 
> cross-referenced column(s), or set of rows of multiple {{FROM}} items 
> providing the columns, the {{LATERAL}} item is evaluated using that row or 
> row set's values of the columns. The resulting row(s) are joined as usual 
> with the rows they were computed from. This is repeated for each row or set 
> of rows from the column source table(s).
> A trivial example of {{LATERAL}} is
> {code:sql}
> SELECT * FROM foo, LATERAL (SELECT * FROM bar WHERE bar.id = foo.bar_id) ss;
> {code}
> *Feature ID*: T491
> [https://www.postgresql.org/docs/11/queries-table-expressions.html#QUERIES-FROM]
> [https://github.com/postgres/postgres/commit/5ebaaa49445eb1ba7b299bbea3a477d4e4c0430]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34382) ANSI SQL: LATERAL derived table(T491)

2021-04-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17329394#comment-17329394
 ] 

Apache Spark commented on SPARK-34382:
--

User 'allisonwang-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/32303

> ANSI SQL: LATERAL derived table(T491)
> -
>
> Key: SPARK-34382
> URL: https://issues.apache.org/jira/browse/SPARK-34382
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> Subqueries appearing in {{FROM}} can be preceded by the key word {{LATERAL}}. 
> This allows them to reference columns provided by preceding {{FROM}} items. 
> (Without {{LATERAL}}, each subquery is evaluated independently and so cannot 
> cross-reference any other {{FROM}} item.)
> Table functions appearing in {{FROM}} can also be preceded by the key word 
> {{LATERAL}}, but for functions the key word is optional; the function's 
> arguments can contain references to columns provided by preceding {{FROM}} 
> items in any case.
> A {{LATERAL}} item can appear at top level in the {{FROM}} list, or within a 
> {{JOIN}} tree. In the latter case it can also refer to any items that are on 
> the left-hand side of a {{JOIN}} that it is on the right-hand side of.
> When a {{FROM}} item contains {{LATERAL}} cross-references, evaluation 
> proceeds as follows: for each row of the {{FROM}} item providing the 
> cross-referenced column(s), or set of rows of multiple {{FROM}} items 
> providing the columns, the {{LATERAL}} item is evaluated using that row or 
> row set's values of the columns. The resulting row(s) are joined as usual 
> with the rows they were computed from. This is repeated for each row or set 
> of rows from the column source table(s).
> A trivial example of {{LATERAL}} is
> {code:sql}
> SELECT * FROM foo, LATERAL (SELECT * FROM bar WHERE bar.id = foo.bar_id) ss;
> {code}
> *Feature ID*: T491
> [https://www.postgresql.org/docs/11/queries-table-expressions.html#QUERIES-FROM]
> [https://github.com/postgres/postgres/commit/5ebaaa49445eb1ba7b299bbea3a477d4e4c0430]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35195) Move InMemoryTable etc to org.apache.spark.sql.connector.catalog

2021-04-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35195:


Assignee: Apache Spark

> Move InMemoryTable etc to org.apache.spark.sql.connector.catalog
> 
>
> Key: SPARK-35195
> URL: https://issues.apache.org/jira/browse/SPARK-35195
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Assignee: Apache Spark
>Priority: Trivial
>
> Currently test classes such as {{InMemoryTable}} reside in 
> {{org.apache.spark.sql.connector}} rather than 
> {{org.apache.spark.sql.connector.catalog}}. We should move them to the latter to 
> match the interfaces they implement.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35195) Move InMemoryTable etc to org.apache.spark.sql.connector.catalog

2021-04-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35195:


Assignee: (was: Apache Spark)

> Move InMemoryTable etc to org.apache.spark.sql.connector.catalog
> 
>
> Key: SPARK-35195
> URL: https://issues.apache.org/jira/browse/SPARK-35195
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Priority: Trivial
>
> Currently test classes such as {{InMemoryTable}} reside in 
> {{org.apache.spark.sql.connector}} rather than 
> {{org.apache.spark.sql.connector.catalog}}. We should move them to the latter to 
> match the interfaces they implement.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35195) Move InMemoryTable etc to org.apache.spark.sql.connector.catalog

2021-04-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35195?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17329393#comment-17329393
 ] 

Apache Spark commented on SPARK-35195:
--

User 'sunchao' has created a pull request for this issue:
https://github.com/apache/spark/pull/32302

> Move InMemoryTable etc to org.apache.spark.sql.connector.catalog
> 
>
> Key: SPARK-35195
> URL: https://issues.apache.org/jira/browse/SPARK-35195
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Priority: Trivial
>
> Currently test classes such as {{InMemoryTable}} reside in 
> {{org.apache.spark.sql.connector}} rather than 
> {{org.apache.spark.sql.connector.catalog}}. We should move them to the latter to 
> match the interfaces they implement.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35196) DataFrameWriter.text support zstd compression

2021-04-22 Thread Leonard Lausen (Jira)
Leonard Lausen created SPARK-35196:
--

 Summary: DataFrameWriter.text support zstd compression
 Key: SPARK-35196
 URL: https://issues.apache.org/jira/browse/SPARK-35196
 Project: Spark
  Issue Type: Task
  Components: PySpark
Affects Versions: 3.1.1
Reporter: Leonard Lausen


[http://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.DataFrameWriter.text.html]
 specifies that only the following compression codecs are supported: `none, 
bzip2, gzip, lz4, snappy and deflate`

However, the RDD API supports zstd compression if users specify the 
'org.apache.hadoop.io.compress.ZStandardCodec' compressor in the saveAsTextFile 
method.

Please also expose zstd in the DataFrameWriter.
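
For reference, a sketch of the current workaround via the RDD API and, commented out, the DataFrameWriter call this ticket asks for. The output path is a placeholder and a Hadoop build with zstd support is assumed.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("zstd-text-demo").getOrCreate()
df = spark.range(10).selectExpr("CAST(id AS STRING) AS value")
out_path = "/tmp/zstd-text-demo"  # placeholder output path

# Workaround today: drop down to the RDD API and name the Hadoop codec class
# explicitly (requires a Hadoop build with zstd support).
df.rdd.map(lambda row: row.value).saveAsTextFile(
    out_path,
    compressionCodecClass="org.apache.hadoop.io.compress.ZStandardCodec")

# What this ticket asks for (not supported as of 3.1.1):
# df.write.option("compression", "zstd").text(out_path)
{code}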



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35195) Move InMemoryTable etc to org.apache.spark.sql.connector.catalog

2021-04-22 Thread Chao Sun (Jira)
Chao Sun created SPARK-35195:


 Summary: Move InMemoryTable etc to 
org.apache.spark.sql.connector.catalog
 Key: SPARK-35195
 URL: https://issues.apache.org/jira/browse/SPARK-35195
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 3.2.0
Reporter: Chao Sun


Currently test classes such as {{InMemoryTable}} reside in 
{{org.apache.spark.sql.connector}} rather than 
{{org.apache.spark.sql.connector.catalog}}. We should move them to the latter to 
match the interfaces they implement.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35187) Failure on minimal interval literal

2021-04-22 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-35187:


Assignee: angerszhu

> Failure on minimal interval literal
> ---
>
> Key: SPARK-35187
> URL: https://issues.apache.org/jira/browse/SPARK-35187
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: angerszhu
>Priority: Major
>
> If the sign '-' is inside the interval string, everything works fine after 
> https://github.com/apache/spark/commit/bb5459fb26b9d0d57eadee8b10b7488607eeaeb2:
> {code:sql}
> spark-sql> SELECT INTERVAL '-178956970-8' YEAR TO MONTH;
> -178956970-8
> {code}
> but a sign outside the interval string is not handled properly:
> {code:sql}
> spark-sql> SELECT INTERVAL -'178956970-8' YEAR TO MONTH;
> Error in query:
> Error parsing interval year-month string: integer overflow(line 1, pos 16)
> == SQL ==
> SELECT INTERVAL -'178956970-8' YEAR TO MONTH
> ^^^
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35187) Failure on minimal interval literal

2021-04-22 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-35187.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32296
[https://github.com/apache/spark/pull/32296]

> Failure on minimal interval literal
> ---
>
> Key: SPARK-35187
> URL: https://issues.apache.org/jira/browse/SPARK-35187
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.2.0
>
>
> If the sign '-' is inside the interval string, everything works fine after 
> https://github.com/apache/spark/commit/bb5459fb26b9d0d57eadee8b10b7488607eeaeb2:
> {code:sql}
> spark-sql> SELECT INTERVAL '-178956970-8' YEAR TO MONTH;
> -178956970-8
> {code}
> but a sign outside the interval string is not handled properly:
> {code:sql}
> spark-sql> SELECT INTERVAL -'178956970-8' YEAR TO MONTH;
> Error in query:
> Error parsing interval year-month string: integer overflow(line 1, pos 16)
> == SQL ==
> SELECT INTERVAL -'178956970-8' YEAR TO MONTH
> ^^^
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35194) Improve readability of NestingColumnAliasing

2021-04-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17329309#comment-17329309
 ] 

Apache Spark commented on SPARK-35194:
--

User 'karenfeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/32301

> Improve readability of NestingColumnAliasing
> 
>
> Key: SPARK-35194
> URL: https://issues.apache.org/jira/browse/SPARK-35194
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Karen Feng
>Priority: Major
>
> Refactor 
> https://github.com/apache/spark/blob/6c587d262748a2b469a0786c244e2e555f5f5a74/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala#L31
>  for readability.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35194) Improve readability of NestingColumnAliasing

2021-04-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35194:


Assignee: Apache Spark

> Improve readability of NestingColumnAliasing
> 
>
> Key: SPARK-35194
> URL: https://issues.apache.org/jira/browse/SPARK-35194
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Karen Feng
>Assignee: Apache Spark
>Priority: Major
>
> Refactor 
> https://github.com/apache/spark/blob/6c587d262748a2b469a0786c244e2e555f5f5a74/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala#L31
>  for readability.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35194) Improve readability of NestingColumnAliasing

2021-04-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35194:


Assignee: (was: Apache Spark)

> Improve readability of NestingColumnAliasing
> 
>
> Key: SPARK-35194
> URL: https://issues.apache.org/jira/browse/SPARK-35194
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Karen Feng
>Priority: Major
>
> Refactor 
> https://github.com/apache/spark/blob/6c587d262748a2b469a0786c244e2e555f5f5a74/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala#L31
>  for readability.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35194) Improve readability of NestingColumnAliasing

2021-04-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17329308#comment-17329308
 ] 

Apache Spark commented on SPARK-35194:
--

User 'karenfeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/32301

> Improve readability of NestingColumnAliasing
> 
>
> Key: SPARK-35194
> URL: https://issues.apache.org/jira/browse/SPARK-35194
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Karen Feng
>Priority: Major
>
> Refactor 
> https://github.com/apache/spark/blob/6c587d262748a2b469a0786c244e2e555f5f5a74/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala#L31
>  for readability.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35194) Improve readability of NestingColumnAliasing

2021-04-22 Thread Karen Feng (Jira)
Karen Feng created SPARK-35194:
--

 Summary: Improve readability of NestingColumnAliasing
 Key: SPARK-35194
 URL: https://issues.apache.org/jira/browse/SPARK-35194
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.1
Reporter: Karen Feng


Refactor 
https://github.com/apache/spark/blob/6c587d262748a2b469a0786c244e2e555f5f5a74/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NestedColumnAliasing.scala#L31
 for readability.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35040) Remove Spark-version related codes from test codes.

2021-04-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35040:


Assignee: (was: Apache Spark)

> Remove Spark-version related codes from test codes.
> ---
>
> Key: SPARK-35040
> URL: https://issues.apache.org/jira/browse/SPARK-35040
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> There are several places that check the PySpark version and switch the tests, 
> but those checks are no longer necessary.
> We should remove them.
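
For illustration only, a sketch (not taken from the Spark test suite) of the kind of version-gated test this ticket removes; the class name and version threshold are hypothetical.

{code:python}
import unittest
from distutils.version import LooseVersion

import pyspark


class VersionGatedTest(unittest.TestCase):
    # The pattern being removed: branch or skip tests depending on the
    # installed PySpark version.
    @unittest.skipIf(
        LooseVersion(pyspark.__version__) < LooseVersion("3.0"),
        "requires PySpark 3.0 or above")
    def test_something(self):
        self.assertEqual(1 + 1, 2)
{code}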



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35040) Remove Spark-version related codes from test codes.

2021-04-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17329293#comment-17329293
 ] 

Apache Spark commented on SPARK-35040:
--

User 'xinrong-databricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/32300

> Remove Spark-version related codes from test codes.
> ---
>
> Key: SPARK-35040
> URL: https://issues.apache.org/jira/browse/SPARK-35040
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> There are several places that check the PySpark version and switch the tests, 
> but those checks are no longer necessary.
> We should remove them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35040) Remove Spark-version related codes from test codes.

2021-04-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35040:


Assignee: Apache Spark

> Remove Spark-version related codes from test codes.
> ---
>
> Key: SPARK-35040
> URL: https://issues.apache.org/jira/browse/SPARK-35040
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Assignee: Apache Spark
>Priority: Major
>
> There are several places that check the PySpark version and switch the tests, 
> but those checks are no longer necessary.
> We should remove them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34198) Add RocksDB StateStore as external module

2021-04-22 Thread Yuanjian Li (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17329276#comment-17329276
 ] 

Yuanjian Li commented on SPARK-34198:
-

[~viirya] [~kabhwan] Makes sense. I should bring this to the community to 
confirm and list the reasons. I thought it was not a design decision but an 
implementation detail.

> Add RocksDB StateStore as external module
> -
>
> Key: SPARK-34198
> URL: https://issues.apache.org/jira/browse/SPARK-34198
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Priority: Major
>
> Currently Spark SS has only one built-in StateStore implementation, 
> HDFSBackedStateStore, which uses an in-memory map to store state rows. As 
> there are more and more streaming applications, some of them require large 
> state in stateful operations such as streaming aggregation and join.
> Several other major streaming frameworks already use RocksDB for state 
> management, so it is a proven choice for large state usage. But Spark SS 
> still lacks a built-in state store for this requirement.
> We would like to explore the possibility of adding a RocksDB-based StateStore 
> to Spark SS. Given the concern about adding RocksDB as a direct dependency, 
> our plan is to add this StateStore as an external module first.
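
As a rough sketch of how such an external module would be consumed, assuming only that the existing {{spark.sql.streaming.stateStore.providerClass}} setting is used; the provider class name below is hypothetical.

{code:python}
from pyspark.sql import SparkSession

# spark.sql.streaming.stateStore.providerClass is an existing Spark setting;
# the class name below is a placeholder for whatever the external module ships.
spark = (SparkSession.builder
         .appName("external-state-store-sketch")
         .config("spark.sql.streaming.stateStore.providerClass",
                 "org.example.streaming.state.RocksDBStateStoreProvider")
         .getOrCreate())
{code}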



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35040) Remove Spark-version related codes from test codes.

2021-04-22 Thread Xinrong Meng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17329268#comment-17329268
 ] 

Xinrong Meng commented on SPARK-35040:
--

May I work on this ticket?

> Remove Spark-version related codes from test codes.
> ---
>
> Key: SPARK-35040
> URL: https://issues.apache.org/jira/browse/SPARK-35040
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> There are several places that check the PySpark version and switch the tests, 
> but those checks are no longer necessary.
> We should remove them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35193) Scala/Java compatibility issue Re: how to use externalResource in java transformer from Scala Transformer?

2021-04-22 Thread Arthur (Jira)
Arthur created SPARK-35193:
--

 Summary: Scala/Java compatibility issue Re: how to use 
externalResource in java transformer from Scala Transformer?
 Key: SPARK-35193
 URL: https://issues.apache.org/jira/browse/SPARK-35193
 Project: Spark
  Issue Type: Bug
  Components: Java API, ML
Affects Versions: 3.1.1
Reporter: Arthur


I am trying to make a custom transformer use an externalResource, as it 
requires a large table to do the transformation. I'm not super familiar with 
Scala syntax, but from snippets found on the internet I think I've made a 
proper Java implementation. I am running into the following error:

Exception in thread "main" java.lang.IllegalArgumentException: requirement 
failed: Param HardMatchDetector_d95b8f699114__externalResource does not belong 
to HardMatchDetector_d95b8f699114.
 at scala.Predef$.require(Predef.scala:281)
 at org.apache.spark.ml.param.Params.shouldOwn(params.scala:851)
 at org.apache.spark.ml.param.Params.set(params.scala:727)
 at org.apache.spark.ml.param.Params.set$(params.scala:726)
 at org.apache.spark.ml.PipelineStage.set(Pipeline.scala:41)
 at org.apache.spark.ml.param.Params.set(params.scala:713)
 at org.apache.spark.ml.param.Params.set$(params.scala:712)
 at org.apache.spark.ml.PipelineStage.set(Pipeline.scala:41)
 at HardMatchDetector.setResource(HardMatchDetector.java:45)

 

Code as follows:
{code:java}
public class HardMatchDetector extends Transformer
    implements DefaultParamsWritable, DefaultParamsReadable, Serializable {

  public String inputColumn = "value";
  public String outputColumn = "hardMatches";
  private ExternalResourceParam resourceParam = new ExternalResourceParam(this,
      "externalResource",
      "external resource, parquet file with 2 columns, one names and one wordcount");
  private String uid;

  public HardMatchDetector setResource(final ExternalResource value) {
    return (HardMatchDetector) this.set(this.resourceParam, value);
  }

  public HardMatchDetector setResource(final String path) {
    return this.setResource(new ExternalResource(path, ReadAs.TEXT(), new HashMap()));
  }

  @Override
  public String uid() {
    return getUid();
  }

  private String getUid() {
    if (uid == null) {
      uid = Identifiable$.MODULE$.randomUID("HardMatchDetector");
    }
    return uid;
  }

  @Override
  public Dataset transform(final Dataset dataset) {
    return dataset;
  }

  @Override
  public StructType transformSchema(StructType schema) {
    return schema.add(DataTypes.createStructField(outputColumn, DataTypes.StringType, true));
  }

  @Override
  public Transformer copy(ParamMap extra) {
    return new HardMatchDetector();
  }
}

public class HardMatcherTest extends AbstractSparkTest {
  @Test
  public void test() {
    var hardMatcher = new HardMatchDetector().setResource(pathName);
  }
}
{code}
 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34483) Make check point compress codec configurable

2021-04-22 Thread Holden Karau (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17329247#comment-17329247
 ] 

Holden Karau commented on SPARK-34483:
--

Are you still working on this?

> Make check point compress codec configurable 
> -
>
> Key: SPARK-34483
> URL: https://issues.apache.org/jira/browse/SPARK-34483
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> Make check point compress codec configurable
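
For context, a sketch of the knobs available today, under the assumption that checkpoint compression follows the generic codec setting: {{spark.checkpoint.compress}} toggles compression and {{spark.io.compression.codec}} picks the codec for it (and for much else), which is why a dedicated checkpoint-specific codec config is being requested.

{code:python}
from pyspark.sql import SparkSession

# Today: checkpoint compression is an on/off switch, and the codec comes from
# the generic I/O codec setting, which affects far more than checkpoints.
spark = (SparkSession.builder
         .appName("checkpoint-codec-sketch")
         .config("spark.checkpoint.compress", "true")
         .config("spark.io.compression.codec", "zstd")
         .getOrCreate())
{code}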



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34953) inferSchema for type date

2021-04-22 Thread Holden Karau (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Holden Karau updated SPARK-34953:
-
Component/s: SQL

> inferSchema for type date 
> --
>
> Key: SPARK-34953
> URL: https://issues.apache.org/jira/browse/SPARK-34953
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, SQL
>Affects Versions: 3.1.1
>Reporter: Tomas Hudik
>Priority: Minor
>
> Reading a CSV file with `option("inferSchema", "true")` doesn't work for the 
> `date` type. E.g. 
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVInferSchema.scala#L101:L119]
>  can infer only `Timestamp`, not `Date`.
>  
> Datasets often contain `Date` columns, so reading a file into Spark should be 
> able to infer the `Date` type for a column.
> For now, only work-arounds (e.g. 
> [https://stackoverflow.com/a/46595057/1408096] or 
> [https://stackoverflow.com/questions/66935214/spark-reading-csv-with-specified-date-format])
> are possible.
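
A sketch of the usual workaround for now, assuming an active {{spark}} session: supply an explicit schema (plus {{dateFormat}} if needed) instead of relying on {{inferSchema}}; the file path and column names are made up.

{code:python}
from pyspark.sql.types import DateType, StringType, StructField, StructType

# Hypothetical file and columns; the point is that DateType comes from the
# user-supplied schema, not from inferSchema.
schema = StructType([
    StructField("name", StringType(), True),
    StructField("born", DateType(), True),
])

df = (spark.read
      .option("header", "true")
      .option("dateFormat", "yyyy-MM-dd")
      .schema(schema)
      .csv("/path/to/people.csv"))
{code}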



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34997) Make graceful shutdown controllable by property

2021-04-22 Thread Holden Karau (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17329244#comment-17329244
 ] 

Holden Karau commented on SPARK-34997:
--

We could add a configuration that controls for which signals Spark enables 
graceful shutdown.

> Make graceful shutdown controllable by property
> ---
>
> Key: SPARK-34997
> URL: https://issues.apache.org/jira/browse/SPARK-34997
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.1
>Reporter: Oleg Kuznetsov
>Priority: Major
>
> I'm using Spark in embedded mode, i.e. Spark is one of many components in my 
> app.
> Currently, when the JVM receives SIGTERM, Spark immediately starts shutting 
> down.
>  
> Expected: provide client code with an option to control how and when to shut 
> down the Spark app, i.e. create a config to disable the unconditional shutdown 
> performed by Spark itself.
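
Purely as an illustration of the request, a sketch of how an embedding application might opt out of Spark's signal handling if such a switch existed; the config name below is hypothetical and does not exist in Spark.

{code:python}
from pyspark.sql import SparkSession

# Hypothetical setting -- NOT a real Spark config; it only makes the request
# concrete: let the embedding application, not Spark, react to SIGTERM.
spark = (SparkSession.builder
         .appName("embedded-app")
         .config("spark.shutdown.handleExternalSignals", "false")
         .getOrCreate())
{code}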



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34363) Allow users to configure a maximum amount of remote shuffle block storage

2021-04-22 Thread Holden Karau (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Holden Karau resolved SPARK-34363.
--
Fix Version/s: 3.2.0
 Assignee: Holden Karau
   Resolution: Fixed

> Allow users to configure a maximum amount of remote shuffle block storage
> -
>
> Key: SPARK-34363
> URL: https://issues.apache.org/jira/browse/SPARK-34363
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Holden Karau
>Assignee: Holden Karau
>Priority: Major
> Fix For: 3.2.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34079) Improvement CTE table scan

2021-04-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34079:


Assignee: (was: Apache Spark)

> Improvement CTE table scan
> --
>
> Key: SPARK-34079
> URL: https://issues.apache.org/jira/browse/SPARK-34079
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
>
> Prepare table:
> {code:sql}
> CREATE TABLE store_sales (  ss_sold_date_sk INT,  ss_sold_time_sk INT,  
> ss_item_sk INT,  ss_customer_sk INT,  ss_cdemo_sk INT,  ss_hdemo_sk INT,  
> ss_addr_sk INT,  ss_store_sk INT,  ss_promo_sk INT,  ss_ticket_number INT,  
> ss_quantity INT,  ss_wholesale_cost DECIMAL(7,2),  ss_list_price 
> DECIMAL(7,2),  ss_sales_price DECIMAL(7,2),  ss_ext_discount_amt 
> DECIMAL(7,2),  ss_ext_sales_price DECIMAL(7,2),  ss_ext_wholesale_cost 
> DECIMAL(7,2),  ss_ext_list_price DECIMAL(7,2),  ss_ext_tax DECIMAL(7,2),  
> ss_coupon_amt DECIMAL(7,2),  ss_net_paid DECIMAL(7,2),  ss_net_paid_inc_tax 
> DECIMAL(7,2),ss_net_profit DECIMAL(7,2));
> CREATE TABLE reason (  r_reason_sk INT,  r_reason_id varchar(255),  
> r_reason_desc varchar(255));
> {code}
> SQL:
> {code:sql}
> WITH bucket_result AS (
> SELECT
> CASE WHEN (count (CASE WHEN ss_quantity BETWEEN 1 AND 20 THEN ss_quantity 
> END)) > 62316685
>   THEN (avg(CASE WHEN ss_quantity BETWEEN 1 AND 20 THEN 
> ss_ext_discount_amt END))
> ELSE (avg(CASE WHEN ss_quantity BETWEEN 1 AND 20 THEN ss_net_paid END)) 
> END bucket1,
> CASE WHEN (count (CASE WHEN ss_quantity BETWEEN 21 AND 40 THEN 
> ss_quantity END)) > 19045798
>   THEN (avg(CASE WHEN ss_quantity BETWEEN 21 AND 40 THEN 
> ss_ext_discount_amt END))
> ELSE (avg(CASE WHEN ss_quantity BETWEEN 21 AND 40 THEN ss_net_paid END)) 
> END bucket2,
> CASE WHEN (count (CASE WHEN ss_quantity BETWEEN 41 AND 60 THEN 
> ss_quantity END)) > 365541424
>   THEN (avg(CASE WHEN ss_quantity BETWEEN 41 AND 60 THEN 
> ss_ext_discount_amt END))
> ELSE (avg(CASE WHEN ss_quantity BETWEEN 41 AND 60 THEN ss_net_paid END)) 
> END bucket3,
> CASE WHEN (count (CASE WHEN ss_quantity BETWEEN 61 AND 80 THEN 
> ss_quantity END)) > 19045798
>   THEN (avg(CASE WHEN ss_quantity BETWEEN 61 AND 80 THEN 
> ss_ext_discount_amt END))
> ELSE (avg(CASE WHEN ss_quantity BETWEEN 61 AND 80 THEN ss_net_paid END)) 
> END bucket4,
> CASE WHEN (count (CASE WHEN ss_quantity BETWEEN 81 AND 100 THEN 
> ss_quantity END)) > 365541424
>   THEN (avg(CASE WHEN ss_quantity BETWEEN 81 AND 100 THEN 
> ss_ext_discount_amt END))
> ELSE (avg(CASE WHEN ss_quantity BETWEEN 81 AND 100 THEN ss_net_paid END)) 
> END bucket5
>   FROM store_sales
> )
> SELECT
>   (SELECT bucket1 FROM bucket_result) as bucket1,
>   (SELECT bucket2 FROM bucket_result) as bucket2,
>   (SELECT bucket3 FROM bucket_result) as bucket3,
>   (SELECT bucket4 FROM bucket_result) as bucket4,
>   (SELECT bucket5 FROM bucket_result) as bucket5
> FROM reason
> WHERE r_reason_sk = 1;
> {code}
> Plan of Spark SQL:
> {noformat}
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- Project [Subquery subquery#0, [id=#23] AS bucket1#1, Subquery subquery#2, 
> [id=#34] AS bucket2#3, Subquery subquery#4, [id=#45] AS bucket3#5, Subquery 
> subquery#6, [id=#56] AS bucket4#7, Subquery subquery#8, [id=#67] AS bucket5#9]
>:  :- Subquery subquery#0, [id=#23]
>:  :  +- AdaptiveSparkPlan isFinalPlan=false
>:  : +- HashAggregate(keys=[], functions=[count(if (((ss_quantity#28 
> >= 1) AND (ss_quantity#28 <= 20))) ss_quantity#28 else null), 
> avg(UnscaledValue(if (((ss_quantity#28 >= 1) AND (ss_quantity#28 <= 20))) 
> ss_ext_discount_amt#32 else null)), avg(UnscaledValue(if (((ss_quantity#28 >= 
> 1) AND (ss_quantity#28 <= 20))) ss_net_paid#38 else null))])
>:  :+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [id=#21]
>:  :   +- HashAggregate(keys=[], functions=[partial_count(if 
> (((ss_quantity#28 >= 1) AND (ss_quantity#28 <= 20))) ss_quantity#28 else 
> null), partial_avg(UnscaledValue(if (((ss_quantity#28 >= 1) AND 
> (ss_quantity#28 <= 20))) ss_ext_discount_amt#32 else null)), 
> partial_avg(UnscaledValue(if (((ss_quantity#28 >= 1) AND (ss_quantity#28 <= 
> 20))) ss_net_paid#38 else null))])
>:  :  +- FileScan parquet 
> default.store_sales[ss_quantity#28,ss_ext_discount_amt#32,ss_net_paid#38] 
> Batched: true, DataFilters: [], Format: Parquet, Location: 
> InMemoryFileIndex[file:/Users/yumwang/spark/SPARK-28169/spark-warehouse/org.apache.spark.sql.Data...,
>  PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct
>:  :- Subquery subquery#2, [id=#34]
>:  :  +- AdaptiveSparkPlan isFinalPlan=false
>:  : +- HashAggregate(keys=[], functions=[count(if 

[jira] [Comment Edited] (SPARK-34883) Setting CSV reader option "multiLine" to "true" causes URISyntaxException when colon is in file path

2021-04-22 Thread Brady Tello (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17329132#comment-17329132
 ] 

Brady Tello edited comment on SPARK-34883 at 4/22/21, 1:52 PM:
---

[~Vikas_Yadav]

You don't have a colon in your `inputFile` path.  See the following code that 
contains a colon in the path to the dataset.  It fails with a 
URISyntaxException as expected:
{code:java}
>>> inputFile = "/Users/home_dir/Workspaces/datasets/with:colon/iris.csv" 
>>> tempDF = spark.read.csv(inputFile, multiLine=True) 
Traceback (most recent call last): File "", line 1, in  File 
"/Users/home_dir/Workspaces/spark/python/pyspark/sql/readwriter.py", line 737, 
in csv return 
self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path))) File 
"/Users/home_dir/Workspaces/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py",
 line 1304, in __call__ File 
"/Users/home_dir/Workspaces/spark/python/pyspark/sql/utils.py", line 117, in 
deco raise converted from None pyspark.sql.utils.IllegalArgumentException: 
java.net.URISyntaxException: Relative path in absolute URI: with:colon
{code}


was (Author: bctello8):
[~Vikas_Yadav]

You don't have a colon in your `inputFile` path.  See the following code that 
contains a colon in the path to the dataset.  It fails with a 
URISyntaxException as expected:
{code:java}
>>> inputFile = "/Users/home_dir/Workspaces/datasets/with:colon/iris.csv" 
>>> tempDF = spark.read.csv(inputFile, multiLine=True) 
Traceback (most recent call last): File "", line 1, in  File 
"/Users/brady.tello/Workspaces/spark/python/pyspark/sql/readwriter.py", line 
737, in csv return 
self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path))) File 
"/Users/brady.tello/Workspaces/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py",
 line 1304, in __call__ File 
"/Users/brady.tello/Workspaces/spark/python/pyspark/sql/utils.py", line 117, in 
deco raise converted from None pyspark.sql.utils.IllegalArgumentException: 
java.net.URISyntaxException: Relative path in absolute URI: with:colon
{code}

> Setting CSV reader option "multiLine" to "true" causes URISyntaxException 
> when colon is in file path
> 
>
> Key: SPARK-34883
> URL: https://issues.apache.org/jira/browse/SPARK-34883
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0, 3.1.1
>Reporter: Brady Tello
>Priority: Major
>
> Setting the CSV reader's "multiLine" option to "True" throws the following 
> exception when a ':' character is in the file path.
>  
> {code:java}
> java.net.URISyntaxException: Relative path in absolute URI: test:dir
> {code}
> I've tested this in both Spark 3.0.0 and Spark 3.1.1 and I get the same error 
> whether I use Scala, Python, or SQL.
> The following code works fine:
>  
> {code:java}
> csvFile = "/FileStore/myDir/test:dir/pageviews_by_second.tsv" 
> tempDF = (spark.read.option("sep", "\t").csv(csvFile)
> {code}
> While the following code fails:
>  
> {code:java}
> csvFile = "/FileStore/myDir/test:dir/pageviews_by_second.tsv"
> tempDF = (spark.read.option("sep", "\t").option("multiLine", 
> "True").csv(csvFile)
> {code}
> Full Stack Trace from Python:
>  
> {code:java}
> --- 
> IllegalArgumentException Traceback (most recent call last)  
> in  
> 3 csvFile = "/FileStore/myDir/test:dir/pageviews_by_second.tsv" 
> 4 
> > 5  tempDF = (spark.read.option("sep", "\t").option("multiLine", "True") 
> /databricks/spark/python/pyspark/sql/readwriter.py in csv(self, path, schema, 
> sep, encoding, quote, escape, comment, header, inferSchema, 
> ignoreLeadingWhiteSpace, ignoreTrailingWhiteSpace, nullValue, nanValue, 
> positiveInf, negativeInf, dateFormat, timestampFormat, maxColumns, 
> maxCharsPerColumn, maxMalformedLogPerPartition, mode, 
> columnNameOfCorruptRecord, multiLine, charToEscapeQuoteEscaping, 
> samplingRatio, enforceSchema, emptyValue, locale, lineSep, pathGlobFilter, 
> recursiveFileLookup, modifiedBefore, modifiedAfter, unescapedQuoteHandling) 
> 735 path = [path] 
> 736 if type(path) == list: 
> --> 737 return 
> self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path))) 
> 738 elif isinstance(path, RDD): 
> 739 def func(iterator): 
> /databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in 
> __call__(self, *args) 
> 1302 
> 1303 answer = self.gateway_client.send_command(command) 
> -> 1304 return_value = get_return_value( 
> 1305 answer, self.gateway_client, self.target_id, self.name) 
> 1306 
> /databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw) 
> 114 # Hide where the exception came from that shows a non-Pythonic 
> 115 # JVM exception 

[jira] [Updated] (SPARK-33121) Spark Streaming 3.1.1 hangs on shutdown

2021-04-22 Thread Dmitry Tverdokhleb (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitry Tverdokhleb updated SPARK-33121:
---
Priority: Critical  (was: Major)

> Spark Streaming 3.1.1 hangs on shutdown
> ---
>
> Key: SPARK-33121
> URL: https://issues.apache.org/jira/browse/SPARK-33121
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 3.1.1
>Reporter: Dmitry Tverdokhleb
>Priority: Critical
>
> Hi. I am trying to migrate from Spark 2.4.5 to 3.1.1 and there is a problem 
> with graceful shutdown.
> Config parameter "spark.streaming.stopGracefullyOnShutdown" is set as "true".
> Here is the code:
> {code:java}
> inputStream.foreachRDD { rdd =>
>   rdd.foreachPartition { _ =>
>     Thread.sleep(5000)
>   }
> }
> {code}
> I send a SIGTERM signal to stop the Spark streaming job, and after the sleep 
> an exception arises:
> {noformat}
> streaming-agg-tds-data_1  | java.util.concurrent.RejectedExecutionException: 
> Task org.apache.spark.executor.Executor$TaskRunner@7ca7f0b8 rejected from 
> java.util.concurrent.ThreadPoolExecutor@2474219c[Terminated, pool size = 0, 
> active threads = 0, queued tasks = 0, completed tasks = 1]
> streaming-agg-tds-data_1  | at 
> java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063)
> streaming-agg-tds-data_1  | at 
> java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830)
> streaming-agg-tds-data_1  | at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379)
> streaming-agg-tds-data_1  | at 
> org.apache.spark.executor.Executor.launchTask(Executor.scala:270)
> streaming-agg-tds-data_1  | at 
> org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93)
> streaming-agg-tds-data_1  | at 
> org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91)
> streaming-agg-tds-data_1  | at 
> scala.collection.Iterator.foreach(Iterator.scala:941)
> streaming-agg-tds-data_1  | at 
> scala.collection.Iterator.foreach$(Iterator.scala:941)
> streaming-agg-tds-data_1  | at 
> scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
> streaming-agg-tds-data_1  | at 
> scala.collection.IterableLike.foreach(IterableLike.scala:74)
> streaming-agg-tds-data_1  | at 
> scala.collection.IterableLike.foreach$(IterableLike.scala:73)
> streaming-agg-tds-data_1  | at 
> scala.collection.AbstractIterable.foreach(Iterable.scala:56)
> streaming-agg-tds-data_1  | at 
> org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91)
> streaming-agg-tds-data_1  | at 
> org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:68)
> streaming-agg-tds-data_1  | at 
> org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
> streaming-agg-tds-data_1  | at 
> org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
> streaming-agg-tds-data_1  | at 
> org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
> streaming-agg-tds-data_1  | at 
> org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
> streaming-agg-tds-data_1  | at 
> org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
> streaming-agg-tds-data_1  | at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> streaming-agg-tds-data_1  | at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> streaming-agg-tds-data_1  | at java.lang.Thread.run(Thread.java:748)
> streaming-agg-tds-data_1  | 2021-04-22 13:33:41 WARN  JobGenerator - Timed 
> out while stopping the job generator (timeout = 1)
> streaming-agg-tds-data_1  | 2021-04-22 13:33:41 INFO  JobGenerator - Waited 
> for jobs to be processed and checkpoints to be written
> streaming-agg-tds-data_1  | 2021-04-22 13:33:41 INFO  JobGenerator - Stopped 
> JobGenerator{noformat}
> After this exception and the "JobGenerator - Stopped JobGenerator" log, 
> streaming freezes and only halts after the timeout (config parameter 
> "hadoop.service.shutdown.timeout") expires.
> By contrast, there is no problem with graceful shutdown in Spark 2.4.5.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34079) Improvement CTE table scan

2021-04-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34079:


Assignee: Apache Spark

> Improvement CTE table scan
> --
>
> Key: SPARK-34079
> URL: https://issues.apache.org/jira/browse/SPARK-34079
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>
> Prepare table:
> {code:sql}
> CREATE TABLE store_sales (  ss_sold_date_sk INT,  ss_sold_time_sk INT,  
> ss_item_sk INT,  ss_customer_sk INT,  ss_cdemo_sk INT,  ss_hdemo_sk INT,  
> ss_addr_sk INT,  ss_store_sk INT,  ss_promo_sk INT,  ss_ticket_number INT,  
> ss_quantity INT,  ss_wholesale_cost DECIMAL(7,2),  ss_list_price 
> DECIMAL(7,2),  ss_sales_price DECIMAL(7,2),  ss_ext_discount_amt 
> DECIMAL(7,2),  ss_ext_sales_price DECIMAL(7,2),  ss_ext_wholesale_cost 
> DECIMAL(7,2),  ss_ext_list_price DECIMAL(7,2),  ss_ext_tax DECIMAL(7,2),  
> ss_coupon_amt DECIMAL(7,2),  ss_net_paid DECIMAL(7,2),  ss_net_paid_inc_tax 
> DECIMAL(7,2),ss_net_profit DECIMAL(7,2));
> CREATE TABLE reason (  r_reason_sk INT,  r_reason_id varchar(255),  
> r_reason_desc varchar(255));
> {code}
> SQL:
> {code:sql}
> WITH bucket_result AS (
> SELECT
> CASE WHEN (count (CASE WHEN ss_quantity BETWEEN 1 AND 20 THEN ss_quantity 
> END)) > 62316685
>   THEN (avg(CASE WHEN ss_quantity BETWEEN 1 AND 20 THEN 
> ss_ext_discount_amt END))
> ELSE (avg(CASE WHEN ss_quantity BETWEEN 1 AND 20 THEN ss_net_paid END)) 
> END bucket1,
> CASE WHEN (count (CASE WHEN ss_quantity BETWEEN 21 AND 40 THEN 
> ss_quantity END)) > 19045798
>   THEN (avg(CASE WHEN ss_quantity BETWEEN 21 AND 40 THEN 
> ss_ext_discount_amt END))
> ELSE (avg(CASE WHEN ss_quantity BETWEEN 21 AND 40 THEN ss_net_paid END)) 
> END bucket2,
> CASE WHEN (count (CASE WHEN ss_quantity BETWEEN 41 AND 60 THEN 
> ss_quantity END)) > 365541424
>   THEN (avg(CASE WHEN ss_quantity BETWEEN 41 AND 60 THEN 
> ss_ext_discount_amt END))
> ELSE (avg(CASE WHEN ss_quantity BETWEEN 41 AND 60 THEN ss_net_paid END)) 
> END bucket3,
> CASE WHEN (count (CASE WHEN ss_quantity BETWEEN 61 AND 80 THEN 
> ss_quantity END)) > 19045798
>   THEN (avg(CASE WHEN ss_quantity BETWEEN 61 AND 80 THEN 
> ss_ext_discount_amt END))
> ELSE (avg(CASE WHEN ss_quantity BETWEEN 61 AND 80 THEN ss_net_paid END)) 
> END bucket4,
> CASE WHEN (count (CASE WHEN ss_quantity BETWEEN 81 AND 100 THEN 
> ss_quantity END)) > 365541424
>   THEN (avg(CASE WHEN ss_quantity BETWEEN 81 AND 100 THEN 
> ss_ext_discount_amt END))
> ELSE (avg(CASE WHEN ss_quantity BETWEEN 81 AND 100 THEN ss_net_paid END)) 
> END bucket5
>   FROM store_sales
> )
> SELECT
>   (SELECT bucket1 FROM bucket_result) as bucket1,
>   (SELECT bucket2 FROM bucket_result) as bucket2,
>   (SELECT bucket3 FROM bucket_result) as bucket3,
>   (SELECT bucket4 FROM bucket_result) as bucket4,
>   (SELECT bucket5 FROM bucket_result) as bucket5
> FROM reason
> WHERE r_reason_sk = 1;
> {code}
> Plan of Spark SQL:
> {noformat}
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- Project [Subquery subquery#0, [id=#23] AS bucket1#1, Subquery subquery#2, 
> [id=#34] AS bucket2#3, Subquery subquery#4, [id=#45] AS bucket3#5, Subquery 
> subquery#6, [id=#56] AS bucket4#7, Subquery subquery#8, [id=#67] AS bucket5#9]
>:  :- Subquery subquery#0, [id=#23]
>:  :  +- AdaptiveSparkPlan isFinalPlan=false
>:  : +- HashAggregate(keys=[], functions=[count(if (((ss_quantity#28 
> >= 1) AND (ss_quantity#28 <= 20))) ss_quantity#28 else null), 
> avg(UnscaledValue(if (((ss_quantity#28 >= 1) AND (ss_quantity#28 <= 20))) 
> ss_ext_discount_amt#32 else null)), avg(UnscaledValue(if (((ss_quantity#28 >= 
> 1) AND (ss_quantity#28 <= 20))) ss_net_paid#38 else null))])
>:  :+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [id=#21]
>:  :   +- HashAggregate(keys=[], functions=[partial_count(if 
> (((ss_quantity#28 >= 1) AND (ss_quantity#28 <= 20))) ss_quantity#28 else 
> null), partial_avg(UnscaledValue(if (((ss_quantity#28 >= 1) AND 
> (ss_quantity#28 <= 20))) ss_ext_discount_amt#32 else null)), 
> partial_avg(UnscaledValue(if (((ss_quantity#28 >= 1) AND (ss_quantity#28 <= 
> 20))) ss_net_paid#38 else null))])
>:  :  +- FileScan parquet 
> default.store_sales[ss_quantity#28,ss_ext_discount_amt#32,ss_net_paid#38] 
> Batched: true, DataFilters: [], Format: Parquet, Location: 
> InMemoryFileIndex[file:/Users/yumwang/spark/SPARK-28169/spark-warehouse/org.apache.spark.sql.Data...,
>  PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct
>:  :- Subquery subquery#2, [id=#34]
>:  :  +- AdaptiveSparkPlan isFinalPlan=false
>:  : +- HashAggregate(keys=[], 

[jira] [Assigned] (SPARK-35110) Handle ANSI intervals in WindowExecBase

2021-04-22 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-35110:


Assignee: jiaan.geng

> Handle ANSI intervals in WindowExecBase
> ---
>
> Key: SPARK-35110
> URL: https://issues.apache.org/jira/browse/SPARK-35110
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: jiaan.geng
>Priority: Major
>
> Handle YearMonthIntervalType and DayTimeIntervalType in createBoundOrdering():
> https://github.com/apache/spark/blob/0494dc90af48ce7da0625485a4dc6917a244d580/sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowExecBase.scala#L97-L99



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33121) Spark does not shutdown gracefully

2021-04-22 Thread Dmitry Tverdokhleb (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitry Tverdokhleb updated SPARK-33121:
---
Description: 
Hi. I am trying to migrate from Spark 2.4.5 to 3.1.1 and there is a problem 
with graceful shutdown.

Config parameter "spark.streaming.stopGracefullyOnShutdown" is set as "true".

Here is the code:
{code:java}
inputStream.foreachRDD { rdd =>
  rdd.foreachPartition { _ =>
    Thread.sleep(5000)
  }
}
{code}
I send a SIGTERM signal to stop the Spark streaming job, and after the sleep 
an exception arises:
{noformat}
streaming-agg-tds-data_1  | java.util.concurrent.RejectedExecutionException: 
Task org.apache.spark.executor.Executor$TaskRunner@7ca7f0b8 rejected from 
java.util.concurrent.ThreadPoolExecutor@2474219c[Terminated, pool size = 0, 
active threads = 0, queued tasks = 0, completed tasks = 1]
streaming-agg-tds-data_1  | at 
java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063)
streaming-agg-tds-data_1  | at 
java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830)
streaming-agg-tds-data_1  | at 
java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379)
streaming-agg-tds-data_1  | at 
org.apache.spark.executor.Executor.launchTask(Executor.scala:270)
streaming-agg-tds-data_1  | at 
org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93)
streaming-agg-tds-data_1  | at 
org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91)
streaming-agg-tds-data_1  | at 
scala.collection.Iterator.foreach(Iterator.scala:941)
streaming-agg-tds-data_1  | at 
scala.collection.Iterator.foreach$(Iterator.scala:941)
streaming-agg-tds-data_1  | at 
scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
streaming-agg-tds-data_1  | at 
scala.collection.IterableLike.foreach(IterableLike.scala:74)
streaming-agg-tds-data_1  | at 
scala.collection.IterableLike.foreach$(IterableLike.scala:73)
streaming-agg-tds-data_1  | at 
scala.collection.AbstractIterable.foreach(Iterable.scala:56)
streaming-agg-tds-data_1  | at 
org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91)
streaming-agg-tds-data_1  | at 
org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:68)
streaming-agg-tds-data_1  | at 
org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
streaming-agg-tds-data_1  | at 
org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
streaming-agg-tds-data_1  | at 
org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
streaming-agg-tds-data_1  | at 
org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
streaming-agg-tds-data_1  | at 
org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
streaming-agg-tds-data_1  | at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
streaming-agg-tds-data_1  | at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
streaming-agg-tds-data_1  | at java.lang.Thread.run(Thread.java:748)
streaming-agg-tds-data_1  | 2021-04-22 13:33:41 WARN  JobGenerator - Timed out 
while stopping the job generator (timeout = 1)
streaming-agg-tds-data_1  | 2021-04-22 13:33:41 INFO  JobGenerator - Waited for 
jobs to be processed and checkpoints to be written
streaming-agg-tds-data_1  | 2021-04-22 13:33:41 INFO  JobGenerator - Stopped 
JobGenerator{noformat}
After this exception and the "JobGenerator - Stopped JobGenerator" log, 
streaming freezes and only halts after the timeout (config parameter 
"hadoop.service.shutdown.timeout") expires.

By contrast, there is no problem with graceful shutdown in Spark 2.4.5.
  
  
  

  was:
Hi. I am trying to migrate from spark 2.4.5 to 3.0.1 and there is a problem in 
graceful shutdown.

Config parameter "spark.streaming.stopGracefullyOnShutdown" is set as "true".

Here is the code:
{code:java}
inputStream.foreachRDD {
  rdd =>
rdd
  .foreachPartition {
Thread.sleep(5000)
  }
}
{code}
I send a SIGTERM signal to stop the spark streaming, but exception arrises:
{noformat}
2020-10-12 14:12:29 ERROR Inbox - Ignoring error
java.util.concurrent.RejectedExecutionException: Task 
org.apache.spark.executor.Executor$TaskRunner@68a46d91 rejected from 
java.util.concurrent.ThreadPoolExecutor@2b13eeda[Terminated, pool size = 0, 
active threads = 0, queued tasks = 0, completed tasks = 6]
at 
java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063)
at 
java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830)
at 

[jira] [Commented] (SPARK-28098) Native ORC reader doesn't support subdirectories with Hive tables

2021-04-22 Thread Yu-Tang Lin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17329120#comment-17329120
 ] 

Yu-Tang Lin commented on SPARK-28098:
-

Hi [~ddrinka], I have already opened a PR to resolve this issue; perhaps you 
could take a look at it?

> Native ORC reader doesn't support subdirectories with Hive tables
> -
>
> Key: SPARK-28098
> URL: https://issues.apache.org/jira/browse/SPARK-28098
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Douglas Drinka
>Priority: Major
>
> The Hive ORC reader supports recursive directory reads from S3.  Spark's 
> native ORC reader supports recursive directory reads, but not when used with 
> Hive.
>  
> {code:java}
> val testData = List(1,2,3,4,5)
> val dataFrame = testData.toDF()
> dataFrame
> .coalesce(1)
> .write
> .mode(SaveMode.Overwrite)
> .format("orc")
> .option("compression", "zlib")
> .save("s3://ddrinka.sparkbug/dirTest/dir1/dir2/")
> spark.sql("DROP TABLE IF EXISTS ddrinka_sparkbug.dirTest")
> spark.sql("CREATE EXTERNAL TABLE ddrinka_sparkbug.dirTest (val INT) STORED AS 
> ORC LOCATION 's3://ddrinka.sparkbug/dirTest/'")
> spark.conf.set("hive.mapred.supports.subdirectories","true")
> spark.conf.set("mapred.input.dir.recursive","true")
> spark.conf.set("mapreduce.input.fileinputformat.input.dir.recursive","true")
> spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")
> println(spark.sql("SELECT * FROM ddrinka_sparkbug.dirTest").count)
> //0
> spark.conf.set("spark.sql.hive.convertMetastoreOrc", "false")
> println(spark.sql("SELECT * FROM ddrinka_sparkbug.dirTest").count)
> //5{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35110) Handle ANSI intervals in WindowExecBase

2021-04-22 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-35110.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32294
[https://github.com/apache/spark/pull/32294]

> Handle ANSI intervals in WindowExecBase
> ---
>
> Key: SPARK-35110
> URL: https://issues.apache.org/jira/browse/SPARK-35110
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.2.0
>
>
> Handle YearMonthIntervalType and DayTimeIntervalType in createBoundOrdering():
> https://github.com/apache/spark/blob/0494dc90af48ce7da0625485a4dc6917a244d580/sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowExecBase.scala#L97-L99



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-35184) Filtering a dataframe after groupBy and user-define-aggregate-function in Pyspark will cause java.lang.UnsupportedOperationException

2021-04-22 Thread Xiao Jin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Jin updated SPARK-35184:
-
Description: 
I found a strange error while writing a PySpark UDAF. After calling the groupBy 
and agg functions, I want to filter some data from the resulting dataframe, but 
it does not work. My sample code is below.
{code:java}
>>> from pyspark.sql.functions import pandas_udf, PandasUDFType, col
>>> df = spark.createDataFrame(
...     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
...     ("id", "v"))
>>> @pandas_udf("double", PandasUDFType.GROUPED_AGG)
... def mean_udf(v):
...     return v.mean()
...
>>> df.groupby("id").agg(mean_udf(df['v']).alias("mean")) \
...     .filter(col("mean") > 5).show()
{code}
The code above causes the exception printed below:
{code:java}
Traceback (most recent call last):
  File "", line 1, in 
  File "/opt/spark/python/pyspark/sql/dataframe.py", line 378, in show
print(self._jdf.showString(n, 20, vertical))
  File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 
1257, in __call__
  File "/opt/spark/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
  File "/opt/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, 
in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o3717.showString.
: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: execute, tree:
Exchange hashpartitioning(id#1726L, 200)
+- *(1) Filter (mean_udf(v#1727) > 5.0)
   +- Scan ExistingRDD[id#1726L,v#1727]

at 
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
at 
org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.doExecute(ShuffleExchangeExec.scala:119)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at 
org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:391)
at org.apache.spark.sql.execution.SortExec.inputRDDs(SortExec.scala:121)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:627)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at 
org.apache.spark.sql.execution.python.AggregateInPandasExec.doExecute(AggregateInPandasExec.scala:80)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at 
org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247)
at 
org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:339)
at 
org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
at 
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3383)
at 
org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2544)
at 
org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2544)
at org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3364)
at 
org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3363)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2544)
at 

[jira] [Commented] (SPARK-35187) Failure on minimal interval literal

2021-04-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17329118#comment-17329118
 ] 

Apache Spark commented on SPARK-35187:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/32296

> Failure on minimal interval literal
> ---
>
> Key: SPARK-35187
> URL: https://issues.apache.org/jira/browse/SPARK-35187
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Priority: Major
>
> If the sign '-' is inside the interval string, everything works fine after 
> https://github.com/apache/spark/commit/bb5459fb26b9d0d57eadee8b10b7488607eeaeb2:
> {code:sql}
> spark-sql> SELECT INTERVAL '-178956970-8' YEAR TO MONTH;
> -178956970-8
> {code}
> but the sign outside of the interval string is not handled properly:
> {code:sql}
> spark-sql> SELECT INTERVAL -'178956970-8' YEAR TO MONTH;
> Error in query:
> Error parsing interval year-month string: integer overflow(line 1, pos 16)
> == SQL ==
> SELECT INTERVAL -'178956970-8' YEAR TO MONTH
> ^^^
> {code}
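As a side note, a minimal sketch (my own illustration, not taken from the ticket or the linked PR) of why negating only after parsing overflows while folding the sign into the literal does not:
{code:java}
object IntervalSignOverflowSketch {
  def main(args: Array[String]): Unit = {
    // '178956970-8' on its own is 178956970 years and 8 months, i.e.
    // 178956970 * 12 + 8 = 2147483648 months -- one more than Int.MaxValue,
    // so parsing the unsigned string already overflows before negation happens.
    val positiveMonths = 178956970L * 12 + 8
    // Parsing the sign together with the literal targets -2147483648 months,
    // which is exactly Int.MinValue and therefore representable.
    val negativeMonths = -178956970L * 12 - 8
    println(positiveMonths > Int.MaxValue)   // true
    println(negativeMonths == Int.MinValue)  // true
  }
}
{code}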



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33121) Spark Streaming 3.1.1 hangs on shutdown

2021-04-22 Thread Dmitry Tverdokhleb (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitry Tverdokhleb updated SPARK-33121:
---
Summary: Spark Streaming 3.1.1 hangs on shutdown  (was: Spark does not 
shutdown gracefully)

> Spark Streaming 3.1.1 hangs on shutdown
> ---
>
> Key: SPARK-33121
> URL: https://issues.apache.org/jira/browse/SPARK-33121
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 3.0.1
>Reporter: Dmitry Tverdokhleb
>Priority: Major
>
> Hi. I am trying to migrate from spark 2.4.5 to 3.1.1 and there is a problem 
> in graceful shutdown.
> Config parameter "spark.streaming.stopGracefullyOnShutdown" is set as "true".
> Here is the code:
> {code:java}
> inputStream.foreachRDD {
>   rdd =>
> rdd.foreachPartition {
> Thread.sleep(5000)
> }
> }
> {code}
> I send a SIGTERM signal to stop the spark streaming and after sleeping an 
> exception arises:
> {noformat}
> streaming-agg-tds-data_1  | java.util.concurrent.RejectedExecutionException: 
> Task org.apache.spark.executor.Executor$TaskRunner@7ca7f0b8 rejected from 
> java.util.concurrent.ThreadPoolExecutor@2474219c[Terminated, pool size = 0, 
> active threads = 0, queued tasks = 0, completed tasks = 1]
> streaming-agg-tds-data_1  | at 
> java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063)
> streaming-agg-tds-data_1  | at 
> java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830)
> streaming-agg-tds-data_1  | at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379)
> streaming-agg-tds-data_1  | at 
> org.apache.spark.executor.Executor.launchTask(Executor.scala:270)
> streaming-agg-tds-data_1  | at 
> org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93)
> streaming-agg-tds-data_1  | at 
> org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91)
> streaming-agg-tds-data_1  | at 
> scala.collection.Iterator.foreach(Iterator.scala:941)
> streaming-agg-tds-data_1  | at 
> scala.collection.Iterator.foreach$(Iterator.scala:941)
> streaming-agg-tds-data_1  | at 
> scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
> streaming-agg-tds-data_1  | at 
> scala.collection.IterableLike.foreach(IterableLike.scala:74)
> streaming-agg-tds-data_1  | at 
> scala.collection.IterableLike.foreach$(IterableLike.scala:73)
> streaming-agg-tds-data_1  | at 
> scala.collection.AbstractIterable.foreach(Iterable.scala:56)
> streaming-agg-tds-data_1  | at 
> org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91)
> streaming-agg-tds-data_1  | at 
> org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:68)
> streaming-agg-tds-data_1  | at 
> org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
> streaming-agg-tds-data_1  | at 
> org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
> streaming-agg-tds-data_1  | at 
> org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
> streaming-agg-tds-data_1  | at 
> org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
> streaming-agg-tds-data_1  | at 
> org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
> streaming-agg-tds-data_1  | at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> streaming-agg-tds-data_1  | at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> streaming-agg-tds-data_1  | at java.lang.Thread.run(Thread.java:748)
> streaming-agg-tds-data_1  | 2021-04-22 13:33:41 WARN  JobGenerator - Timed 
> out while stopping the job generator (timeout = 1)
> streaming-agg-tds-data_1  | 2021-04-22 13:33:41 INFO  JobGenerator - Waited 
> for jobs to be processed and checkpoints to be written
> streaming-agg-tds-data_1  | 2021-04-22 13:33:41 INFO  JobGenerator - Stopped 
> JobGenerator{noformat}
> After this exception and "JobGenerator - Stopped JobGenerator" log, streaming 
> freezes, and halts by timeout (Config parameter 
> "hadoop.service.shutdown.timeout").
> Besides, there is no problem with the graceful shutdown in spark 2.4.5.
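For anyone trying to reproduce this, a minimal self-contained sketch based on the snippet quoted above (the socket source, batch interval, and local master are my own assumptions, not taken from the report):
{code:java}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object GracefulShutdownRepro {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[2]")
      .setAppName("graceful-shutdown-repro")
      .set("spark.streaming.stopGracefullyOnShutdown", "true")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Any receiver-based source works; a socket stream keeps the sketch short.
    val inputStream = ssc.socketTextStream("localhost", 9999)
    inputStream.foreachRDD { rdd =>
      rdd.foreachPartition { _ =>
        Thread.sleep(5000) // simulate per-partition work, as in the report
      }
    }

    ssc.start()
    // Send SIGTERM to the driver while a batch is inside the sleep above.
    ssc.awaitTermination()
  }
}
{code}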



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34079) Improvement CTE table scan

2021-04-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17329138#comment-17329138
 ] 

Apache Spark commented on SPARK-34079:
--

User 'peter-toth' has created a pull request for this issue:
https://github.com/apache/spark/pull/32298

> Improvement CTE table scan
> --
>
> Key: SPARK-34079
> URL: https://issues.apache.org/jira/browse/SPARK-34079
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
>
> Prepare table:
> {code:sql}
> CREATE TABLE store_sales (  ss_sold_date_sk INT,  ss_sold_time_sk INT,  
> ss_item_sk INT,  ss_customer_sk INT,  ss_cdemo_sk INT,  ss_hdemo_sk INT,  
> ss_addr_sk INT,  ss_store_sk INT,  ss_promo_sk INT,  ss_ticket_number INT,  
> ss_quantity INT,  ss_wholesale_cost DECIMAL(7,2),  ss_list_price 
> DECIMAL(7,2),  ss_sales_price DECIMAL(7,2),  ss_ext_discount_amt 
> DECIMAL(7,2),  ss_ext_sales_price DECIMAL(7,2),  ss_ext_wholesale_cost 
> DECIMAL(7,2),  ss_ext_list_price DECIMAL(7,2),  ss_ext_tax DECIMAL(7,2),  
> ss_coupon_amt DECIMAL(7,2),  ss_net_paid DECIMAL(7,2),  ss_net_paid_inc_tax 
> DECIMAL(7,2),ss_net_profit DECIMAL(7,2));
> CREATE TABLE reason (  r_reason_sk INT,  r_reason_id varchar(255),  
> r_reason_desc varchar(255));
> {code}
> SQL:
> {code:sql}
> WITH bucket_result AS (
> SELECT
> CASE WHEN (count (CASE WHEN ss_quantity BETWEEN 1 AND 20 THEN ss_quantity 
> END)) > 62316685
>   THEN (avg(CASE WHEN ss_quantity BETWEEN 1 AND 20 THEN 
> ss_ext_discount_amt END))
> ELSE (avg(CASE WHEN ss_quantity BETWEEN 1 AND 20 THEN ss_net_paid END)) 
> END bucket1,
> CASE WHEN (count (CASE WHEN ss_quantity BETWEEN 21 AND 40 THEN 
> ss_quantity END)) > 19045798
>   THEN (avg(CASE WHEN ss_quantity BETWEEN 21 AND 40 THEN 
> ss_ext_discount_amt END))
> ELSE (avg(CASE WHEN ss_quantity BETWEEN 21 AND 40 THEN ss_net_paid END)) 
> END bucket2,
> CASE WHEN (count (CASE WHEN ss_quantity BETWEEN 41 AND 60 THEN 
> ss_quantity END)) > 365541424
>   THEN (avg(CASE WHEN ss_quantity BETWEEN 41 AND 60 THEN 
> ss_ext_discount_amt END))
> ELSE (avg(CASE WHEN ss_quantity BETWEEN 41 AND 60 THEN ss_net_paid END)) 
> END bucket3,
> CASE WHEN (count (CASE WHEN ss_quantity BETWEEN 61 AND 80 THEN 
> ss_quantity END)) > 19045798
>   THEN (avg(CASE WHEN ss_quantity BETWEEN 61 AND 80 THEN 
> ss_ext_discount_amt END))
> ELSE (avg(CASE WHEN ss_quantity BETWEEN 61 AND 80 THEN ss_net_paid END)) 
> END bucket4,
> CASE WHEN (count (CASE WHEN ss_quantity BETWEEN 81 AND 100 THEN 
> ss_quantity END)) > 365541424
>   THEN (avg(CASE WHEN ss_quantity BETWEEN 81 AND 100 THEN 
> ss_ext_discount_amt END))
> ELSE (avg(CASE WHEN ss_quantity BETWEEN 81 AND 100 THEN ss_net_paid END)) 
> END bucket5
>   FROM store_sales
> )
> SELECT
>   (SELECT bucket1 FROM bucket_result) as bucket1,
>   (SELECT bucket2 FROM bucket_result) as bucket2,
>   (SELECT bucket3 FROM bucket_result) as bucket3,
>   (SELECT bucket4 FROM bucket_result) as bucket4,
>   (SELECT bucket5 FROM bucket_result) as bucket5
> FROM reason
> WHERE r_reason_sk = 1;
> {code}
> Plan of Spark SQL:
> {noformat}
> == Physical Plan ==
> AdaptiveSparkPlan isFinalPlan=false
> +- Project [Subquery subquery#0, [id=#23] AS bucket1#1, Subquery subquery#2, 
> [id=#34] AS bucket2#3, Subquery subquery#4, [id=#45] AS bucket3#5, Subquery 
> subquery#6, [id=#56] AS bucket4#7, Subquery subquery#8, [id=#67] AS bucket5#9]
>:  :- Subquery subquery#0, [id=#23]
>:  :  +- AdaptiveSparkPlan isFinalPlan=false
>:  : +- HashAggregate(keys=[], functions=[count(if (((ss_quantity#28 
> >= 1) AND (ss_quantity#28 <= 20))) ss_quantity#28 else null), 
> avg(UnscaledValue(if (((ss_quantity#28 >= 1) AND (ss_quantity#28 <= 20))) 
> ss_ext_discount_amt#32 else null)), avg(UnscaledValue(if (((ss_quantity#28 >= 
> 1) AND (ss_quantity#28 <= 20))) ss_net_paid#38 else null))])
>:  :+- Exchange SinglePartition, ENSURE_REQUIREMENTS, [id=#21]
>:  :   +- HashAggregate(keys=[], functions=[partial_count(if 
> (((ss_quantity#28 >= 1) AND (ss_quantity#28 <= 20))) ss_quantity#28 else 
> null), partial_avg(UnscaledValue(if (((ss_quantity#28 >= 1) AND 
> (ss_quantity#28 <= 20))) ss_ext_discount_amt#32 else null)), 
> partial_avg(UnscaledValue(if (((ss_quantity#28 >= 1) AND (ss_quantity#28 <= 
> 20))) ss_net_paid#38 else null))])
>:  :  +- FileScan parquet 
> default.store_sales[ss_quantity#28,ss_ext_discount_amt#32,ss_net_paid#38] 
> Batched: true, DataFilters: [], Format: Parquet, Location: 
> InMemoryFileIndex[file:/Users/yumwang/spark/SPARK-28169/spark-warehouse/org.apache.spark.sql.Data...,
>  PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct
>:  :- Subquery subquery#2, [id=#34]
>:  :  +- 

[jira] [Assigned] (SPARK-35187) Failure on minimal interval literal

2021-04-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35187:


Assignee: (was: Apache Spark)

> Failure on minimal interval literal
> ---
>
> Key: SPARK-35187
> URL: https://issues.apache.org/jira/browse/SPARK-35187
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Priority: Major
>
> If the sign '-' is inside the interval string, everything works fine after 
> https://github.com/apache/spark/commit/bb5459fb26b9d0d57eadee8b10b7488607eeaeb2:
> {code:sql}
> spark-sql> SELECT INTERVAL '-178956970-8' YEAR TO MONTH;
> -178956970-8
> {code}
> but the sign outside of the interval string is not handled properly:
> {code:sql}
> spark-sql> SELECT INTERVAL -'178956970-8' YEAR TO MONTH;
> Error in query:
> Error parsing interval year-month string: integer overflow(line 1, pos 16)
> == SQL ==
> SELECT INTERVAL -'178956970-8' YEAR TO MONTH
> ^^^
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35192) Port minimal TPC-DS datagen code from databricks/spark-sql-perf

2021-04-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17329108#comment-17329108
 ] 

Apache Spark commented on SPARK-35192:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/32243

> Port minimal TPC-DS datagen code from databricks/spark-sql-perf
> ---
>
> Key: SPARK-35192
> URL: https://issues.apache.org/jira/browse/SPARK-35192
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.2.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> This PR aims at porting minimal code to generate TPC-DS data from 
> databricks/spark-sql-perf. The classes in a new class file tpcdsDatagen.scala 
> are basically copied from the databricks/spark-sql-perf codebase.
> We frequently use TPCDS data now for benchmarks/tests, but the classes for 
> the TPCDS schemas of datagen and benchmarks/tests are managed separately, 
> e.g.,
> https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/TPCDSBase.scala
> https://github.com/databricks/spark-sql-perf/blob/master/src/main/scala/com/databricks/spark/sql/perf/tpcds/TPCDSTables.scala
> I think this causes some inconveniences, e.g., we need to update both files 
> in the separate repositories if we update the TPCDS schema #32037. So, it 
> would be useful for the Spark codebase to generate them by referring to the 
> same schema definition.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33121) Spark does not shutdown gracefully

2021-04-22 Thread Dmitry Tverdokhleb (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitry Tverdokhleb updated SPARK-33121:
---
Description: 
Hi. I am trying to migrate from spark 2.4.5 to 3.0.1 and there is a problem in 
graceful shutdown.

Config parameter "spark.streaming.stopGracefullyOnShutdown" is set as "true".

Here is the code:
{code:java}
inputStream.foreachRDD { rdd =>
  rdd.foreachPartition { _ =>
    Thread.sleep(5000) // simulate per-partition work
  }
}
{code}
I send a SIGTERM signal to stop the Spark streaming application, but an exception arises:
{noformat}
2020-10-12 14:12:29 ERROR Inbox - Ignoring error
java.util.concurrent.RejectedExecutionException: Task 
org.apache.spark.executor.Executor$TaskRunner@68a46d91 rejected from 
java.util.concurrent.ThreadPoolExecutor@2b13eeda[Terminated, pool size = 0, 
active threads = 0, queued tasks = 0, completed tasks = 6]
at 
java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063)
at 
java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830)
at 
java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379)
at org.apache.spark.executor.Executor.launchTask(Executor.scala:230)
at 
org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93)
at 
org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at 
org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91)
at 
org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:68)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at 
org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at 
org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748){noformat}
An excerpt from the logs:
{noformat}
...
Calling rdd.mapPartitions
...
Sending SIGTERM signal
...
2020-10-12 14:12:22 INFO  MyProject - Shutdown hook called
2020-10-12 14:12:22 DEBUG JobScheduler - Stopping JobScheduler
2020-10-12 14:12:22 INFO  ReceiverTracker - ReceiverTracker stopped
2020-10-12 14:12:22 INFO  JobGenerator - Stopping JobGenerator gracefully
2020-10-12 14:12:22 INFO  JobGenerator - Waiting for all received blocks to be 
consumed for job generation
2020-10-12 14:12:22 INFO  JobGenerator - Waited for all received blocks to be 
consumed for job generation
...
Calling rdd.filter
2020-10-12 14:12:29 ERROR Inbox - Ignoring error
java.util.concurrent.RejectedExecutionException: Task 
org.apache.spark.executor.Executor$TaskRunner@68a46d91 rejected from 
java.util.concurrent.ThreadPoolExecutor@2b13eeda[Terminated, pool size = 0, 
active threads = 0, queued tasks = 0, completed tasks = 6]{noformat}
After this exception, streaming freezes and halts by timeout (Config parameter 
"hadoop.service.shutdown.timeout").

Note that this exception arises only for RDD operations (like map, filter, etc.); the 
business logic itself runs normally without any errors.

Besides, there is no problem with graceful shutdown in Spark 2.4.5.
  
  
  
  

  was:
Hi. I am trying to migrate from spark 2.4.5 to 3.0.1 and there is a problem in 
graceful shutdown.

Config parameter "spark.streaming.stopGracefullyOnShutdown" is set as "true".

Here is the code:
{code:java}
inputStream.foreachRDD {
  rdd =>
rdd
  .mapPartitions {
// Some operations mapPartitions
  }
  .filter {
// Some operations filter
  }
  .groupBy {
// Some operatons groupBy
  }
}
{code}
I send a SIGTERM signal to stop the spark streaming, but exception arrises:
{noformat}
2020-10-12 14:12:29 ERROR Inbox - Ignoring error
java.util.concurrent.RejectedExecutionException: Task 
org.apache.spark.executor.Executor$TaskRunner@68a46d91 rejected from 
java.util.concurrent.ThreadPoolExecutor@2b13eeda[Terminated, pool size = 0, 
active threads = 0, queued tasks = 0, completed tasks = 6]
at 

[jira] [Commented] (SPARK-33121) Spark does not shutdown gracefully

2021-04-22 Thread Dmitry Tverdokhleb (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17329129#comment-17329129
 ] 

Dmitry Tverdokhleb commented on SPARK-33121:


Tested on Spark Streaming 3.1.1 with a simple example.

> Spark does not shutdown gracefully
> --
>
> Key: SPARK-33121
> URL: https://issues.apache.org/jira/browse/SPARK-33121
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 3.0.1
>Reporter: Dmitry Tverdokhleb
>Priority: Major
>
> Hi. I am trying to migrate from spark 2.4.5 to 3.1.1 and there is a problem 
> in graceful shutdown.
> Config parameter "spark.streaming.stopGracefullyOnShutdown" is set as "true".
> Here is the code:
> {code:java}
> inputStream.foreachRDD {
>   rdd =>
> rdd.foreachPartition {
> Thread.sleep(5000)
> }
> }
> {code}
> I send a SIGTERM signal to stop the spark streaming and after sleeping an 
> exception arises:
> {noformat}
> streaming-agg-tds-data_1  | java.util.concurrent.RejectedExecutionException: 
> Task org.apache.spark.executor.Executor$TaskRunner@7ca7f0b8 rejected from 
> java.util.concurrent.ThreadPoolExecutor@2474219c[Terminated, pool size = 0, 
> active threads = 0, queued tasks = 0, completed tasks = 1]
> streaming-agg-tds-data_1  | at 
> java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063)
> streaming-agg-tds-data_1  | at 
> java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830)
> streaming-agg-tds-data_1  | at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379)
> streaming-agg-tds-data_1  | at 
> org.apache.spark.executor.Executor.launchTask(Executor.scala:270)
> streaming-agg-tds-data_1  | at 
> org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93)
> streaming-agg-tds-data_1  | at 
> org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91)
> streaming-agg-tds-data_1  | at 
> scala.collection.Iterator.foreach(Iterator.scala:941)
> streaming-agg-tds-data_1  | at 
> scala.collection.Iterator.foreach$(Iterator.scala:941)
> streaming-agg-tds-data_1  | at 
> scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
> streaming-agg-tds-data_1  | at 
> scala.collection.IterableLike.foreach(IterableLike.scala:74)
> streaming-agg-tds-data_1  | at 
> scala.collection.IterableLike.foreach$(IterableLike.scala:73)
> streaming-agg-tds-data_1  | at 
> scala.collection.AbstractIterable.foreach(Iterable.scala:56)
> streaming-agg-tds-data_1  | at 
> org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91)
> streaming-agg-tds-data_1  | at 
> org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:68)
> streaming-agg-tds-data_1  | at 
> org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
> streaming-agg-tds-data_1  | at 
> org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
> streaming-agg-tds-data_1  | at 
> org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
> streaming-agg-tds-data_1  | at 
> org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
> streaming-agg-tds-data_1  | at 
> org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
> streaming-agg-tds-data_1  | at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> streaming-agg-tds-data_1  | at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> streaming-agg-tds-data_1  | at java.lang.Thread.run(Thread.java:748)
> streaming-agg-tds-data_1  | 2021-04-22 13:33:41 WARN  JobGenerator - Timed 
> out while stopping the job generator (timeout = 1)
> streaming-agg-tds-data_1  | 2021-04-22 13:33:41 INFO  JobGenerator - Waited 
> for jobs to be processed and checkpoints to be written
> streaming-agg-tds-data_1  | 2021-04-22 13:33:41 INFO  JobGenerator - Stopped 
> JobGenerator{noformat}
> After this exception and "JobGenerator - Stopped JobGenerator" log, streaming 
> freezes, and halts by timeout (Config parameter 
> "hadoop.service.shutdown.timeout").
> Besides, there is no problem with the graceful shutdown in spark 2.4.5.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33121) Spark does not shutdown gracefully

2021-04-22 Thread Dmitry Tverdokhleb (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitry Tverdokhleb updated SPARK-33121:
---
Description: 
Hi. I am trying to migrate from spark 2.4.5 to 3.1.1 and there is a problem in 
graceful shutdown.

Config parameter "spark.streaming.stopGracefullyOnShutdown" is set as "true".

Here is the code:
{code:java}
inputStream.foreachRDD { rdd =>
  rdd.foreachPartition { _ =>
    Thread.sleep(5000) // simulate per-partition work
  }
}
{code}
I send a SIGTERM signal to stop the spark streaming and after sleeping an 
exception arises:
{noformat}
streaming-agg-tds-data_1  | java.util.concurrent.RejectedExecutionException: 
Task org.apache.spark.executor.Executor$TaskRunner@7ca7f0b8 rejected from 
java.util.concurrent.ThreadPoolExecutor@2474219c[Terminated, pool size = 0, 
active threads = 0, queued tasks = 0, completed tasks = 1]
streaming-agg-tds-data_1  | at 
java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063)
streaming-agg-tds-data_1  | at 
java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830)
streaming-agg-tds-data_1  | at 
java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379)
streaming-agg-tds-data_1  | at 
org.apache.spark.executor.Executor.launchTask(Executor.scala:270)
streaming-agg-tds-data_1  | at 
org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93)
streaming-agg-tds-data_1  | at 
org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91)
streaming-agg-tds-data_1  | at 
scala.collection.Iterator.foreach(Iterator.scala:941)
streaming-agg-tds-data_1  | at 
scala.collection.Iterator.foreach$(Iterator.scala:941)
streaming-agg-tds-data_1  | at 
scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
streaming-agg-tds-data_1  | at 
scala.collection.IterableLike.foreach(IterableLike.scala:74)
streaming-agg-tds-data_1  | at 
scala.collection.IterableLike.foreach$(IterableLike.scala:73)
streaming-agg-tds-data_1  | at 
scala.collection.AbstractIterable.foreach(Iterable.scala:56)
streaming-agg-tds-data_1  | at 
org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91)
streaming-agg-tds-data_1  | at 
org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:68)
streaming-agg-tds-data_1  | at 
org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
streaming-agg-tds-data_1  | at 
org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
streaming-agg-tds-data_1  | at 
org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
streaming-agg-tds-data_1  | at 
org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
streaming-agg-tds-data_1  | at 
org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
streaming-agg-tds-data_1  | at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
streaming-agg-tds-data_1  | at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
streaming-agg-tds-data_1  | at java.lang.Thread.run(Thread.java:748)
streaming-agg-tds-data_1  | 2021-04-22 13:33:41 WARN  JobGenerator - Timed out 
while stopping the job generator (timeout = 1)
streaming-agg-tds-data_1  | 2021-04-22 13:33:41 INFO  JobGenerator - Waited for 
jobs to be processed and checkpoints to be written
streaming-agg-tds-data_1  | 2021-04-22 13:33:41 INFO  JobGenerator - Stopped 
JobGenerator{noformat}
After this exception and "JobGenerator - Stopped JobGenerator" log, streaming 
freezes, and halts by timeout (Config parameter 
"hadoop.service.shutdown.timeout").

Besides, there is no problem with the graceful shutdown in spark 2.4.5.

  was:
Hi. I am trying to migrate from spark 2.4.5 to 3.1.1 and there is a problem in 
graceful shutdown.

Config parameter "spark.streaming.stopGracefullyOnShutdown" is set as "true".

Here is the code:
{code:java}
inputStream.foreachRDD {
  rdd =>
rdd.foreachPartition {
Thread.sleep(5000)
}
}
{code}
I send a SIGTERM signal to stop the spark streaming and after sleeping an 
exception arises:
{noformat}
streaming-agg-tds-data_1  | java.util.concurrent.RejectedExecutionException: 
Task org.apache.spark.executor.Executor$TaskRunner@7ca7f0b8 rejected from 
java.util.concurrent.ThreadPoolExecutor@2474219c[Terminated, pool size = 0, 
active threads = 0, queued tasks = 0, completed tasks = 1]
streaming-agg-tds-data_1  | at 
java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063)
streaming-agg-tds-data_1  | at 
java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830)
streaming-agg-tds-data_1  | at 

[jira] [Updated] (SPARK-33121) Spark Streaming 3.1.1 hangs on shutdown

2021-04-22 Thread Dmitry Tverdokhleb (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitry Tverdokhleb updated SPARK-33121:
---
Affects Version/s: (was: 3.0.1)
   3.1.1

> Spark Streaming 3.1.1 hangs on shutdown
> ---
>
> Key: SPARK-33121
> URL: https://issues.apache.org/jira/browse/SPARK-33121
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 3.1.1
>Reporter: Dmitry Tverdokhleb
>Priority: Major
>
> Hi. I am trying to migrate from spark 2.4.5 to 3.1.1 and there is a problem 
> in graceful shutdown.
> Config parameter "spark.streaming.stopGracefullyOnShutdown" is set as "true".
> Here is the code:
> {code:java}
> inputStream.foreachRDD {
>   rdd =>
> rdd.foreachPartition {
> Thread.sleep(5000)
> }
> }
> {code}
> I send a SIGTERM signal to stop the spark streaming and after sleeping an 
> exception arises:
> {noformat}
> streaming-agg-tds-data_1  | java.util.concurrent.RejectedExecutionException: 
> Task org.apache.spark.executor.Executor$TaskRunner@7ca7f0b8 rejected from 
> java.util.concurrent.ThreadPoolExecutor@2474219c[Terminated, pool size = 0, 
> active threads = 0, queued tasks = 0, completed tasks = 1]
> streaming-agg-tds-data_1  | at 
> java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063)
> streaming-agg-tds-data_1  | at 
> java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830)
> streaming-agg-tds-data_1  | at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379)
> streaming-agg-tds-data_1  | at 
> org.apache.spark.executor.Executor.launchTask(Executor.scala:270)
> streaming-agg-tds-data_1  | at 
> org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93)
> streaming-agg-tds-data_1  | at 
> org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91)
> streaming-agg-tds-data_1  | at 
> scala.collection.Iterator.foreach(Iterator.scala:941)
> streaming-agg-tds-data_1  | at 
> scala.collection.Iterator.foreach$(Iterator.scala:941)
> streaming-agg-tds-data_1  | at 
> scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
> streaming-agg-tds-data_1  | at 
> scala.collection.IterableLike.foreach(IterableLike.scala:74)
> streaming-agg-tds-data_1  | at 
> scala.collection.IterableLike.foreach$(IterableLike.scala:73)
> streaming-agg-tds-data_1  | at 
> scala.collection.AbstractIterable.foreach(Iterable.scala:56)
> streaming-agg-tds-data_1  | at 
> org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91)
> streaming-agg-tds-data_1  | at 
> org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:68)
> streaming-agg-tds-data_1  | at 
> org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
> streaming-agg-tds-data_1  | at 
> org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
> streaming-agg-tds-data_1  | at 
> org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
> streaming-agg-tds-data_1  | at 
> org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
> streaming-agg-tds-data_1  | at 
> org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
> streaming-agg-tds-data_1  | at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> streaming-agg-tds-data_1  | at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> streaming-agg-tds-data_1  | at java.lang.Thread.run(Thread.java:748)
> streaming-agg-tds-data_1  | 2021-04-22 13:33:41 WARN  JobGenerator - Timed 
> out while stopping the job generator (timeout = 1)
> streaming-agg-tds-data_1  | 2021-04-22 13:33:41 INFO  JobGenerator - Waited 
> for jobs to be processed and checkpoints to be written
> streaming-agg-tds-data_1  | 2021-04-22 13:33:41 INFO  JobGenerator - Stopped 
> JobGenerator{noformat}
> After this exception and "JobGenerator - Stopped JobGenerator" log, streaming 
> freezes, and halts by timeout (Config parameter 
> "hadoop.service.shutdown.timeout").
> Besides, there is no problem with the graceful shutdown in spark 2.4.5.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35192) Port minimal TPC-DS datagen code from databricks/spark-sql-perf

2021-04-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17329107#comment-17329107
 ] 

Apache Spark commented on SPARK-35192:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/32243

> Port minimal TPC-DS datagen code from databricks/spark-sql-perf
> ---
>
> Key: SPARK-35192
> URL: https://issues.apache.org/jira/browse/SPARK-35192
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.2.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> This PR aims at porting minimal code to generate TPC-DS data from 
> databricks/spark-sql-perf. The classes in a new class file tpcdsDatagen.scala 
> are basically copied from the databricks/spark-sql-perf codebase.
> We frequently use TPCDS data now for benchmarks/tests, but the classes for 
> the TPCDS schemas of datagen and benchmarks/tests are managed separately, 
> e.g.,
> https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/TPCDSBase.scala
> https://github.com/databricks/spark-sql-perf/blob/master/src/main/scala/com/databricks/spark/sql/perf/tpcds/TPCDSTables.scala
> I think this causes some inconveniences, e.g., we need to update both files 
> in the separate repositories if we update the TPCDS schema #32037. So, it 
> would be useful for the Spark codebase to generate them by referring to the 
> same schema definition.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34883) Setting CSV reader option "multiLine" to "true" causes URISyntaxException when colon is in file path

2021-04-22 Thread Brady Tello (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17329132#comment-17329132
 ] 

Brady Tello commented on SPARK-34883:
-

[~Vikas_Yadav]

You don't have a colon in your `inputFile` path.  See the following code that 
contains a colon in the path to the dataset.  It fails with a 
URISyntaxException as expected:
{code:java}
>>> inputFile = "/Users/home_dir/Workspaces/datasets/with:colon/iris.csv" 
>>> tempDF = spark.read.csv(inputFile, multiLine=True) 
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/brady.tello/Workspaces/spark/python/pyspark/sql/readwriter.py", line 737, in csv
    return self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
  File "/Users/brady.tello/Workspaces/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
  File "/Users/brady.tello/Workspaces/spark/python/pyspark/sql/utils.py", line 117, in deco
    raise converted from None
pyspark.sql.utils.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: with:colon
{code}
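For what it's worth, here is a tiny sketch of the Hadoop Path behaviour that produces this exact message. This is my guess at the underlying trigger (a path segment with a colon being parsed as a URI scheme), not a confirmed trace of the multiLine code path, and it assumes hadoop-common on the classpath:
{code:java}
import org.apache.hadoop.fs.Path

object ColonPathSketch {
  def main(args: Array[String]): Unit = {
    // "test:dir" has a ':' before any '/', so Path reads "test" as a URI
    // scheme and "dir" as a relative path, which java.net.URI rejects.
    try {
      new Path("test:dir")
    } catch {
      case e: IllegalArgumentException =>
        // prints: java.net.URISyntaxException: Relative path in absolute URI: test:dir
        println(e.getMessage)
    }
  }
}
{code}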

> Setting CSV reader option "multiLine" to "true" causes URISyntaxException 
> when colon is in file path
> 
>
> Key: SPARK-34883
> URL: https://issues.apache.org/jira/browse/SPARK-34883
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0, 3.1.1
>Reporter: Brady Tello
>Priority: Major
>
> Setting the CSV reader's "multiLine" option to "True" throws the following 
> exception when a ':' character is in the file path.
>  
> {code:java}
> java.net.URISyntaxException: Relative path in absolute URI: test:dir
> {code}
> I've tested this in both Spark 3.0.0 and Spark 3.1.1 and I get the same error 
> whether I use Scala, Python, or SQL.
> The following code works fine:
>  
> {code:java}
> csvFile = "/FileStore/myDir/test:dir/pageviews_by_second.tsv" 
> tempDF = (spark.read.option("sep", "\t").csv(csvFile)
> {code}
> While the following code fails:
>  
> {code:java}
> csvFile = "/FileStore/myDir/test:dir/pageviews_by_second.tsv"
> tempDF = (spark.read.option("sep", "\t").option("multiLine", 
> "True").csv(csvFile))
> {code}
> Full Stack Trace from Python:
>  
> {code:java}
> --- 
> IllegalArgumentException Traceback (most recent call last)  
> in  
> 3 csvFile = "/FileStore/myDir/test:dir/pageviews_by_second.tsv" 
> 4 
> > 5  tempDF = (spark.read.option("sep", "\t").option("multiLine", "True") 
> /databricks/spark/python/pyspark/sql/readwriter.py in csv(self, path, schema, 
> sep, encoding, quote, escape, comment, header, inferSchema, 
> ignoreLeadingWhiteSpace, ignoreTrailingWhiteSpace, nullValue, nanValue, 
> positiveInf, negativeInf, dateFormat, timestampFormat, maxColumns, 
> maxCharsPerColumn, maxMalformedLogPerPartition, mode, 
> columnNameOfCorruptRecord, multiLine, charToEscapeQuoteEscaping, 
> samplingRatio, enforceSchema, emptyValue, locale, lineSep, pathGlobFilter, 
> recursiveFileLookup, modifiedBefore, modifiedAfter, unescapedQuoteHandling) 
> 735 path = [path] 
> 736 if type(path) == list: 
> --> 737 return 
> self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path))) 
> 738 elif isinstance(path, RDD): 
> 739 def func(iterator): 
> /databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in 
> __call__(self, *args) 
> 1302 
> 1303 answer = self.gateway_client.send_command(command) 
> -> 1304 return_value = get_return_value( 
> 1305 answer, self.gateway_client, self.target_id, self.name) 
> 1306 
> /databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw) 
> 114 # Hide where the exception came from that shows a non-Pythonic 
> 115 # JVM exception message. 
> --> 116 raise converted from None 
> 117 else: 
> 118 raise IllegalArgumentException: java.net.URISyntaxException: Relative 
> path in absolute URI: test:dir
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33121) Spark Streaming 3.1.1 hangs on shutdown

2021-04-22 Thread Dmitry Tverdokhleb (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitry Tverdokhleb updated SPARK-33121:
---
Labels: 3.1.1 Streaming hang shutdown  (was: )

> Spark Streaming 3.1.1 hangs on shutdown
> ---
>
> Key: SPARK-33121
> URL: https://issues.apache.org/jira/browse/SPARK-33121
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 3.1.1
>Reporter: Dmitry Tverdokhleb
>Priority: Critical
>  Labels: 3.1.1, Streaming, hang, shutdown
>
> Hi. I am trying to migrate from spark 2.4.5 to 3.1.1 and there is a problem 
> in graceful shutdown.
> Config parameter "spark.streaming.stopGracefullyOnShutdown" is set as "true".
> Here is the code:
> {code:java}
> inputStream.foreachRDD {
>   rdd =>
> rdd.foreachPartition {
> Thread.sleep(5000)
> }
> }
> {code}
> I send a SIGTERM signal to stop the spark streaming and after sleeping an 
> exception arises:
> {noformat}
> streaming-agg-tds-data_1  | java.util.concurrent.RejectedExecutionException: 
> Task org.apache.spark.executor.Executor$TaskRunner@7ca7f0b8 rejected from 
> java.util.concurrent.ThreadPoolExecutor@2474219c[Terminated, pool size = 0, 
> active threads = 0, queued tasks = 0, completed tasks = 1]
> streaming-agg-tds-data_1  | at 
> java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2063)
> streaming-agg-tds-data_1  | at 
> java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:830)
> streaming-agg-tds-data_1  | at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1379)
> streaming-agg-tds-data_1  | at 
> org.apache.spark.executor.Executor.launchTask(Executor.scala:270)
> streaming-agg-tds-data_1  | at 
> org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1(LocalSchedulerBackend.scala:93)
> streaming-agg-tds-data_1  | at 
> org.apache.spark.scheduler.local.LocalEndpoint.$anonfun$reviveOffers$1$adapted(LocalSchedulerBackend.scala:91)
> streaming-agg-tds-data_1  | at 
> scala.collection.Iterator.foreach(Iterator.scala:941)
> streaming-agg-tds-data_1  | at 
> scala.collection.Iterator.foreach$(Iterator.scala:941)
> streaming-agg-tds-data_1  | at 
> scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
> streaming-agg-tds-data_1  | at 
> scala.collection.IterableLike.foreach(IterableLike.scala:74)
> streaming-agg-tds-data_1  | at 
> scala.collection.IterableLike.foreach$(IterableLike.scala:73)
> streaming-agg-tds-data_1  | at 
> scala.collection.AbstractIterable.foreach(Iterable.scala:56)
> streaming-agg-tds-data_1  | at 
> org.apache.spark.scheduler.local.LocalEndpoint.reviveOffers(LocalSchedulerBackend.scala:91)
> streaming-agg-tds-data_1  | at 
> org.apache.spark.scheduler.local.LocalEndpoint$$anonfun$receive$1.applyOrElse(LocalSchedulerBackend.scala:68)
> streaming-agg-tds-data_1  | at 
> org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
> streaming-agg-tds-data_1  | at 
> org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
> streaming-agg-tds-data_1  | at 
> org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
> streaming-agg-tds-data_1  | at 
> org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
> streaming-agg-tds-data_1  | at 
> org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
> streaming-agg-tds-data_1  | at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> streaming-agg-tds-data_1  | at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> streaming-agg-tds-data_1  | at java.lang.Thread.run(Thread.java:748)
> streaming-agg-tds-data_1  | 2021-04-22 13:33:41 WARN  JobGenerator - Timed 
> out while stopping the job generator (timeout = 1)
> streaming-agg-tds-data_1  | 2021-04-22 13:33:41 INFO  JobGenerator - Waited 
> for jobs to be processed and checkpoints to be written
> streaming-agg-tds-data_1  | 2021-04-22 13:33:41 INFO  JobGenerator - Stopped 
> JobGenerator{noformat}
> After this exception and "JobGenerator - Stopped JobGenerator" log, streaming 
> freezes, and halts by timeout (Config parameter 
> "hadoop.service.shutdown.timeout").
> Besides, there is no problem with the graceful shutdown in spark 2.4.5.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35187) Failure on minimal interval literal

2021-04-22 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17329116#comment-17329116
 ] 

Apache Spark commented on SPARK-35187:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/32296

> Failure on minimal interval literal
> ---
>
> Key: SPARK-35187
> URL: https://issues.apache.org/jira/browse/SPARK-35187
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Priority: Major
>
> If the sign '-' is inside the interval string, everything works fine after 
> https://github.com/apache/spark/commit/bb5459fb26b9d0d57eadee8b10b7488607eeaeb2:
> {code:sql}
> spark-sql> SELECT INTERVAL '-178956970-8' YEAR TO MONTH;
> -178956970-8
> {code}
> but the sign outside of the interval string is not handled properly:
> {code:sql}
> spark-sql> SELECT INTERVAL -'178956970-8' YEAR TO MONTH;
> Error in query:
> Error parsing interval year-month string: integer overflow(line 1, pos 16)
> == SQL ==
> SELECT INTERVAL -'178956970-8' YEAR TO MONTH
> ^^^
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32380) sparksql cannot access hive table while data in hbase

2021-04-22 Thread Yennam ShowryPremSagar Reddy (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17329119#comment-17329119
 ] 

Yennam ShowryPremSagar Reddy commented on SPARK-32380:
--

[~meimile] Will there be a permanent fix for this issue of accessing Hive tables backed by 
HBase from Spark 3?

> sparksql cannot access hive table while data in hbase
> -
>
> Key: SPARK-32380
> URL: https://issues.apache.org/jira/browse/SPARK-32380
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: ||component||version||
> |hadoop|2.8.5|
> |hive|2.3.7|
> |spark|3.0.0|
> |hbase|1.4.9|
>Reporter: deyzhong
>Priority: Major
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>
> * step1: create hbase table
> {code:java}
>  hbase(main):001:0>create 'hbase_test1', 'cf1'
>  hbase(main):001:0> put 'hbase_test', 'r1', 'cf1:c1', '123'
> {code}
>  * step2: create hive table related to hbase table
>  
> {code:java}
> hive> 
> CREATE EXTERNAL TABLE `hivetest.hbase_test`(
>   `key` string COMMENT '', 
>   `value` string COMMENT '')
> ROW FORMAT SERDE 
>   'org.apache.hadoop.hive.hbase.HBaseSerDe' 
> STORED BY 
>   'org.apache.hadoop.hive.hbase.HBaseStorageHandler' 
> WITH SERDEPROPERTIES ( 
>   'hbase.columns.mapping'=':key,cf1:v1', 
>   'serialization.format'='1')
> TBLPROPERTIES (
>   'hbase.table.name'='hbase_test')
>  {code}
>  * step3: sparksql query hive table while data in hbase
> {code:java}
> spark-sql --master yarn -e "select * from hivetest.hbase_test"
> {code}
>  
> The error log is as follows:
> java.io.IOException: Cannot create a record reader because of a previous 
> error. Please look at the previous logs lines from the task's full log for 
> more details.
>  at 
> org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:270)
>  at org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:131)
>  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
>  at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
>  at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
>  at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
>  at 
> org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49)
>  at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:276)
>  at scala.Option.getOrElse(Option.scala:189)
>  at org.apache.spark.rdd.RDD.partitions(RDD.scala:272)
>  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2158)
>  at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1004)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>  at org.apache.spark.rdd.RDD.withScope(RDD.scala:388)
>  at org.apache.spark.rdd.RDD.collect(RDD.scala:1003)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:385)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeCollectPublic(SparkPlan.scala:412)
>  at 
> org.apache.spark.sql.execution.HiveResult$.hiveResultString(HiveResult.scala:58)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.$anonfun$run$1(SparkSQLDriver.scala:65)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
>  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:65)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:377)
>  at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1(SparkSQLCLIDriver.scala:496)
>  at scala.collection.Iterator.foreach(Iterator.scala:941)
>  at 

[jira] [Assigned] (SPARK-35192) Port minimal TPC-DS datagen code from databricks/spark-sql-perf

2021-04-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35192:


Assignee: Apache Spark

> Port minimal TPC-DS datagen code from databricks/spark-sql-perf
> ---
>
> Key: SPARK-35192
> URL: https://issues.apache.org/jira/browse/SPARK-35192
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.2.0
>Reporter: Takeshi Yamamuro
>Assignee: Apache Spark
>Priority: Minor
>
> This PR aims at porting minimal code to generate TPC-DS data from 
> databricks/spark-sql-perf. The classes in a new class file tpcdsDatagen.scala 
> are basically copied from the databricks/spark-sql-perf codebase.
> We frequently use TPCDS data now for benchmarks/tests, but the classes for 
> the TPCDS schemas of datagen and benchmarks/tests are managed separately, 
> e.g.,
> https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/TPCDSBase.scala
> https://github.com/databricks/spark-sql-perf/blob/master/src/main/scala/com/databricks/spark/sql/perf/tpcds/TPCDSTables.scala
> I think this causes some inconveniences, e.g., we need to update both files 
> in the separate repositories if we update the TPCDS schema #32037. So, it 
> would be useful for the Spark codebase to generate them by referring to the 
> same schema definition.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35026) Support use CUBE/ROLLUP in GROUPING SETS

2021-04-22 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-35026:
---

Assignee: angerszhu

> Support use CUBE/ROLLUP in GROUPING SETS
> 
>
> Key: SPARK-35026
> URL: https://issues.apache.org/jira/browse/SPARK-35026
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Attachments: screenshot-1.png
>
>
> !screenshot-1.png|width=924,height=463!
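Since the description is only a screenshot, here is a small runnable sketch of the syntax this sub-task enables, as I read the title. The table and column names are made up for illustration, and the exact grammar accepted by the merged patch may differ:
{code:java}
import org.apache.spark.sql.SparkSession

object GroupingSetsWithRollupSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("grouping-sets-rollup-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical data, only to give GROUP BY something to chew on.
    Seq(("NY", "sedan", 1), ("NY", "suv", 2), ("SF", "sedan", 3))
      .toDF("city", "car_model", "quantity")
      .createOrReplaceTempView("dealer")

    // ROLLUP (or CUBE) nested inside GROUPING SETS -- the construct named in the title.
    spark.sql(
      """SELECT city, car_model, sum(quantity) AS q
        |FROM dealer
        |GROUP BY GROUPING SETS (ROLLUP(city, car_model))
        |ORDER BY city, car_model
        |""".stripMargin).show()

    spark.stop()
  }
}
{code}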



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35026) Support use CUBE/ROLLUP in GROUPING SETS

2021-04-22 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-35026.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32201
[https://github.com/apache/spark/pull/32201]

> Support use CUBE/ROLLUP in GROUPING SETS
> 
>
> Key: SPARK-35026
> URL: https://issues.apache.org/jira/browse/SPARK-35026
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.2.0
>
> Attachments: screenshot-1.png
>
>
> !screenshot-1.png|width=924,height=463!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35192) Port minimal TPC-DS datagen code from databricks/spark-sql-perf

2021-04-22 Thread Takeshi Yamamuro (Jira)
Takeshi Yamamuro created SPARK-35192:


 Summary: Port minimal TPC-DS datagen code from 
databricks/spark-sql-perf
 Key: SPARK-35192
 URL: https://issues.apache.org/jira/browse/SPARK-35192
 Project: Spark
  Issue Type: Test
  Components: SQL, Tests
Affects Versions: 3.2.0
Reporter: Takeshi Yamamuro


This PR aims at porting minimal code to generate TPC-DS data from 
databricks/spark-sql-perf. The classes in a new class file tpcdsDatagen.scala 
are basically copied from the databricks/spark-sql-perf codebase.

We frequently use TPC-DS data for benchmarks and tests, but the schema definitions used by 
the data generator and by the benchmarks/tests are maintained separately, e.g.,

https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/TPCDSBase.scala
https://github.com/databricks/spark-sql-perf/blob/master/src/main/scala/com/databricks/spark/sql/perf/tpcds/TPCDSTables.scala
I think this causes some inconvenience: we need to update both files in the separate 
repositories whenever the TPC-DS schema changes (e.g., #32037). So it would be useful for 
the Spark codebase to generate the data by referring to the same schema definition.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35183) CombineConcats should call transformAllExpressions

2021-04-22 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-35183:
---

Assignee: Yingyi Bu

> CombineConcats should call transformAllExpressions
> --
>
> Key: SPARK-35183
> URL: https://issues.apache.org/jira/browse/SPARK-35183
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.1.0
>Reporter: Yingyi Bu
>Assignee: Yingyi Bu
>Priority: Minor
>
> {code:java}
> plan transformExpressions { ... }{code}
> only applies the transformation to the node `plan` itself, not to its
> children. We should call transformAllExpressions instead of
> transformExpressions in CombineConcats.
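
For context, a sketch (not taken from the Spark sources) of a query shape where
this matters: the root of the analyzed plan is the Aggregate for count(*), while
the nested concat sits in the Filter below it, so a root-only
transformExpressions pass never visits it, whereas transformAllExpressions
walks every plan node:

{code:sql}
-- Hypothetical table t(a, b, c); the nested concat lives below the plan root.
SELECT count(*)
FROM t
WHERE concat(concat(a, b), c) = 'abc';
{code}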



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35187) Failure on minimal interval literal

2021-04-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35187:


Assignee: Apache Spark

> Failure on minimal interval literal
> ---
>
> Key: SPARK-35187
> URL: https://issues.apache.org/jira/browse/SPARK-35187
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>
> If the sign '-' is inside the interval string, everything works after
> https://github.com/apache/spark/commit/bb5459fb26b9d0d57eadee8b10b7488607eeaeb2:
> {code:sql}
> spark-sql> SELECT INTERVAL '-178956970-8' YEAR TO MONTH;
> -178956970-8
> {code}
> but a sign outside the interval string is not handled properly:
> {code:sql}
> spark-sql> SELECT INTERVAL -'178956970-8' YEAR TO MONTH;
> Error in query:
> Error parsing interval year-month string: integer overflow(line 1, pos 16)
> == SQL ==
> SELECT INTERVAL -'178956970-8' YEAR TO MONTH
> ^^^
> {code}
> The magnitude '178956970-8' equals Int.MaxValue + 1 months, so parsing the
> unsigned string overflows before the outer '-' sign can be applied.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35192) Port minimal TPC-DS datagen code from databricks/spark-sql-perf

2021-04-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35192:


Assignee: (was: Apache Spark)

> Port minimal TPC-DS datagen code from databricks/spark-sql-perf
> ---
>
> Key: SPARK-35192
> URL: https://issues.apache.org/jira/browse/SPARK-35192
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.2.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> This ticket aims at porting the minimal code needed to generate TPC-DS data 
> from databricks/spark-sql-perf. The classes in a new file, tpcdsDatagen.scala, 
> are basically copied from the databricks/spark-sql-perf codebase.
> We now use TPC-DS data frequently for benchmarks/tests, but the classes for 
> the TPC-DS schemas used by datagen and by benchmarks/tests are managed 
> separately, e.g.,
> https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/TPCDSBase.scala
> https://github.com/databricks/spark-sql-perf/blob/master/src/main/scala/com/databricks/spark/sql/perf/tpcds/TPCDSTables.scala
> I think this causes some inconvenience, e.g., we need to update both files 
> in separate repositories whenever we update the TPC-DS schema (#32037). So, 
> it would be useful for the Spark codebase to generate the data by referring 
> to a single schema definition.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35183) CombineConcats should call transformAllExpressions

2021-04-22 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-35183.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32290
[https://github.com/apache/spark/pull/32290]

> CombineConcats should call transformAllExpressions
> --
>
> Key: SPARK-35183
> URL: https://issues.apache.org/jira/browse/SPARK-35183
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.1.0
>Reporter: Yingyi Bu
>Assignee: Yingyi Bu
>Priority: Minor
> Fix For: 3.2.0
>
>
> {code:java}
> plan transformExpressions { ... }{code}
> only applies the transformation to the node `plan` itself, not to its
> children. We should call transformAllExpressions instead of
> transformExpressions in CombineConcats.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35180) Allow to build SparkR with SBT

2021-04-22 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-35180.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Fixed in [https://github.com/apache/spark/pull/32285]

> Allow to build SparkR with SBT
> --
>
> Key: SPARK-35180
> URL: https://issues.apache.org/jira/browse/SPARK-35180
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 3.2.0
>
>
> In the current master, SparkR can be built only with Maven.
> It would be helpful if we could also build it with SBT.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35127) When we switch between different stage-detail pages, the entry item in the newly-opened page may be blank.

2021-04-22 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta resolved SPARK-35127.

Fix Version/s: 3.2.0
   3.1.2
   3.0.3
 Assignee: akiyamaneko
   Resolution: Fixed

This issue was resolved in https://github.com/apache/spark/pull/32223.

> When we switch between different stage-detail pages, the entry item in the 
> newly-opened  page may be blank.
> ---
>
> Key: SPARK-35127
> URL: https://issues.apache.org/jira/browse/SPARK-35127
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.1
>Reporter: akiyamaneko
>Assignee: akiyamaneko
>Priority: Minor
> Fix For: 3.0.3, 3.1.2, 3.2.0
>
> Attachments: first-step.png, second-step.png, stages-list.png
>
>
> Spark: 3.0.1
> Steps to reproduce:
> Assume we currently have two stages as shown below; note that the two stages
> contain different numbers of tasks (attached as stages-list.png).
>  # Click the stage with id 0 to open its stage-detail page, then switch the
> entries item to the *All* option (attached as first-step.png).
>  # Then click the stage with id 1 to open its stage-detail page; the entries
> show as empty (attached as second-step.png).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35090) Extract a field from ANSI interval

2021-04-22 Thread Kent Yao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17328043#comment-17328043
 ] 

Kent Yao commented on SPARK-35090:
--

yes. thanks

> Extract a field from ANSI interval
> --
>
> Key: SPARK-35090
> URL: https://issues.apache.org/jira/browse/SPARK-35090
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Priority: Major
>
> Support field extraction from year-month and day-time intervals. For example:
> {code:sql}
>   > SELECT extract(days FROM INTERVAL '5 0:0:0' day to second);
>5
>   > SELECT extract(seconds FROM INTERVAL '10 11:12:30.001001' DAY TO 
> SECOND);
>30.001001
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35091) Support ANSI intervals by date_part()

2021-04-22 Thread Kent Yao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17328019#comment-17328019
 ] 

Kent Yao commented on SPARK-35091:
--

thanks [~maxgekk] for pinging me. I will take it.

> Support ANSI intervals by date_part()
> -
>
> Key: SPARK-35091
> URL: https://issues.apache.org/jira/browse/SPARK-35091
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Priority: Major
>
> Support year-month and day-time intervals by date_part(). For example:
> {code:sql}
>   > SELECT date_part('days', interval '5 0:0:0' day to second);
>5
>   > SELECT date_part('seconds', interval '10 11:12:30.001001' DAY TO 
> SECOND);
>30.001001
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35090) Extract a field from ANSI interval

2021-04-22 Thread Max Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17327637#comment-17327637
 ] 

Max Gekk commented on SPARK-35090:
--

[~Qin Yao] Would you like to take this?

> Extract a field from ANSI interval
> --
>
> Key: SPARK-35090
> URL: https://issues.apache.org/jira/browse/SPARK-35090
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Priority: Major
>
> Support field extraction from year-month and day-time intervals. For example:
> {code:sql}
>   > SELECT extract(days FROM INTERVAL '5 0:0:0' day to second);
>5
>   > SELECT extract(seconds FROM INTERVAL '10 11:12:30.001001' DAY TO 
> SECOND);
>30.001001
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35191) all columns are read even if column pruning applies when spark3.0 read table written by spark2.2

2021-04-22 Thread xiaoli (Jira)
xiaoli created SPARK-35191:
--

 Summary: all columns are read even if column pruning applies when 
spark3.0 read table written by spark2.2
 Key: SPARK-35191
 URL: https://issues.apache.org/jira/browse/SPARK-35191
 Project: Spark
  Issue Type: Question
  Components: Spark Core
Affects Versions: 3.0.0
 Environment: spark3.0

spark.sql.hive.convertMetastoreOrc=true (default value in spark3.0)

spark.sql.orc.impl=native (default value in spark3.0)
Reporter: xiaoli


Some background first: the Spark version we currently use is 2.2, and we plan 
to migrate to Spark 3.0 in the near future. Before migrating, we ran some 
queries in both Spark 2.2 and Spark 3.0 to check for potential issues. The data 
source tables of these queries are in ORC format, written by Spark 2.2.

I find that even when column pruning applies, Spark 3.0's native ORC reader 
reads all columns.

I then did some remote debugging. The requestedColumnIds method in 
OrcUtils.scala checks whether the field names start with "_col". In my case, 
the field names do start with "_col", e.g. "_col1", "_col2", so pruneCols is 
not done. The code is below:

{code:scala}
if (orcFieldNames.forall(_.startsWith("_col"))) {
  // This is a ORC file written by Hive, no field names in the physical schema, assume the
  // physical schema maps to the data scheme by index.
  assert(orcFieldNames.length <= dataSchema.length, "The given data schema " +
    s"${dataSchema.catalogString} has less fields than the actual ORC physical schema, " +
    "no idea which columns were dropped, fail to read.")
  // for ORC file written by Hive, no field names
  // in the physical schema, there is a need to send the
  // entire dataSchema instead of required schema.
  // So pruneCols is not done in this case
  Some(requiredSchema.fieldNames.map { name =>
    val index = dataSchema.fieldIndex(name)
    if (index < orcFieldNames.length) {
      index
    } else {
      -1
    }
  }, false)
{code}
 
Although the code comment explains the reason, I still do not fully understand 
it. This issue only happens in this case: Spark 3.0 uses the native reader to 
read a table written by Spark 2.2.

 

In other cases, there is no such issue. I did two more tests:

Test 1: use Spark 3.0's Hive reader (with 
spark.sql.hive.convertMetastoreOrc=false and spark.sql.orc.impl=hive) to read 
the same table; it reads only the pruned columns.

Test 2: write a table with Spark 3.0, then read the new table with Spark 3.0's 
native reader; it reads only the pruned columns.

This issue blocks us from using the native reader in Spark 3.0. Does anyone 
know the underlying reason or have a workaround?
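
A quick way to check whether pruning kicked in, as a sketch with hypothetical 
table and column names (not taken from the report above): the ReadSchema shown 
in the EXPLAIN output lists the columns the native ORC scan will actually read.

{code:sql}
-- Hypothetical names; inspect ReadSchema on the FileScan node in the output.
EXPLAIN SELECT col_a FROM orc_table_written_by_spark22;
{code}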



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35091) Support ANSI intervals by date_part()

2021-04-22 Thread Max Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17327644#comment-17327644
 ] 

Max Gekk commented on SPARK-35091:
--

[~Qin Yao] Would you like to take this?

> Support ANSI intervals by date_part()
> -
>
> Key: SPARK-35091
> URL: https://issues.apache.org/jira/browse/SPARK-35091
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Priority: Major
>
> Support year-month and day-time intervals by date_part(). For example:
> {code:sql}
>   > SELECT date_part('days', interval '5 0:0:0' day to second);
>5
>   > SELECT date_part('seconds', interval '10 11:12:30.001001' DAY TO 
> SECOND);
>30.001001
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35190) all columns are read even if column pruning applies when spark3.0 read table written by spark2.2

2021-04-22 Thread xiaoli (Jira)
xiaoli created SPARK-35190:
--

 Summary: all columns are read even if column pruning applies when 
spark3.0 read table written by spark2.2
 Key: SPARK-35190
 URL: https://issues.apache.org/jira/browse/SPARK-35190
 Project: Spark
  Issue Type: Question
  Components: Spark Core
Affects Versions: 3.0.0
 Environment: spark3.0

set spark.sql.hive.convertMetastoreOrc=true (default value in spark3.0)

set spark.sql.orc.impl=native (default value in spark3.0)
Reporter: xiaoli


Some background first: the Spark version we currently use is 2.2, and we plan 
to migrate to Spark 3.0 in the near future. Before migrating, we ran some 
queries in both Spark 2.2 and Spark 3.0 to check for potential issues. The data 
source tables of these queries are in ORC format, written by Spark 2.2.

I find that even when column pruning applies, Spark 3.0's native ORC reader 
reads all columns.

I then did some remote debugging. The requestedColumnIds method in 
OrcUtils.scala checks whether the field names start with "_col". In my case, 
the field names do start with "_col", e.g. "_col1", "_col2", so pruneCols is 
not done. The code is below:

{code:scala}
if (orcFieldNames.forall(_.startsWith("_col"))) {
  // This is a ORC file written by Hive, no field names in the physical schema, assume the
  // physical schema maps to the data scheme by index.
  assert(orcFieldNames.length <= dataSchema.length, "The given data schema " +
    s"${dataSchema.catalogString} has less fields than the actual ORC physical schema, " +
    "no idea which columns were dropped, fail to read.")
  // for ORC file written by Hive, no field names
  // in the physical schema, there is a need to send the
  // entire dataSchema instead of required schema.
  // So pruneCols is not done in this case
  Some(requiredSchema.fieldNames.map { name =>
    val index = dataSchema.fieldIndex(name)
    if (index < orcFieldNames.length) {
      index
    } else {
      -1
    }
  }, false)
{code}
 
Although the code comment explains the reason, I still do not fully understand 
it. This issue only happens in this case: Spark 3.0 uses the native reader to 
read a table written by Spark 2.2.

 

In other cases, there is no such issue. I did two more tests:

Test 1: use Spark 3.0's Hive reader (with 
spark.sql.hive.convertMetastoreOrc=false and spark.sql.orc.impl=hive) to read 
the same table; it reads only the pruned columns.

Test 2: write a table with Spark 3.0, then read the new table with Spark 3.0's 
native reader; it reads only the pruned columns.

This issue blocks us from using the native reader in Spark 3.0. Does anyone 
know the underlying reason or have a workaround?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35189) Disallow implicit conversion for `String_Column - Date_Column`

2021-04-22 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-35189:


Assignee: Apache Spark  (was: Gengliang Wang)

> Disallow implicit conversion for `String_Column - Date_Column`
> --
>
> Key: SPARK-35189
> URL: https://issues.apache.org/jira/browse/SPARK-35189
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>
> In Spark 3.2, `String_Column - Date_Column` will cause an analysis exception
> instead of implicitly converting the string column to Date type. This makes
> the behavior consistent with `String_Column - Timestamp_Column`.
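
A minimal sketch of the behavior change, assuming a hypothetical table
t(s STRING, d DATE); once the implicit conversion is disallowed, the explicit
cast is the intended spelling:

{code:sql}
-- Hypothetical table t(s STRING, d DATE).
-- Spark 3.1: s is implicitly converted to DATE, so this runs.
-- Spark 3.2 (with this change): this fails analysis.
SELECT s - d FROM t;

-- Explicit cast keeps working.
SELECT CAST(s AS DATE) - d FROM t;
{code}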



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


