[jira] [Updated] (SPARK-31924) Create remote shuffle service reference implementation
[ https://issues.apache.org/jira/browse/SPARK-31924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BoYang updated SPARK-31924: --- Description: People in the [Spark Scalability & Reliability Sync Meeting|https://docs.google.com/document/d/1T3y25dOaKWVO0pWd838GeiTeI3DUQJtwy6MKYPLuleg] have discussed remote (disaggregated) shuffle service at length, and plan to build a reference implementation to demonstrate the basic design and pave the way for a future production-grade remote shuffle service. There are already two pull requests that enhance the Spark shuffle metadata API to make it possible to implement a remote shuffle service ([PR 28616|https://github.com/apache/spark/pull/28616], [PR 28618|https://github.com/apache/spark/pull/28618]). Creating a remote shuffle service reference implementation will help validate those shuffle metadata APIs. was: People in [Spark Scalability & Reliability Sync Meeting|[https://docs.google.com/document/d/1T3y25dOaKWVO0pWd838GeiTeI3DUQJtwy6MKYPLuleg]] have discussed a lot about remote (disaggregated) shuffle service, and plan to do a reference implementation to help demonstrate some basic design and pave the way for a future production grade remote shuffle service. There are already two pull requests to enhance Spark shuffle metadata API to make it easy/possible to implement remote shuffle service ([PR 28616|https://github.com/apache/spark/pull/28616], [PR 28618|https://github.com/apache/spark/pull/28618]). Creating a remote shuffle service reference implementation will help to validate those shuffle metadata API.
> Create remote shuffle service reference implementation > -- > > Key: SPARK-31924 > URL: https://issues.apache.org/jira/browse/SPARK-31924 > Project: Spark > Issue Type: New Feature > Components: Shuffle > Affects Versions: 3.0.0 > Reporter: BoYang > Priority: Major > Fix For: 3.0.0 > > > People in the [Spark Scalability & Reliability Sync Meeting|https://docs.google.com/document/d/1T3y25dOaKWVO0pWd838GeiTeI3DUQJtwy6MKYPLuleg] have discussed remote (disaggregated) shuffle service at length, and plan to build a reference implementation to demonstrate the basic design and pave the way for a future production-grade remote shuffle service. > > There are already two pull requests that enhance the Spark shuffle metadata API to make it possible to implement a remote shuffle service ([PR 28616|https://github.com/apache/spark/pull/28616], [PR 28618|https://github.com/apache/spark/pull/28618]). Creating a remote shuffle service reference implementation will help validate those shuffle metadata APIs. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
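As context for what such a reference implementation involves, the core data path of a remote (disaggregated) shuffle can be sketched in a few lines. This is a hypothetical illustration, not the API from the linked PRs: map tasks push each reduce partition's bytes to a shared store keyed by (shuffleId, mapId, reduceId), and reduce tasks fetch every map's block for their partition from that store instead of from executor local disks.

```scala
// Hypothetical sketch of the core idea behind a remote shuffle service.
// All names here are invented for illustration; a TrieMap stands in for
// the remote storage backend (e.g. an object store or shuffle server).
import scala.collection.concurrent.TrieMap

final case class BlockKey(shuffleId: Int, mapId: Long, reduceId: Int)

class RemoteShuffleStore {
  private val blocks = TrieMap.empty[BlockKey, Array[Byte]]

  // Called by a map task for each reduce partition it produced.
  def writeBlock(key: BlockKey, data: Array[Byte]): Unit =
    blocks.put(key, data)

  // Called by a reduce task: gather every map's output for one
  // partition from the shared store, not from the executors.
  def readPartition(shuffleId: Int, reduceId: Int): Seq[Array[Byte]] =
    blocks.collect {
      case (k, v) if k.shuffleId == shuffleId && k.reduceId == reduceId => v
    }.toSeq
}
```

Because map output lives outside the executors, executors can be decommissioned without losing shuffle data, which is the main motivation behind a disaggregated shuffle service; the shuffle metadata API changes in the PRs above are what let Spark track such externally stored blocks.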
[jira] [Created] (SPARK-31924) Create remote shuffle service reference implementation
BoYang created SPARK-31924: -- Summary: Create remote shuffle service reference implementation Key: SPARK-31924 URL: https://issues.apache.org/jira/browse/SPARK-31924 Project: Spark Issue Type: New Feature Components: Shuffle Affects Versions: 3.0.0 Reporter: BoYang Fix For: 3.0.0 People in the [Spark Scalability & Reliability Sync Meeting|https://docs.google.com/document/d/1T3y25dOaKWVO0pWd838GeiTeI3DUQJtwy6MKYPLuleg] have discussed remote (disaggregated) shuffle service at length, and plan to build a reference implementation to demonstrate the basic design and pave the way for a future production-grade remote shuffle service. There are already two pull requests that enhance the Spark shuffle metadata API to make it possible to implement a remote shuffle service ([PR 28616|https://github.com/apache/spark/pull/28616], [PR 28618|https://github.com/apache/spark/pull/28618]). Creating a remote shuffle service reference implementation will help validate those shuffle metadata APIs.
[jira] [Resolved] (SPARK-30119) Support pagination for spark streaming tab
[ https://issues.apache.org/jira/browse/SPARK-30119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta resolved SPARK-30119. Fix Version/s: 3.1.0 Resolution: Fixed This issue is resolved in https://github.com/apache/spark/pull/28439 > Support pagination for spark streaming tab > --- > > Key: SPARK-30119 > URL: https://issues.apache.org/jira/browse/SPARK-30119 > Project: Spark > Issue Type: Improvement > Components: Web UI > Affects Versions: 3.1.0 > Reporter: jobit mathew > Assignee: Rakesh Raushan > Priority: Minor > Fix For: 3.1.0 > > > Support pagination for spark streaming tab
[jira] [Assigned] (SPARK-30119) Support pagination for spark streaming tab
[ https://issues.apache.org/jira/browse/SPARK-30119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta reassigned SPARK-30119: -- Assignee: Rakesh Raushan > Support pagination for spark streaming tab > --- > > Key: SPARK-30119 > URL: https://issues.apache.org/jira/browse/SPARK-30119 > Project: Spark > Issue Type: Improvement > Components: Web UI > Affects Versions: 3.1.0 > Reporter: jobit mathew > Assignee: Rakesh Raushan > Priority: Minor > > Support pagination for spark streaming tab
[jira] [Commented] (SPARK-31923) Event log cannot be generated when some internal accumulators use unexpected types
[ https://issues.apache.org/jira/browse/SPARK-31923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17127444#comment-17127444 ] Apache Spark commented on SPARK-31923: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/28744 > Event log cannot be generated when some internal accumulators use unexpected > types > -- > > Key: SPARK-31923 > URL: https://issues.apache.org/jira/browse/SPARK-31923 > Project: Spark > Issue Type: Bug > Components: Spark Core > Affects Versions: 2.4.6 > Reporter: Shixiong Zhu > Priority: Major > > A user may use internal accumulators by adding the "internal.metrics." prefix > to the accumulator name to hide sensitive information from the UI (accumulators > other than internal ones are shown in the Spark UI). > However, *org.apache.spark.util.JsonProtocol.accumValueToJson* assumes an > internal accumulator has only three possible types: int, long, and > java.util.List[(BlockId, BlockStatus)]. When an internal accumulator uses an > unexpected type, it crashes. > An event that contains such an accumulator is dropped from the event log because it cannot > be converted to JSON, which causes weird UI issues when rendering in the > Spark History Server. For example, if a `SparkListenerTaskEnd` event is dropped > because of this issue, the user will see the task as still running even though it > finished. > It's better to make *accumValueToJson* more robust.
> > How to reproduce it: > - Enable Spark event log > - Run the following command: > {code} > scala> val accu = sc.doubleAccumulator("internal.metrics.foo") > accu: org.apache.spark.util.DoubleAccumulator = DoubleAccumulator(id: 0, > name: Some(internal.metrics.foo), value: 0.0) > scala> sc.parallelize(1 to 1, 1).foreach { _ => accu.add(1.0) } > 20/06/06 16:11:27 ERROR AsyncEventQueue: Listener EventLoggingListener threw > an exception > java.lang.ClassCastException: java.lang.Double cannot be cast to > java.util.List > at > org.apache.spark.util.JsonProtocol$.accumValueToJson(JsonProtocol.scala:330) > at > org.apache.spark.util.JsonProtocol$$anonfun$accumulableInfoToJson$3.apply(JsonProtocol.scala:306) > at > org.apache.spark.util.JsonProtocol$$anonfun$accumulableInfoToJson$3.apply(JsonProtocol.scala:306) > at scala.Option.map(Option.scala:146) > at > org.apache.spark.util.JsonProtocol$.accumulableInfoToJson(JsonProtocol.scala:306) > at > org.apache.spark.util.JsonProtocol$$anonfun$accumulablesToJson$2.apply(JsonProtocol.scala:299) > at > org.apache.spark.util.JsonProtocol$$anonfun$accumulablesToJson$2.apply(JsonProtocol.scala:299) > at scala.collection.immutable.List.map(List.scala:284) > at > org.apache.spark.util.JsonProtocol$.accumulablesToJson(JsonProtocol.scala:299) > at > org.apache.spark.util.JsonProtocol$.taskInfoToJson(JsonProtocol.scala:291) > at > org.apache.spark.util.JsonProtocol$.taskEndToJson(JsonProtocol.scala:145) > at > org.apache.spark.util.JsonProtocol$.sparkEventToJson(JsonProtocol.scala:76) > at > org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:138) > at > org.apache.spark.scheduler.EventLoggingListener.onTaskEnd(EventLoggingListener.scala:158) > at > org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:45) > at > org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) > at > 
org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) > at > org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:91) > at > org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$super$postToAll(AsyncEventQueue.scala:92) > at > org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply$mcJ$sp(AsyncEventQueue.scala:92) > at > org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:87) > at > org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:87) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:87) > at > org.apache.spark.scheduler.AsyncEventQueue$$anon$1$$anonfun$run$1.apply$mcV$sp(AsyncEventQueue.scala:83) > at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1302) > at > org.apache.spark.scheduler.AsyncEventQueue$$anon$1.run(AsyncEventQueue.scala:82) > {code}
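The defensive serialization the report asks for can be sketched as follows. This is a hypothetical standalone function, not the actual JsonProtocol code: instead of assuming one of the three known internal types, it pattern-matches and falls back to a string representation for anything unexpected, so the listener never throws a ClassCastException.

```scala
// Hypothetical sketch (NOT the real JsonProtocol implementation) of a
// defensive accumulator-value serializer. The real accumValueToJson
// handles only Int, Long, and java.util.List[(BlockId, BlockStatus)]
// for internal accumulators and throws on anything else; the idea of
// the fix is to degrade gracefully instead.
def accumValueToJsonString(value: Any): String = value match {
  case i: Int  => i.toString                 // known internal type
  case l: Long => l.toString                 // known internal type
  case list: java.util.List[_] =>
    // The real code serializes (BlockId, BlockStatus) pairs; elided here.
    s"[${list.size} block updates]"
  case other =>
    // Fallback: quote the string form instead of crashing the
    // EventLoggingListener on, e.g., a Double accumulator.
    "\"" + other.toString + "\""
}
```

With this shape, a user-created `internal.metrics.foo` DoubleAccumulator like the one in the repro above would serialize as a quoted string rather than killing the event log.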
[jira] [Assigned] (SPARK-31923) Event log cannot be generated when some internal accumulators use unexpected types
[ https://issues.apache.org/jira/browse/SPARK-31923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31923: Assignee: Apache Spark
[jira] [Assigned] (SPARK-31923) Event log cannot be generated when some internal accumulators use unexpected types
[ https://issues.apache.org/jira/browse/SPARK-31923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31923: Assignee: (was: Apache Spark)
[jira] [Updated] (SPARK-31923) Event log cannot be generated when some internal accumulators use unexpected types
[ https://issues.apache.org/jira/browse/SPARK-31923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-31923: - Description: A user may use internal accumulators by adding the "internal.metrics." prefix to the accumulator name to hide sensitive information from UI (Accumulators except internal ones will be shown in Spark UI). However, *org.apache.spark.util.JsonProtocol.accumValueToJson* assumes an internal accumulator has only 3 possible types: int, long, and java.util.List[(BlockId, BlockStatus)]. When an internal accumulator uses an unexpected type, it will crash. An event log that contains such accumulator will be dropped because it cannot be converted to JSON, and it will cause weird UI issue when rendering in Spark History Server. For example, if `SparkListenerTaskEnd` is dropped because of this issue, the user will see the task is still running even if it was finished. It's better to make *accumValueToJson* more robust. How to reproduce it: - Enable Spark event log - Run the following command: {code} scala> val accu = sc.doubleAccumulator("internal.metrics.foo") accu: org.apache.spark.util.DoubleAccumulator = DoubleAccumulator(id: 0, name: Some(internal.metrics.foo), value: 0.0) scala> sc.parallelize(1 to 1, 1).foreach { _ => accu.add(1.0) } 20/06/06 16:11:27 ERROR AsyncEventQueue: Listener EventLoggingListener threw an exception java.lang.ClassCastException: java.lang.Double cannot be cast to java.util.List at org.apache.spark.util.JsonProtocol$.accumValueToJson(JsonProtocol.scala:330) at org.apache.spark.util.JsonProtocol$$anonfun$accumulableInfoToJson$3.apply(JsonProtocol.scala:306) at org.apache.spark.util.JsonProtocol$$anonfun$accumulableInfoToJson$3.apply(JsonProtocol.scala:306) at scala.Option.map(Option.scala:146) at org.apache.spark.util.JsonProtocol$.accumulableInfoToJson(JsonProtocol.scala:306) at org.apache.spark.util.JsonProtocol$$anonfun$accumulablesToJson$2.apply(JsonProtocol.scala:299) at 
org.apache.spark.util.JsonProtocol$$anonfun$accumulablesToJson$2.apply(JsonProtocol.scala:299) at scala.collection.immutable.List.map(List.scala:284) at org.apache.spark.util.JsonProtocol$.accumulablesToJson(JsonProtocol.scala:299) at org.apache.spark.util.JsonProtocol$.taskInfoToJson(JsonProtocol.scala:291) at org.apache.spark.util.JsonProtocol$.taskEndToJson(JsonProtocol.scala:145) at org.apache.spark.util.JsonProtocol$.sparkEventToJson(JsonProtocol.scala:76) at org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:138) at org.apache.spark.scheduler.EventLoggingListener.onTaskEnd(EventLoggingListener.scala:158) at org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:45) at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:91) at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$super$postToAll(AsyncEventQueue.scala:92) at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply$mcJ$sp(AsyncEventQueue.scala:92) at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:87) at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:87) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:87) at org.apache.spark.scheduler.AsyncEventQueue$$anon$1$$anonfun$run$1.apply$mcV$sp(AsyncEventQueue.scala:83) at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1302) at 
org.apache.spark.scheduler.AsyncEventQueue$$anon$1.run(AsyncEventQueue.scala:82) {code} was: A user may use internal accumulators by adding the "internal.metrics." prefix to the accumulator name to hide sensitive information from UI (Accumulators except internal ones will be shown in Spark UI). However, org.apache.spark.util.JsonProtocol.accumValueToJson assumes an internal accumulator has only 3 possible types: int, long, and java.util.List[(BlockId, BlockStatus)]. When an internal accumulator uses an unexpected type, it will crash. An event log that contains such accumulator will be dropped because it cannot be converted to JSON, and it will cause weird UI issue when rendering in Spark History Server. For example, if `SparkListenerTaskEnd` is dropped
[jira] [Updated] (SPARK-31923) Event log cannot be generated when some internal accumulators use unexpected types
[ https://issues.apache.org/jira/browse/SPARK-31923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shixiong Zhu updated SPARK-31923:
---------------------------------
    Description:

A user may use internal accumulators by adding the "internal.metrics." prefix to the accumulator name to hide sensitive information from the UI (accumulators except internal ones are shown in the Spark UI). However, org.apache.spark.util.JsonProtocol.accumValueToJson assumes an internal accumulator has only 3 possible types: int, long, and java.util.List[(BlockId, BlockStatus)]. When an internal accumulator uses an unexpected type, it will crash. An event that contains such an accumulator will be dropped because it cannot be converted to JSON, and it will cause weird UI issues when rendered in the Spark History Server. For example, if `SparkListenerTaskEnd` is dropped because of this issue, the user will see the task as still running even though it has finished. It's better to make accumValueToJson more robust.

How to reproduce it:
- Enable the Spark event log
- Run the following commands:

{code}
scala> val accu = sc.doubleAccumulator("internal.metrics.foo")
accu: org.apache.spark.util.DoubleAccumulator = DoubleAccumulator(id: 0, name: Some(internal.metrics.foo), value: 0.0)

scala> sc.parallelize(1 to 1, 1).foreach { _ => accu.add(1.0) }
20/06/06 16:11:27 ERROR AsyncEventQueue: Listener EventLoggingListener threw an exception
java.lang.ClassCastException: java.lang.Double cannot be cast to java.util.List
    at org.apache.spark.util.JsonProtocol$.accumValueToJson(JsonProtocol.scala:330)
    at org.apache.spark.util.JsonProtocol$$anonfun$accumulableInfoToJson$3.apply(JsonProtocol.scala:306)
    at org.apache.spark.util.JsonProtocol$$anonfun$accumulableInfoToJson$3.apply(JsonProtocol.scala:306)
    at scala.Option.map(Option.scala:146)
    at org.apache.spark.util.JsonProtocol$.accumulableInfoToJson(JsonProtocol.scala:306)
    at org.apache.spark.util.JsonProtocol$$anonfun$accumulablesToJson$2.apply(JsonProtocol.scala:299)
    at org.apache.spark.util.JsonProtocol$$anonfun$accumulablesToJson$2.apply(JsonProtocol.scala:299)
    at scala.collection.immutable.List.map(List.scala:284)
    at org.apache.spark.util.JsonProtocol$.accumulablesToJson(JsonProtocol.scala:299)
    at org.apache.spark.util.JsonProtocol$.taskInfoToJson(JsonProtocol.scala:291)
    at org.apache.spark.util.JsonProtocol$.taskEndToJson(JsonProtocol.scala:145)
    at org.apache.spark.util.JsonProtocol$.sparkEventToJson(JsonProtocol.scala:76)
    at org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:138)
    at org.apache.spark.scheduler.EventLoggingListener.onTaskEnd(EventLoggingListener.scala:158)
    at org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:45)
    at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
    at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
    at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:91)
    at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$super$postToAll(AsyncEventQueue.scala:92)
    at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply$mcJ$sp(AsyncEventQueue.scala:92)
    at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:87)
    at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:87)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
    at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:87)
    at org.apache.spark.scheduler.AsyncEventQueue$$anon$1$$anonfun$run$1.apply$mcV$sp(AsyncEventQueue.scala:83)
    at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1302)
    at org.apache.spark.scheduler.AsyncEventQueue$$anon$1.run(AsyncEventQueue.scala:82)
{code}

was:
A user may use internal accumulators by adding the "internal.metrics." prefix to the accumulator name to hide sensitive information from UI (Accumulators will be shown in Spark UI by default). However, org.apache.spark.util.JsonProtocol.accumValueToJson assumes an internal accumulator has only 3 possible types: int, long, and java.util.List[(BlockId, BlockStatus)]. When an internal accumulator uses an unexpected type, it will crash. An event log that contains such accumulator will be dropped because it cannot be converted to JSON, and it will cause weird UI issue when rendering in Spark History Server. For example, if `SparkListenerTaskEnd` is dropped because of this is
[jira] [Updated] (SPARK-31923) Event log cannot be generated when some internal accumulators use unexpected types
[ https://issues.apache.org/jira/browse/SPARK-31923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-31923: - Description: A user may use internal accumulators by adding the "internal.metrics." prefix to the accumulator name to hide sensitive information from UI (Accumulators will be shown in Spark UI by default). However, org.apache.spark.util.JsonProtocol.accumValueToJson assumes an internal accumulator has only 3 possible types: int, long, and java.util.List[(BlockId, BlockStatus)]. When an internal accumulator uses an unexpected type, it will crash. An event log that contains such accumulator will not be able to convert to json, and it will cause weird UI issue when rendering in Spark History Server. For example, if `SparkListenerTaskEnd` is dropped because of this issue, the user will see the task is still running even if it was finished. It's better to make accumValueToJson more robust. How to reproduce it: - Enable Spark event log - Run the following command: {code} scala> val accu = sc.doubleAccumulator("internal.metrics.foo") accu: org.apache.spark.util.DoubleAccumulator = DoubleAccumulator(id: 0, name: Some(internal.metrics.foo), value: 0.0) scala> sc.parallelize(1 to 1, 1).foreach { _ => accu.add(1.0) } 20/06/06 16:11:27 ERROR AsyncEventQueue: Listener EventLoggingListener threw an exception java.lang.ClassCastException: java.lang.Double cannot be cast to java.util.List at org.apache.spark.util.JsonProtocol$.accumValueToJson(JsonProtocol.scala:330) at org.apache.spark.util.JsonProtocol$$anonfun$accumulableInfoToJson$3.apply(JsonProtocol.scala:306) at org.apache.spark.util.JsonProtocol$$anonfun$accumulableInfoToJson$3.apply(JsonProtocol.scala:306) at scala.Option.map(Option.scala:146) at org.apache.spark.util.JsonProtocol$.accumulableInfoToJson(JsonProtocol.scala:306) at org.apache.spark.util.JsonProtocol$$anonfun$accumulablesToJson$2.apply(JsonProtocol.scala:299) at 
org.apache.spark.util.JsonProtocol$$anonfun$accumulablesToJson$2.apply(JsonProtocol.scala:299) at scala.collection.immutable.List.map(List.scala:284) at org.apache.spark.util.JsonProtocol$.accumulablesToJson(JsonProtocol.scala:299) at org.apache.spark.util.JsonProtocol$.taskInfoToJson(JsonProtocol.scala:291) at org.apache.spark.util.JsonProtocol$.taskEndToJson(JsonProtocol.scala:145) at org.apache.spark.util.JsonProtocol$.sparkEventToJson(JsonProtocol.scala:76) at org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:138) at org.apache.spark.scheduler.EventLoggingListener.onTaskEnd(EventLoggingListener.scala:158) at org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:45) at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:91) at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$super$postToAll(AsyncEventQueue.scala:92) at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply$mcJ$sp(AsyncEventQueue.scala:92) at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:87) at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:87) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:87) at org.apache.spark.scheduler.AsyncEventQueue$$anon$1$$anonfun$run$1.apply$mcV$sp(AsyncEventQueue.scala:83) at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1302) at 
org.apache.spark.scheduler.AsyncEventQueue$$anon$1.run(AsyncEventQueue.scala:82) {code} was: A user may use internal accumulators by adding the "internal.metrics." prefix to the accumulator name to hide sensitive information from UI (Accumulators will be shown in Spark UI by default). However, org.apache.spark.util.JsonProtocol.accumValueToJson assumes an internal accumulator has only 3 possible types: int, long, and java.util.List[(BlockId, BlockStatus)]. When an internal accumulator uses an unexpected type, it will crash. It's better to make accumValueToJson more robust. How to reproduce it: - Enable Spark event log - Run the following command: {code} scala> val accu = sc.doubleAccumulator("internal.metrics.foo") accu: org.apache.spark.util.DoubleAccumulator = DoubleAccumulator(id: 0, na
[jira] [Created] (SPARK-31923) Event log cannot be generated when some internal accumulators use unexpected types
Shixiong Zhu created SPARK-31923:
---------------------------------

             Summary: Event log cannot be generated when some internal accumulators use unexpected types
                 Key: SPARK-31923
                 URL: https://issues.apache.org/jira/browse/SPARK-31923
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.4.6
            Reporter: Shixiong Zhu

A user may use internal accumulators by adding the "internal.metrics." prefix to the accumulator name to hide sensitive information from UI (Accumulators will be shown in Spark UI by default). However, org.apache.spark.util.JsonProtocol.accumValueToJson assumes an internal accumulator has only 3 possible types: int, long, and java.util.List[(BlockId, BlockStatus)]. When an internal accumulator uses an unexpected type, it will crash. It's better to make accumValueToJson more robust.

How to reproduce it:
- Enable Spark event log
- Run the following command:

{code}
scala> val accu = sc.doubleAccumulator("internal.metrics.foo")
accu: org.apache.spark.util.DoubleAccumulator = DoubleAccumulator(id: 0, name: Some(internal.metrics.foo), value: 0.0)

scala> sc.parallelize(1 to 1, 1).foreach { _ => accu.add(1.0) }
20/06/06 16:11:27 ERROR AsyncEventQueue: Listener EventLoggingListener threw an exception
java.lang.ClassCastException: java.lang.Double cannot be cast to java.util.List
    at org.apache.spark.util.JsonProtocol$.accumValueToJson(JsonProtocol.scala:330)
    at org.apache.spark.util.JsonProtocol$$anonfun$accumulableInfoToJson$3.apply(JsonProtocol.scala:306)
    at org.apache.spark.util.JsonProtocol$$anonfun$accumulableInfoToJson$3.apply(JsonProtocol.scala:306)
    at scala.Option.map(Option.scala:146)
    at org.apache.spark.util.JsonProtocol$.accumulableInfoToJson(JsonProtocol.scala:306)
    at org.apache.spark.util.JsonProtocol$$anonfun$accumulablesToJson$2.apply(JsonProtocol.scala:299)
    at org.apache.spark.util.JsonProtocol$$anonfun$accumulablesToJson$2.apply(JsonProtocol.scala:299)
    at scala.collection.immutable.List.map(List.scala:284)
    at org.apache.spark.util.JsonProtocol$.accumulablesToJson(JsonProtocol.scala:299)
    at org.apache.spark.util.JsonProtocol$.taskInfoToJson(JsonProtocol.scala:291)
    at org.apache.spark.util.JsonProtocol$.taskEndToJson(JsonProtocol.scala:145)
    at org.apache.spark.util.JsonProtocol$.sparkEventToJson(JsonProtocol.scala:76)
    at org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:138)
    at org.apache.spark.scheduler.EventLoggingListener.onTaskEnd(EventLoggingListener.scala:158)
    at org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:45)
    at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
    at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
    at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:91)
    at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$super$postToAll(AsyncEventQueue.scala:92)
    at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply$mcJ$sp(AsyncEventQueue.scala:92)
    at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:87)
    at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:87)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
    at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:87)
    at org.apache.spark.scheduler.AsyncEventQueue$$anon$1$$anonfun$run$1.apply$mcV$sp(AsyncEventQueue.scala:83)
    at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1302)
    at org.apache.spark.scheduler.AsyncEventQueue$$anon$1.run(AsyncEventQueue.scala:82)
{code}

--
This message was sent by Atlassian Jira (v8.3.4#803005)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
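The robustness the report asks for amounts to giving accumValueToJson a fallback branch instead of assuming one of three types. A minimal Python sketch of that idea (the helper name and the fall-back-to-string behavior are illustrative assumptions, not Spark's actual fix):

```python
import json

def accum_value_to_json(name, value):
    """Convert an accumulator value to a JSON-serializable form.

    Hypothetical sketch: "internal.metrics."-prefixed accumulators are
    expected to hold an int/long or a list of block updates; any other
    type falls back to its string representation instead of raising.
    """
    if isinstance(value, (int, list)):
        return value
    # Fallback: never crash the event log on an unexpected type.
    return str(value)

# A double-valued "internal" accumulator no longer breaks serialization.
payload = json.dumps(
    {"internal.metrics.foo": accum_value_to_json("internal.metrics.foo", 1.0)}
)
```

With this shape, a `SparkListenerTaskEnd` event carrying an unexpected accumulator type still serializes, so the event is not dropped from the log.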
[jira] [Created] (SPARK-31918) SparkR CRAN check gives a warning with R 4.0.0
Shivaram Venkataraman created SPARK-31918:
------------------------------------------

             Summary: SparkR CRAN check gives a warning with R 4.0.0
                 Key: SPARK-31918
                 URL: https://issues.apache.org/jira/browse/SPARK-31918
             Project: Spark
          Issue Type: Bug
          Components: SparkR
    Affects Versions: 2.4.6
            Reporter: Shivaram Venkataraman

When the SparkR package is run through a CRAN check (i.e. with something like R CMD check --as-cran ~/Downloads/SparkR_2.4.6.tar.gz), we rebuild the SparkR vignette as a part of the checks. However, this seems to be failing with R 4.0.0 on OSX -- both on my local machine and on CRAN: https://cran.r-project.org/web/checks/check_results_SparkR.html

cc [~felixcheung]
[jira] [Updated] (SPARK-31918) SparkR CRAN check gives a warning with R 4.0.0 on OSX
[ https://issues.apache.org/jira/browse/SPARK-31918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shivaram Venkataraman updated SPARK-31918:
------------------------------------------
    Summary: SparkR CRAN check gives a warning with R 4.0.0 on OSX  (was: SparkR CRAN check gives a warning with R 4.0.0)

> SparkR CRAN check gives a warning with R 4.0.0 on OSX
> -----------------------------------------------------
>
>                 Key: SPARK-31918
>                 URL: https://issues.apache.org/jira/browse/SPARK-31918
>             Project: Spark
>          Issue Type: Bug
>          Components: SparkR
>    Affects Versions: 2.4.6
>            Reporter: Shivaram Venkataraman
>            Priority: Major
>
> When the SparkR package is run through a CRAN check (i.e. with something like R CMD check --as-cran ~/Downloads/SparkR_2.4.6.tar.gz), we rebuild the SparkR vignette as a part of the checks.
> However this seems to be failing with R 4.0.0 on OSX -- both on my local machine and on CRAN:
> https://cran.r-project.org/web/checks/check_results_SparkR.html
> cc [~felixcheung]
[jira] [Created] (SPARK-31919) Push down more predicates through Join
Gengliang Wang created SPARK-31919:
-----------------------------------

             Summary: Push down more predicates through Join
                 Key: SPARK-31919
                 URL: https://issues.apache.org/jira/browse/SPARK-31919
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.1.0
            Reporter: Gengliang Wang
            Assignee: Gengliang Wang

Currently, in `PushPredicateThroughJoin`, if a condition under an `Or` operator can't be entirely pushed down, the whole predicate is thrown away. In fact, predicates under `Or` operators can be partially pushed down. For example, say `a` and `b` can be pushed into one of the joined tables while `c` can't: the predicate `a or (b and c)` can be converted to `(a or b) and (a or c)`, and we can still push down `(a or b)`. We can't push down a disjunctive predicate only when one of its children is not even partially convertible.
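The rewrite described above is just the distributive law of Boolean algebra, so its correctness can be checked exhaustively over all truth assignments. A small sketch (function names are illustrative, not Spark code):

```python
from itertools import product

def original(a, b, c):
    # The predicate as written in the query.
    return a or (b and c)

def rewritten(a, b, c):
    # Distributed form: (a or b) references only the pushable columns
    # and can move below the join; (a or c) stays above it.
    return (a or b) and (a or c)

# The two forms agree on every assignment, so pushing (a or b) is safe.
assert all(original(a, b, c) == rewritten(a, b, c)
           for a, b, c in product([False, True], repeat=3))
```

Because the conjuncts of the rewritten form are individually weaker conditions, pushing `(a or b)` alone can only filter rows that the full predicate would also reject.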
[jira] [Commented] (SPARK-28594) Allow event logs for running streaming apps to be rolled over
[ https://issues.apache.org/jira/browse/SPARK-28594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17127119#comment-17127119 ]

Shuai Lu commented on SPARK-28594:
----------------------------------
Hi [~kabhwan], are we planning to support this feature in Spark 2.4? It has been an issue with Spark 2.4 as well.

> Allow event logs for running streaming apps to be rolled over
> -------------------------------------------------------------
>
>                 Key: SPARK-28594
>                 URL: https://issues.apache.org/jira/browse/SPARK-28594
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.0.0
>            Reporter: Stephen Levett
>            Assignee: Jungtaek Lim
>            Priority: Major
>              Labels: releasenotes
>             Fix For: 3.0.0
>
> At all current Spark releases, when event logging is enabled for Spark Streaming apps the event logs grow massively. The files continue to grow until the application is stopped or killed.
> The Spark history server then has difficulty processing the files.
> https://issues.apache.org/jira/browse/SPARK-8617 addresses .inprogress files but not event log files that are still running.
> Identify a mechanism to set a "max file" size so that the file is rolled over when it reaches this size.
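The "max file size" rollover mechanism requested above can be sketched generically. This is not Spark's implementation (the fix shipped in 3.0.0 as rolling event logs); it is a minimal illustration of size-based rollover with hypothetical names:

```python
import os

class RollingWriter:
    """Minimal sketch of size-based log rollover (not Spark's code).

    Writes to <base>.0, <base>.1, ... and starts a new file once the
    current one would exceed max_bytes, so no single event log file
    grows without bound while the app keeps running.
    """
    def __init__(self, base, max_bytes):
        self.base = base
        self.max_bytes = max_bytes
        self.index = 0        # suffix of the current file
        self.written = 0      # bytes written to the current file

    def current_path(self):
        return f"{self.base}.{self.index}"

    def write(self, line):
        # Roll over before the current file would exceed the cap.
        if self.written > 0 and self.written + len(line) > self.max_bytes:
            self.index += 1
            self.written = 0
        with open(self.current_path(), "a") as f:
            f.write(line)
        self.written += len(line)
```

Old segments (`<base>.0`, `<base>.1`, ...) can then be compacted or deleted independently of the still-growing current segment, which is what keeps the history server's work bounded.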
[jira] [Updated] (SPARK-31922) TransportRequestHandler Error when exit spark-shell with local-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-31922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

wuyi updated SPARK-31922:
-------------------------
    Affects Version/s: 2.4.6

> TransportRequestHandler Error when exit spark-shell with local-cluster mode
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-31922
>                 URL: https://issues.apache.org/jira/browse/SPARK-31922
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.6, 3.0.0
>            Reporter: wuyi
>            Priority: Major
>
> There's always an error from TransportRequestHandler when exiting spark-shell under local-cluster mode:
>
> {code:java}
> 20/06/06 23:08:29 ERROR TransportRequestHandler: Error while invoking RpcHandler#receive() for one-way message.
> 20/06/06 23:08:29 ERROR TransportRequestHandler: Error while invoking RpcHandler#receive() for one-way message.
> org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already stopped.
>     at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:167)
>     at org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:150)
>     at org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:691)
>     at org.apache.spark.network.server.TransportRequestHandler.processOneWayMessage(TransportRequestHandler.java:253)
>     at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:111)
>     at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:140)
>     at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53)
>     at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
>     at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>     at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>     at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
>     at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
>     at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>     at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>     at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
>     at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
>     at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>     at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>     at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
>     at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:102)
>     at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>     at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>     at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
>     at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
>     at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>     at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>     at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
>     at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
>     at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714)
>     at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
>     at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)
>     at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
>     at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>     at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>     at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>     at java.lang.Thread.run(Thread.java:748)
> 20/06/06 23:08:29 ERROR TransportRequestHandler: Error while invoking RpcHandler#receive() for one-way message.
> org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already stopped.
>     at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:167
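The error above is a shutdown race: a one-way RPC message arrives after the RpcEnv has already stopped. A toy sketch of the pattern that tolerates such late messages instead of surfacing them as errors (the class and method names are hypothetical illustrations, not Spark's code):

```python
import threading

class RpcEnvStoppedError(Exception):
    """Stand-in for org.apache.spark.rpc.RpcEnvStoppedException."""

class Dispatcher:
    """Toy dispatcher illustrating the shutdown race in this report.

    post() raises once the env is stopped; post_quietly() downgrades
    that to a no-op, which is how a late one-way message during
    shutdown can be swallowed instead of logged as an ERROR.
    """
    def __init__(self):
        self._stopped = False
        self._lock = threading.Lock()
        self.delivered = []

    def stop(self):
        with self._lock:
            self._stopped = True

    def post(self, message):
        with self._lock:
            if self._stopped:
                raise RpcEnvStoppedError("RpcEnv already stopped.")
            self.delivered.append(message)

    def post_quietly(self, message):
        # Benign during shutdown: drop the message silently.
        try:
            self.post(message)
        except RpcEnvStoppedError:
            pass
```

The design question the issue raises is exactly which path one-way messages should take during teardown: the strict `post` (noisy ERROR logs on exit) or the lenient `post_quietly` (silent drop once stopped).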
[jira] [Created] (SPARK-31920) Failure in converting pandas DataFrames with columns via Arrow
Stephen Caraher created SPARK-31920:
------------------------------------

             Summary: Failure in converting pandas DataFrames with columns via Arrow
                 Key: SPARK-31920
                 URL: https://issues.apache.org/jira/browse/SPARK-31920
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 2.4.5, 3.0.0, 3.1.0
         Environment: pandas: 1.0.0 - 1.0.4
                      pyarrow: 0.15.1 - 0.17.1
            Reporter: Stephen Caraher

When calling {{createDataFrame}} on a pandas DataFrame in which any of the columns are backed by an array implementing {{\_\_arrow_array\_\_}} ({{StringArray}}, {{IntegerArray}}, etc), the conversion will fail.

With pyarrow >= 0.17.0, the following exception occurs:

{noformat}
Traceback (most recent call last):
  File "/Users/stephen/Documents/github/spark/python/pyspark/sql/tests/test_arrow.py", line 470, in test_createDataFrame_from_integer_extension_dtype
    df_from_integer_ext_dtype = self.spark.createDataFrame(pdf_integer_ext_dtype)
  File "/Users/stephen/Documents/github/spark/python/pyspark/sql/session.py", line 601, in createDataFrame
    data, schema, samplingRatio, verifySchema)
  File "/Users/stephen/Documents/github/spark/python/pyspark/sql/pandas/conversion.py", line 277, in createDataFrame
    return self._create_from_pandas_with_arrow(data, schema, timezone)
  File "/Users/stephen/Documents/github/spark/python/pyspark/sql/pandas/conversion.py", line 435, in _create_from_pandas_with_arrow
    jrdd = self._sc._serialize_to_jvm(arrow_data, ser, reader_func, create_RDD_server)
  File "/Users/stephen/Documents/github/spark/python/pyspark/context.py", line 570, in _serialize_to_jvm
    serializer.dump_stream(data, tempFile)
  File "/Users/stephen/Documents/github/spark/python/pyspark/sql/pandas/serializers.py", line 204, in dump_stream
    super(ArrowStreamPandasSerializer, self).dump_stream(batches, stream)
  File "/Users/stephen/Documents/github/spark/python/pyspark/sql/pandas/serializers.py", line 88, in dump_stream
    for batch in iterator:
  File "/Users/stephen/Documents/github/spark/python/pyspark/sql/pandas/serializers.py", line 203, in <genexpr>
    batches = (self._create_batch(series) for series in iterator)
  File "/Users/stephen/Documents/github/spark/python/pyspark/sql/pandas/serializers.py", line 194, in _create_batch
    arrs.append(create_array(s, t))
  File "/Users/stephen/Documents/github/spark/python/pyspark/sql/pandas/serializers.py", line 161, in create_array
    array = pa.Array.from_pandas(s, mask=mask, type=t, safe=self._safecheck)
  File "pyarrow/array.pxi", line 805, in pyarrow.lib.Array.from_pandas
  File "pyarrow/array.pxi", line 215, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 104, in pyarrow.lib._handle_arrow_array_protocol
ValueError: Cannot specify a mask or a size when passing an object that is converted with the __arrow_array__ protocol.
{noformat}

With pyarrow < 0.17.0, the conversion will fail earlier in the process, during schema extraction:

{noformat}
  File "/Users/stephen/Documents/github/spark/python/pyspark/sql/tests/test_arrow.py", line 470, in test_createDataFrame_from_integer_extension_dtype
    df_from_integer_ext_dtype = self.spark.createDataFrame(pdf_integer_ext_dtype)
  File "/Users/stephen/Documents/github/spark/python/pyspark/sql/session.py", line 601, in createDataFrame
    data, schema, samplingRatio, verifySchema)
  File "/Users/stephen/Documents/github/spark/python/pyspark/sql/pandas/conversion.py", line 277, in createDataFrame
    return self._create_from_pandas_with_arrow(data, schema, timezone)
  File "/Users/stephen/Documents/github/spark/python/pyspark/sql/pandas/conversion.py", line 397, in _create_from_pandas_with_arrow
    arrow_schema = pa.Schema.from_pandas(pdf, preserve_index=False)
  File "pyarrow/types.pxi", line 1078, in pyarrow.lib.Schema.from_pandas
  File "/Users/stephen/opt/miniconda3/envs/spark-dev/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 519, in dataframe_to_types
    type_ = pa.lib._ndarray_to_arrow_type(values, type_)
  File "pyarrow/array.pxi", line 53, in pyarrow.lib._ndarray_to_arrow_type
  File "pyarrow/array.pxi", line 64, in pyarrow.lib._ndarray_to_type
  File "pyarrow/error.pxi", line 107, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Did not pass numpy.dtype object
{noformat}
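The pyarrow >= 0.17.0 failure occurs because `pa.Array.from_pandas` is handed a `mask` for an object that converts itself via the `__arrow_array__` protocol, which pyarrow rejects. A minimal sketch of the branching a serializer could do, using a stand-in class so it runs without pyarrow (all names here are hypothetical illustrations, not PySpark's actual fix):

```python
class ProtocolBackedColumn:
    """Stand-in for a pandas extension array (e.g. IntegerArray) that
    implements the __arrow_array__ protocol."""
    def __init__(self, values):
        self.values = values

    def __arrow_array__(self, type=None):
        # A real extension array returns a pyarrow.Array here; a plain
        # list keeps this sketch dependency-free.
        return list(self.values)

def to_arrow(column, mask=None, type=None):
    """Hypothetical create_array helper: when the column defines
    __arrow_array__, delegate to the protocol and do NOT pass a mask,
    which is exactly what pyarrow >= 0.17 rejects (the ValueError in
    the traceback above)."""
    if hasattr(column, "__arrow_array__"):
        return column.__arrow_array__(type=type)
    # Fallback path for plain numpy-backed columns (sketched here as a
    # list comprehension that drops masked-out entries).
    return [v for i, v in enumerate(column) if mask is None or not mask[i]]
```

The design point is that extension arrays own their null handling, so the conversion layer must not impose an external mask on them.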
[jira] [Commented] (SPARK-29640) [K8S] Intermittent "java.net.UnknownHostException: kubernetes.default.svc" in Spark driver
[ https://issues.apache.org/jira/browse/SPARK-29640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17127234#comment-17127234 ] Prabhakar commented on SPARK-29640: --- Is there a way to explicitly configure the Kube API server Url? if so, specifying the complete DNS name e.g. kubernetes.default.svc.cluster.local might help > [K8S] Intermittent "java.net.UnknownHostException: kubernetes.default.svc" in > Spark driver > -- > > Key: SPARK-29640 > URL: https://issues.apache.org/jira/browse/SPARK-29640 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Spark Core >Affects Versions: 2.4.4 >Reporter: Andy Grove >Priority: Major > > We are running into intermittent DNS issues where the Spark driver fails to > resolve "kubernetes.default.svc" when trying to create executors. We are > running Spark 2.4.4 (with the patch for SPARK-28921) in cluster mode in EKS. > This happens approximately 10% of the time. > Here is the stack trace: > {code:java} > Exception in thread "main" org.apache.spark.SparkException: External > scheduler cannot be instantiated > at > org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2794) > at org.apache.spark.SparkContext.(SparkContext.scala:493) > at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2520) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:935) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:926) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:926) > at com.rms.execution.test.SparkPiTask$.main(SparkPiTask.scala:36) > at com.rms.execution.test.SparkPiTask.main(SparkPiTask.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) > at > org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845) > at > org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161) > at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184) > at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86) > at > org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Operation: > [get] for kind: [Pod] with name: > [wf-5-69674f15d0fc45-1571354060179-driver] in namespace: > [tenant-8-workflows] failed. > at > io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64) > at > io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:72) > at > io.fabric8.kubernetes.client.dsl.base.BaseOperation.getMandatory(BaseOperation.java:229) > at > io.fabric8.kubernetes.client.dsl.base.BaseOperation.get(BaseOperation.java:162) > at > org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator$$anonfun$1.apply(ExecutorPodsAllocator.scala:57) > at > org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator$$anonfun$1.apply(ExecutorPodsAllocator.scala:55) > at scala.Option.map(Option.scala:146) > at > org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.(ExecutorPodsAllocator.scala:55) > at > org.apache.spark.scheduler.cluster.k8s.KubernetesClusterManager.createSchedulerBackend(KubernetesClusterManager.scala:89) > at > 
org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2788) > ... 20 more > Caused by: java.net.UnknownHostException: kubernetes.default.svc: Try again > at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method) > at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:929) > at > java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1324) > at java.net.InetAddress.getAllByName0(InetAddress.java:1277) > at java.net.InetAddress.getAllByName(InetAddress.java:1193) > at java.net.InetAddress.getAllByName(InetAddress.java:1127) > at okhttp3.Dns$1.lookup(Dns.java:39) > at > okh
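The mitigation suggested in the comment above — using the fully qualified service name so resolution does not depend on the resolver's DNS search path — can be sketched as follows. This is a hedged illustration only: the helper name, default cluster domain, and port are assumptions for the example, not Spark configuration.

```python
# Hedged sketch (all names here are assumptions for illustration):
# build a fully qualified Kubernetes API server URL so the driver's
# resolver does not rely on search-path expansion of
# "kubernetes.default.svc", which is what intermittently fails above.
def qualified_master_url(service="kubernetes.default.svc",
                         cluster_domain="cluster.local",
                         port=443):
    # Append the cluster domain only if the name is not already qualified.
    if service.endswith(cluster_domain):
        host = service
    else:
        host = f"{service}.{cluster_domain}"
    return f"k8s://https://{host}:{port}"

print(qualified_master_url())
# k8s://https://kubernetes.default.svc.cluster.local:443
```

Whether Spark/fabric8 exposes a knob to override the in-cluster API server host depends on the client version; the sketch only shows the URL shape the comment proposes passing explicitly.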
[jira] [Updated] (SPARK-31920) Failure in converting pandas DataFrames with Arrow when columns implement __arrow_array__
[ https://issues.apache.org/jira/browse/SPARK-31920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephen Caraher updated SPARK-31920: Summary: Failure in converting pandas DataFrames with Arrow when columns implement __arrow_array__ (was: Failure in converting pandas DataFrames with columns via Arrow) > Failure in converting pandas DataFrames with Arrow when columns implement > __arrow_array__ > - > > Key: SPARK-31920 > URL: https://issues.apache.org/jira/browse/SPARK-31920 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.5, 3.0.0, 3.1.0 > Environment: pandas: 1.0.0 - 1.0.4 > pyarrow: 0.15.1 - 0.17.1 >Reporter: Stephen Caraher >Priority: Major > > When calling {{createDataFrame}} on a pandas DataFrame in which any of the > columns are backed by an array implementing {{\_\_arrow_array\_\_}} > ({{StringArray}}, {{IntegerArray}}, etc), the conversion will fail. > With pyarrow >= 0.17.0, the following exception occurs: > {noformat} > Traceback (most recent call last): > File > "/Users/stephen/Documents/github/spark/python/pyspark/sql/tests/test_arrow.py", > line 470, in test_createDataFrame_from_integer_extension_dtype > df_from_integer_ext_dtype = > self.spark.createDataFrame(pdf_integer_ext_dtype) > File "/Users/stephen/Documents/github/spark/python/pyspark/sql/session.py", > line 601, in createDataFrame > data, schema, samplingRatio, verifySchema) > File > "/Users/stephen/Documents/github/spark/python/pyspark/sql/pandas/conversion.py", > line 277, in createDataFrame > return self._create_from_pandas_with_arrow(data, schema, timezone) > File > "/Users/stephen/Documents/github/spark/python/pyspark/sql/pandas/conversion.py", > line 435, in _create_from_pandas_with_arrow > jrdd = self._sc._serialize_to_jvm(arrow_data, ser, reader_func, > create_RDD_server) > File "/Users/stephen/Documents/github/spark/python/pyspark/context.py", > line 570, in _serialize_to_jvm > serializer.dump_stream(data, tempFile) > File > 
"/Users/stephen/Documents/github/spark/python/pyspark/sql/pandas/serializers.py", > line 204, in dump_stream > super(ArrowStreamPandasSerializer, self).dump_stream(batches, stream) > File > "/Users/stephen/Documents/github/spark/python/pyspark/sql/pandas/serializers.py", > line 88, in dump_stream > for batch in iterator: > File > "/Users/stephen/Documents/github/spark/python/pyspark/sql/pandas/serializers.py", > line 203, in > batches = (self._create_batch(series) for series in iterator) > File > "/Users/stephen/Documents/github/spark/python/pyspark/sql/pandas/serializers.py", > line 194, in _create_batch > arrs.append(create_array(s, t)) > File > "/Users/stephen/Documents/github/spark/python/pyspark/sql/pandas/serializers.py", > line 161, in create_array > array = pa.Array.from_pandas(s, mask=mask, type=t, safe=self._safecheck) > File "pyarrow/array.pxi", line 805, in pyarrow.lib.Array.from_pandas > File "pyarrow/array.pxi", line 215, in pyarrow.lib.array > File "pyarrow/array.pxi", line 104, in > pyarrow.lib._handle_arrow_array_protocol > ValueError: Cannot specify a mask or a size when passing an object that is > converted with the __arrow_array__ protocol. 
> {noformat} > With pyarrow < 0.17.0, the conversion will fail earlier in the process, > during schema extraction: > {noformat} > File > "/Users/stephen/Documents/github/spark/python/pyspark/sql/tests/test_arrow.py", > line 470, in test_createDataFrame_from_integer_extension_dtype > df_from_integer_ext_dtype = > self.spark.createDataFrame(pdf_integer_ext_dtype) > File "/Users/stephen/Documents/github/spark/python/pyspark/sql/session.py", > line 601, in createDataFrame > data, schema, samplingRatio, verifySchema) > File > "/Users/stephen/Documents/github/spark/python/pyspark/sql/pandas/conversion.py", > line 277, in createDataFrame > return self._create_from_pandas_with_arrow(data, schema, timezone) > File > "/Users/stephen/Documents/github/spark/python/pyspark/sql/pandas/conversion.py", > line 397, in _create_from_pandas_with_arrow > arrow_schema = pa.Schema.from_pandas(pdf, preserve_index=False) > File "pyarrow/types.pxi", line 1078, in pyarrow.lib.Schema.from_pandas > File > "/Users/stephen/opt/miniconda3/envs/spark-dev/lib/python3.7/site-packages/pyarrow/pandas_compat.py", > line 519, in dataframe_to_types > type_ = pa.lib._ndarray_to_arrow_type(values, type_) > File "pyarrow/array.pxi", line 53, in pyarrow.lib._ndarray_to_arrow_type > File "pyarrow/array.pxi", line 64, in pyarrow.lib._ndarray_to_type > File "pyarrow/error.pxi", line 107, in pyarrow.lib.check_status > pyarrow.lib.ArrowTypeError: Did not pass numpy.dtyp
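The pyarrow >= 0.17.0 failure mode described above can be modeled in plain Python. This is a simplified toy model, not pyarrow's actual code: when a column object implements the `__arrow_array__` protocol, the conversion takes the protocol path and rejects an explicit `mask` argument, which Spark's serializer always supplies.

```python
# Toy model (assumption: simplified stand-in, NOT pyarrow's real
# implementation) of why pa.Array.from_pandas fails when the column
# implements the __arrow_array__ protocol and a mask is also passed.
class ExtensionColumn:
    """Stands in for a pandas extension array (StringArray, IntegerArray,
    etc.) that implements the __arrow_array__ protocol."""
    def __arrow_array__(self, type=None):
        return "converted-by-protocol"

def from_pandas(values, mask=None, type=None):
    # With pyarrow >= 0.17, the protocol path is taken first and an
    # explicit mask is rejected -- the ValueError seen in the traceback.
    if hasattr(values, "__arrow_array__"):
        if mask is not None:
            raise ValueError(
                "Cannot specify a mask or a size when passing an object "
                "that is converted with the __arrow_array__ protocol.")
        return values.__arrow_array__(type=type)
    return list(values)

try:
    from_pandas(ExtensionColumn(), mask=[False])
except ValueError as e:
    print("reproduced:", e)
```

The fix direction implied by the issue is for Spark to skip passing `mask` (or the protocol path) for such columns; the toy model only illustrates the conflict.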
[jira] [Created] (SPARK-31922) TransportRequestHandler Error when exit spark-shell with local-cluster mode
wuyi created SPARK-31922: Summary: TransportRequestHandler Error when exit spark-shell with local-cluster mode Key: SPARK-31922 URL: https://issues.apache.org/jira/browse/SPARK-31922 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.0 Reporter: wuyi There's always an error from TransportRequestHandler when exiting spark-shell under local-cluster mode: {code:java} 20/06/06 23:08:29 ERROR TransportRequestHandler: Error while invoking RpcHandler#receive() for one-way message.20/06/06 23:08:29 ERROR TransportRequestHandler: Error while invoking RpcHandler#receive() for one-way message.org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already stopped. at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:167) at org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:150) at org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:691) at org.apache.spark.network.server.TransportRequestHandler.processOneWayMessage(TransportRequestHandler.java:253) at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:111) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:140) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53) at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:102) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) at 
io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) at java.lang.Thread.run(Thread.java:748)20/06/06 23:08:29 ERROR TransportRequestHandler: Error while invoking RpcHandler#receive() for one-way message.org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already stopped. at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:167) at org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:150) at org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:691) at org.apache.spark.network.server.TransportRequestHandler.processOneWayMessage(TransportRequestHandler.java:253) at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:111) at org.apache.spark.network.server.T
[jira] [Commented] (SPARK-31922) TransportRequestHandler Error when exit spark-shell with local-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-31922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17127370#comment-17127370 ] wuyi commented on SPARK-31922: -- I am working on this. > TransportRequestHandler Error when exit spark-shell with local-cluster mode > --- > > Key: SPARK-31922 > URL: https://issues.apache.org/jira/browse/SPARK-31922 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: wuyi >Priority: Major
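The RpcEnvStoppedException in SPARK-31922 is a shutdown race: a one-way RPC message arrives after the shell's RpcEnv has already been stopped. A toy model of that race, purely illustrative and not Spark's Dispatcher code, looks like this:

```python
# Toy model (assumption: heavily simplified, NOT Spark's Dispatcher):
# the error above occurs when a one-way message is posted after stop().
class Dispatcher:
    def __init__(self):
        self.stopped = False
        self.inbox = []

    def stop(self):
        self.stopped = True

    def post_one_way_message(self, msg):
        # Spark raises RpcEnvStoppedException here; modeled as RuntimeError.
        if self.stopped:
            raise RuntimeError("RpcEnv already stopped.")
        self.inbox.append(msg)

d = Dispatcher()
d.post_one_way_message("worker-heartbeat")  # accepted while running
d.stop()                                    # spark-shell exits
try:
    d.post_one_way_message("late-message")  # in-flight RPC arrives late
except RuntimeError as e:
    print("logged as ERROR:", e)            # harmless but noisy
```

This suggests why the error is cosmetic: the message simply loses the race with shutdown, so suppressing or downgrading the log is the likely fix direction.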
[jira] [Resolved] (SPARK-31904) Char and varchar partition columns throw MetaException
[ https://issues.apache.org/jira/browse/SPARK-31904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-31904. -- Fix Version/s: 3.0.0 Assignee: Lantao Jin Resolution: Fixed Resolved by [https://github.com/apache/spark/pull/28724] > Char and varchar partition columns throw MetaException > -- > > Key: SPARK-31904 > URL: https://issues.apache.org/jira/browse/SPARK-31904 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Lantao Jin >Assignee: Lantao Jin >Priority: Major > Fix For: 3.0.0 > > > {code} > CREATE TABLE t1(a STRING, B VARCHAR(10), C CHAR(10)) STORED AS parquet; > CREATE TABLE t2 USING parquet PARTITIONED BY (b, c) AS SELECT * FROM t1; > SELECT * FROM t2 WHERE b = 'A'; > {code} > Above SQL throws MetaException > {quote} > Caused by: java.lang.reflect.InvocationTargetException > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:810) > ... 
114 more > Caused by: MetaException(message:Filtering is supported only on partition > keys of type string, or integral types) > at > org.apache.hadoop.hive.metastore.parser.ExpressionTree$FilterBuilder.setError(ExpressionTree.java:184) > at > org.apache.hadoop.hive.metastore.parser.ExpressionTree$LeafNode.getJdoFilterPushdownParam(ExpressionTree.java:439) > at > org.apache.hadoop.hive.metastore.parser.ExpressionTree$LeafNode.generateJDOFilterOverPartitions(ExpressionTree.java:356) > at > org.apache.hadoop.hive.metastore.parser.ExpressionTree$LeafNode.generateJDOFilter(ExpressionTree.java:278) > at > org.apache.hadoop.hive.metastore.parser.ExpressionTree.generateJDOFilterFragment(ExpressionTree.java:583) > at > org.apache.hadoop.hive.metastore.ObjectStore.makeQueryFilterString(ObjectStore.java:3315) > at > org.apache.hadoop.hive.metastore.ObjectStore.getPartitionsViaOrmFilter(ObjectStore.java:2768) > at > org.apache.hadoop.hive.metastore.ObjectStore.access$500(ObjectStore.java:182) > at > org.apache.hadoop.hive.metastore.ObjectStore$7.getJdoResult(ObjectStore.java:3248) > at > org.apache.hadoop.hive.metastore.ObjectStore$7.getJdoResult(ObjectStore.java:3232) > at > org.apache.hadoop.hive.metastore.ObjectStore$GetHelper.run(ObjectStore.java:2974) > at > org.apache.hadoop.hive.metastore.ObjectStore.getPartitionsByFilterInternal(ObjectStore.java:3250) > at > org.apache.hadoop.hive.metastore.ObjectStore.getPartitionsByFilter(ObjectStore.java:2906) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.hive.metastore.RawStoreProxy.invoke(RawStoreProxy.java:101) > at com.sun.proxy.$Proxy25.getPartitionsByFilter(Unknown Source) > at > 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_partitions_by_filter(HiveMetaStore.java:5093) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.hive.metastore.RetryingHMSHandler.invokeInternal(RetryingHMSHandler.java:148) > at > org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:107) > at com.sun.proxy.$Proxy26.get_partitions_by_filter(Unknown Source) > at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient.listPartitionsByFilter(HiveMetaStoreClient.java:1232) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:173) > at com.sun.proxy.$Proxy27.listPartitionsByFilter(Unknown Source) > at > org.apache.hadoop.hiv
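The MetaException above states the constraint directly: the metastore's partition-filter pushdown only supports partition keys of string or integral types, so VARCHAR/CHAR partition columns (as in table t2's `b` and `c`) are rejected. A minimal sketch of that type check, a simplified model rather than Hive's actual ExpressionTree code:

```python
# Simplified model (assumption: not Hive's real implementation) of the
# metastore rule "Filtering is supported only on partition keys of type
# string, or integral types" that makes t2's varchar/char keys fail.
PUSHDOWN_TYPES = {"string", "tinyint", "smallint", "int", "bigint"}

def can_push_filter(partition_key_type: str) -> bool:
    return partition_key_type.lower() in PUSHDOWN_TYPES

# Partition columns from the repro: b VARCHAR(10), c CHAR(10)
for col, typ in [("b", "varchar(10)"), ("c", "char(10)")]:
    print(col, typ, "pushdown ok" if can_push_filter(typ) else "MetaException")
```

The fix in the linked PR works around this on the Spark side (e.g. by not pushing such predicates down as-is); the sketch only captures the metastore-side restriction.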
[jira] [Commented] (SPARK-31921) Wrong warning of "WARN Master: App app-xxx requires more resource than any of Workers could have."
[ https://issues.apache.org/jira/browse/SPARK-31921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17127358#comment-17127358 ] Apache Spark commented on SPARK-31921: -- User 'Ngone51' has created a pull request for this issue: https://github.com/apache/spark/pull/28742 > Wrong warning of "WARN Master: App app-xxx requires more resource than any of > Workers could have." > -- > > Key: SPARK-31921 > URL: https://issues.apache.org/jira/browse/SPARK-31921 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: wuyi >Priority: Major > > When starting spark-shell using local cluster mode, e.g. ./bin/spark-shell > --master "local-cluster[2, 1, 1024]", there will be a warning: > > {code:java} > 20/06/06 22:09:09 WARN Master: App app-20200606220908- requires more > resource than any of Workers could have. > {code} > which means the application can not get enough resources to launch at least > one executor. > But that's not true since we can successfully complete a job. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31920) Failure in converting pandas DataFrames with Arrow when columns implement __arrow_array__
[ https://issues.apache.org/jira/browse/SPARK-31920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31920: Assignee: Apache Spark > Failure in converting pandas DataFrames with Arrow when columns implement > __arrow_array__ > - > > Key: SPARK-31920 > URL: https://issues.apache.org/jira/browse/SPARK-31920 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.5, 3.0.0, 3.1.0 > Environment: pandas: 1.0.0 - 1.0.4 > pyarrow: 0.15.1 - 0.17.1 >Reporter: Stephen Caraher >Assignee: Apache Spark >Priority: Major
[jira] [Commented] (SPARK-31920) Failure in converting pandas DataFrames with Arrow when columns implement __arrow_array__
[ https://issues.apache.org/jira/browse/SPARK-31920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17127366#comment-17127366 ] Apache Spark commented on SPARK-31920: -- User 'moskvax' has created a pull request for this issue: https://github.com/apache/spark/pull/28743 > Failure in converting pandas DataFrames with Arrow when columns implement > __arrow_array__ > - > > Key: SPARK-31920 > URL: https://issues.apache.org/jira/browse/SPARK-31920 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.5, 3.0.0, 3.1.0 > Environment: pandas: 1.0.0 - 1.0.4 > pyarrow: 0.15.1 - 0.17.1 >Reporter: Stephen Caraher >Priority: Major
[jira] [Assigned] (SPARK-31920) Failure in converting pandas DataFrames with Arrow when columns implement __arrow_array__
[ https://issues.apache.org/jira/browse/SPARK-31920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31920: Assignee: (was: Apache Spark) > Failure in converting pandas DataFrames with Arrow when columns implement > __arrow_array__ > - > > Key: SPARK-31920 > URL: https://issues.apache.org/jira/browse/SPARK-31920 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.5, 3.0.0, 3.1.0 > Environment: pandas: 1.0.0 - 1.0.4 > pyarrow: 0.15.1 - 0.17.1 >Reporter: Stephen Caraher >Priority: Major
[jira] [Commented] (SPARK-31920) Failure in converting pandas DataFrames with Arrow when columns implement __arrow_array__
[ https://issues.apache.org/jira/browse/SPARK-31920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17127365#comment-17127365 ] Apache Spark commented on SPARK-31920: -- User 'moskvax' has created a pull request for this issue: https://github.com/apache/spark/pull/28743 > Failure in converting pandas DataFrames with Arrow when columns implement > __arrow_array__ > - > > Key: SPARK-31920 > URL: https://issues.apache.org/jira/browse/SPARK-31920 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.5, 3.0.0, 3.1.0 > Environment: pandas: 1.0.0 - 1.0.4 > pyarrow: 0.15.1 - 0.17.1 >Reporter: Stephen Caraher >Priority: Major > > When calling {{createDataFrame}} on a pandas DataFrame in which any of the > columns are backed by an array implementing {{\_\_arrow_array\_\_}} > ({{StringArray}}, {{IntegerArray}}, etc), the conversion will fail. > With pyarrow >= 0.17.0, the following exception occurs: > {noformat} > Traceback (most recent call last): > File > "/Users/stephen/Documents/github/spark/python/pyspark/sql/tests/test_arrow.py", > line 470, in test_createDataFrame_from_integer_extension_dtype > df_from_integer_ext_dtype = > self.spark.createDataFrame(pdf_integer_ext_dtype) > File "/Users/stephen/Documents/github/spark/python/pyspark/sql/session.py", > line 601, in createDataFrame > data, schema, samplingRatio, verifySchema) > File > "/Users/stephen/Documents/github/spark/python/pyspark/sql/pandas/conversion.py", > line 277, in createDataFrame > return self._create_from_pandas_with_arrow(data, schema, timezone) > File > "/Users/stephen/Documents/github/spark/python/pyspark/sql/pandas/conversion.py", > line 435, in _create_from_pandas_with_arrow > jrdd = self._sc._serialize_to_jvm(arrow_data, ser, reader_func, > create_RDD_server) > File "/Users/stephen/Documents/github/spark/python/pyspark/context.py", > line 570, in _serialize_to_jvm > serializer.dump_stream(data, tempFile) > File > 
"/Users/stephen/Documents/github/spark/python/pyspark/sql/pandas/serializers.py", > line 204, in dump_stream > super(ArrowStreamPandasSerializer, self).dump_stream(batches, stream) > File > "/Users/stephen/Documents/github/spark/python/pyspark/sql/pandas/serializers.py", > line 88, in dump_stream > for batch in iterator: > File > "/Users/stephen/Documents/github/spark/python/pyspark/sql/pandas/serializers.py", > line 203, in > batches = (self._create_batch(series) for series in iterator) > File > "/Users/stephen/Documents/github/spark/python/pyspark/sql/pandas/serializers.py", > line 194, in _create_batch > arrs.append(create_array(s, t)) > File > "/Users/stephen/Documents/github/spark/python/pyspark/sql/pandas/serializers.py", > line 161, in create_array > array = pa.Array.from_pandas(s, mask=mask, type=t, safe=self._safecheck) > File "pyarrow/array.pxi", line 805, in pyarrow.lib.Array.from_pandas > File "pyarrow/array.pxi", line 215, in pyarrow.lib.array > File "pyarrow/array.pxi", line 104, in > pyarrow.lib._handle_arrow_array_protocol > ValueError: Cannot specify a mask or a size when passing an object that is > converted with the __arrow_array__ protocol. 
> {noformat} > With pyarrow < 0.17.0, the conversion will fail earlier in the process, > during schema extraction: > {noformat} > File > "/Users/stephen/Documents/github/spark/python/pyspark/sql/tests/test_arrow.py", > line 470, in test_createDataFrame_from_integer_extension_dtype > df_from_integer_ext_dtype = > self.spark.createDataFrame(pdf_integer_ext_dtype) > File "/Users/stephen/Documents/github/spark/python/pyspark/sql/session.py", > line 601, in createDataFrame > data, schema, samplingRatio, verifySchema) > File > "/Users/stephen/Documents/github/spark/python/pyspark/sql/pandas/conversion.py", > line 277, in createDataFrame > return self._create_from_pandas_with_arrow(data, schema, timezone) > File > "/Users/stephen/Documents/github/spark/python/pyspark/sql/pandas/conversion.py", > line 397, in _create_from_pandas_with_arrow > arrow_schema = pa.Schema.from_pandas(pdf, preserve_index=False) > File "pyarrow/types.pxi", line 1078, in pyarrow.lib.Schema.from_pandas > File > "/Users/stephen/opt/miniconda3/envs/spark-dev/lib/python3.7/site-packages/pyarrow/pandas_compat.py", > line 519, in dataframe_to_types > type_ = pa.lib._ndarray_to_arrow_type(values, type_) > File "pyarrow/array.pxi", line 53, in pyarrow.lib._ndarray_to_arrow_type > File "pyarrow/array.pxi", line 64, in pyarrow.lib._ndarray_to_type > File "pyarrow/error.pxi", line 107, in pyarrow.lib.check_status > pyarrow.lib.ArrowTypeError: Did not pass numpy.dtype object > {noformat}
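The failure above hinges on pyarrow's `__arrow_array__` protocol: an object implementing it converts itself to Arrow, so the caller must not also pass a `mask` (or `type`) alongside it. The following self-contained sketch mimics that dispatch in plain Python; the names (`ExtensionColumn`, `convert_column`) are hypothetical stand-ins, not the actual code in pyarrow's `Array.from_pandas` or PySpark's `ArrowStreamPandasSerializer`.

```python
# Self-contained sketch of the dispatch problem behind this issue.
# The names here are hypothetical stand-ins; the real logic lives in
# pyarrow's Array.from_pandas and PySpark's ArrowStreamPandasSerializer.
# An object that implements __arrow_array__ converts itself to Arrow,
# so the caller must not also pass a mask alongside it.

class ExtensionColumn:
    """Stand-in for a pandas extension array such as IntegerArray."""

    def __init__(self, values):
        self.values = list(values)

    def __arrow_array__(self, type=None):
        # The object controls its own conversion, including null handling.
        return list(self.values)


def convert_column(column, mask=None):
    """Sketch of a converter that respects the __arrow_array__ protocol."""
    if hasattr(column, "__arrow_array__"):
        if mask is not None:
            # Mirrors the ValueError raised by pyarrow >= 0.17.0.
            raise ValueError(
                "Cannot specify a mask when passing an object that is "
                "converted with the __arrow_array__ protocol.")
        return column.__arrow_array__()
    # Plain sequences: apply the null mask ourselves (True means null).
    mask = mask or [False] * len(column)
    return [None if m else v for v, m in zip(column, mask)]


print(convert_column(ExtensionColumn([1, None, 3])))         # [1, None, 3]
print(convert_column([1, 2, 3], mask=[False, True, False]))  # [1, None, 3]
```

In this framing, the fix direction suggested by the traceback is to take the protocol branch for extension-backed columns instead of unconditionally passing a mask.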
[jira] [Assigned] (SPARK-31921) Wrong warning of "WARN Master: App app-xxx requires more resource than any of Workers could have."
[ https://issues.apache.org/jira/browse/SPARK-31921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31921: Assignee: (was: Apache Spark) > Wrong warning of "WARN Master: App app-xxx requires more resource than any of > Workers could have." > -- > > Key: SPARK-31921 > URL: https://issues.apache.org/jira/browse/SPARK-31921 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: wuyi >Priority: Major > > When starting spark-shell using local cluster mode, e.g. ./bin/spark-shell > --master "local-cluster[2, 1, 1024]", there will be a warning: > > {code:java} > 20/06/06 22:09:09 WARN Master: App app-20200606220908- requires more > resource than any of Workers could have. > {code} > which means the application can not get enough resources to launch at least > one executor. > But that's not true since we can successfully complete a job.
[jira] [Assigned] (SPARK-31921) Wrong warning of "WARN Master: App app-xxx requires more resource than any of Workers could have."
[ https://issues.apache.org/jira/browse/SPARK-31921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31921: Assignee: Apache Spark > Wrong warning of "WARN Master: App app-xxx requires more resource than any of > Workers could have." > -- > > Key: SPARK-31921 > URL: https://issues.apache.org/jira/browse/SPARK-31921 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: wuyi >Assignee: Apache Spark >Priority: Major > > When starting spark-shell using local cluster mode, e.g. ./bin/spark-shell > --master "local-cluster[2, 1, 1024]", there will be a warning: > > {code:java} > 20/06/06 22:09:09 WARN Master: App app-20200606220908- requires more > resource than any of Workers could have. > {code} > which means the application can not get enough resources to launch at least > one executor. > But that's not true since we can successfully complete a job.
[jira] [Commented] (SPARK-31921) Wrong warning of "WARN Master: App app-xxx requires more resource than any of Workers could have."
[ https://issues.apache.org/jira/browse/SPARK-31921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17127357#comment-17127357 ] Apache Spark commented on SPARK-31921: -- User 'Ngone51' has created a pull request for this issue: https://github.com/apache/spark/pull/28742 > Wrong warning of "WARN Master: App app-xxx requires more resource than any of > Workers could have." > -- > > Key: SPARK-31921 > URL: https://issues.apache.org/jira/browse/SPARK-31921 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: wuyi >Priority: Major > > When starting spark-shell using local cluster mode, e.g. ./bin/spark-shell > --master "local-cluster[2, 1, 1024]", there will be a warning: > > {code:java} > 20/06/06 22:09:09 WARN Master: App app-20200606220908- requires more > resource than any of Workers could have. > {code} > which means the application can not get enough resources to launch at least > one executor. > But that's not true since we can successfully complete a job.
[jira] [Updated] (SPARK-31921) Wrong warning of "WARN Master: App app-xxx requires more resource than any of Workers could have."
[ https://issues.apache.org/jira/browse/SPARK-31921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuyi updated SPARK-31921: - Description: When starting spark-shell using local cluster mode, e.g. ./bin/spark-shell --master "local-cluster[2, 1, 1024]", there will be a warning: {code:java} 20/06/06 22:09:09 WARN Master: App app-20200606220908- requires more resource than any of Workers could have. {code} which means the application can not get enough resources to launch at least one executor. But that's not true since we can successfully complete a job. was: When start spark-shell using local cluster mode, e.g. ./bin/spark-shell --master "local-cluster[2, 1, 1024]", there will be a warning: {code:java} 20/06/06 22:09:09 WARN Master: App app-20200606220908- requires more resource than any of Workers could have. {code} which means the application can not get enough resources to launch at least one executor. But that's not true since we can successfully complete a job. > Wrong warning of "WARN Master: App app-xxx requires more resource than any of > Workers could have." > -- > > Key: SPARK-31921 > URL: https://issues.apache.org/jira/browse/SPARK-31921 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: wuyi >Priority: Major > > When starting spark-shell using local cluster mode, e.g. ./bin/spark-shell > --master "local-cluster[2, 1, 1024]", there will be a warning: > > {code:java} > 20/06/06 22:09:09 WARN Master: App app-20200606220908- requires more > resource than any of Workers could have. > {code} > which means the application can not get enough resources to launch at least > one executor. > But that's not true since we can successfully complete a job.
[jira] [Created] (SPARK-31921) Wrong warning of "WARN Master: App app-xxx requires more resource than any of Workers could have."
wuyi created SPARK-31921: Summary: Wrong warning of "WARN Master: App app-xxx requires more resource than any of Workers could have." Key: SPARK-31921 URL: https://issues.apache.org/jira/browse/SPARK-31921 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.0 Reporter: wuyi When start spark-shell using local cluster mode, e.g. ./bin/spark-shell --master "local-cluster[2, 1, 1024]", there will be a warning: {code:java} 20/06/06 22:09:09 WARN Master: App app-20200606220908- requires more resource than any of Workers could have. {code} which means the application can not get enough resources to launch at least one executor. But that's not true since we can successfully complete a job.
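The spurious warning above comes from a per-worker feasibility check in the standalone Master: it warns when no single worker appears able to host even one executor of the requested size. The following is a simplified, hedged sketch of that kind of check in pure Python; the names (`WorkerInfo`, `can_launch_executor`) are hypothetical stand-ins, not the actual code in `org.apache.spark.deploy.master.Master`.

```python
# Simplified sketch of the per-worker feasibility check behind the warning.
# Names are hypothetical; the real logic lives in
# org.apache.spark.deploy.master.Master. The Master warns when no single
# worker could host even one executor of the requested size.

from dataclasses import dataclass


@dataclass
class WorkerInfo:
    cores: int
    memory_mb: int


def can_launch_executor(workers, exec_cores, exec_memory_mb):
    """True if at least one worker can host a single executor."""
    return any(w.cores >= exec_cores and w.memory_mb >= exec_memory_mb
               for w in workers)


# local-cluster[2, 1, 1024]: two workers with 1 core and 1024 MB each.
workers = [WorkerInfo(1, 1024), WorkerInfo(1, 1024)]

# Checked against the actual per-executor demand, the warning is spurious:
if can_launch_executor(workers, exec_cores=1, exec_memory_mb=1024):
    print("executors can be launched; no warning needed")
else:
    print("WARN Master: App requires more resource than any of Workers could have.")
```

Under this sketch, comparing against an inflated per-executor requirement (e.g. the app's total demand instead of one executor's share) is exactly what would make the warning fire even though executors launch fine.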
[jira] [Updated] (SPARK-31903) toPandas with Arrow enabled doesn't show metrics in Query UI.
[ https://issues.apache.org/jira/browse/SPARK-31903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-31903: - Fix Version/s: 2.4.7 > toPandas with Arrow enabled doesn't show metrics in Query UI. > - > > Key: SPARK-31903 > URL: https://issues.apache.org/jira/browse/SPARK-31903 > Project: Spark > Issue Type: Bug > Components: PySpark, R >Affects Versions: 2.4.5, 3.0.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > Fix For: 3.0.0, 2.4.7 > > Attachments: Screen Shot 2020-06-03 at 4.47.07 PM.png, Screen Shot > 2020-06-03 at 4.47.27 PM.png > > > When calling {{toPandas}}, usually Query UI shows each plan node's metric and > corresponding Stage ID and Task ID: > {code:java} > >>> df = spark.createDataFrame([(1, 10, 'abc'), (2, 20, 'def')], schema=['x', > >>> 'y', 'z']) > >>> df.toPandas() >x yz > 0 1 10 abc > 1 2 20 def > {code} > !Screen Shot 2020-06-03 at 4.47.07 PM.png! > but if Arrow execution is enabled, it shows only plan nodes and the duration > is not correct: > {code:java} > >>> spark.conf.set('spark.sql.execution.arrow.pyspark.enabled', True) > >>> df.toPandas() >x yz > 0 1 10 abc > 1 2 20 def{code} > > !Screen Shot 2020-06-03 at 4.47.27 PM.png!
[jira] [Assigned] (SPARK-31919) Push down more predicates through Join
[ https://issues.apache.org/jira/browse/SPARK-31919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31919: Assignee: Apache Spark (was: Gengliang Wang) > Push down more predicates through Join > -- > > Key: SPARK-31919 > URL: https://issues.apache.org/jira/browse/SPARK-31919 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Major > > Currently, in `PushPredicateThroughJoin`, if the condition predicate of the `Or` > operator can't be entirely pushed down, it is thrown away. > In fact, the predicates under `Or` operators can be partially pushed down. > For example, say `a` and `b` can be pushed into one of the joined > tables while `c` can't be pushed down; then the predicate > `a or (b and c)` > can be converted to > `(a or b) and (a or c)`, > and we can still push down `(a or b)`. > A disjunctive predicate cannot be pushed down, even partially, only when one > of its children is not partially convertible.
[jira] [Commented] (SPARK-31919) Push down more predicates through Join
[ https://issues.apache.org/jira/browse/SPARK-31919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17127263#comment-17127263 ] Apache Spark commented on SPARK-31919: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/28741 > Push down more predicates through Join > -- > > Key: SPARK-31919 > URL: https://issues.apache.org/jira/browse/SPARK-31919 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > Currently, in `PushPredicateThroughJoin`, if the condition predicate of the `Or` > operator can't be entirely pushed down, it is thrown away. > In fact, the predicates under `Or` operators can be partially pushed down. > For example, say `a` and `b` can be pushed into one of the joined > tables while `c` can't be pushed down; then the predicate > `a or (b and c)` > can be converted to > `(a or b) and (a or c)`, > and we can still push down `(a or b)`. > A disjunctive predicate cannot be pushed down, even partially, only when one > of its children is not partially convertible.
[jira] [Assigned] (SPARK-31919) Push down more predicates through Join
[ https://issues.apache.org/jira/browse/SPARK-31919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31919: Assignee: Gengliang Wang (was: Apache Spark) > Push down more predicates through Join > -- > > Key: SPARK-31919 > URL: https://issues.apache.org/jira/browse/SPARK-31919 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > Currently, in `PushPredicateThroughJoin`, if the condition predicate of the `Or` > operator can't be entirely pushed down, it is thrown away. > In fact, the predicates under `Or` operators can be partially pushed down. > For example, say `a` and `b` can be pushed into one of the joined > tables while `c` can't be pushed down; then the predicate > `a or (b and c)` > can be converted to > `(a or b) and (a or c)`, > and we can still push down `(a or b)`. > A disjunctive predicate cannot be pushed down, even partially, only when one > of its children is not partially convertible.
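The partial pushdown described in SPARK-31919 rests on distributing Or over And: `a or (b and c)` is equivalent to `(a or b) and (a or c)`, after which any conjunct whose atoms all reference one side of the join can be pushed. The following self-contained sketch illustrates the idea; the representation (atoms as strings, disjuncts as tuples, a `SIDE` table) is hypothetical and much simpler than Catalyst's `PushPredicateThroughJoin`, which works on expression trees.

```python
# Sketch of partially pushing a disjunctive predicate through a join.
# The representation is hypothetical (atoms as strings, disjuncts as
# tuples); Catalyst's PushPredicateThroughJoin works on expression trees.

from itertools import product

# Which join side each atom references (assumed for illustration; "both"
# marks a predicate like `c` that cannot be pushed to either side alone).
SIDE = {"a": "left", "b": "left", "c": "both"}


def distribute_or(left_conjuncts, right_conjuncts):
    """(A1 and A2 ...) or (B1 and B2 ...)  ==>  and of all (Ai or Bj)."""
    return [(x, y) for x, y in product(left_conjuncts, right_conjuncts)]


def pushable(disjunct, target_side):
    # (x or y) can be pushed to one side only if every atom in it
    # references that side alone.
    return all(SIDE[atom] == target_side for atom in disjunct)


# a or (b and c), with a and b referencing only the left table:
conjuncts = distribute_or(["a"], ["b", "c"])        # [(a or b), (a or c)]
pushed = [d for d in conjuncts if pushable(d, "left")]
print(pushed)  # only (a or b) survives: [('a', 'b')]
```

The non-pushed remainder (`a or c`, here) must still be kept as a join filter, since the pushed conjunct is only a necessary condition, not an equivalent one.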