[jira] [Updated] (SPARK-31924) Create remote shuffle service reference implementation
[ https://issues.apache.org/jira/browse/SPARK-31924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BoYang updated SPARK-31924: --- Description: People in the [Spark Scalability & Reliability Sync Meeting|https://docs.google.com/document/d/1T3y25dOaKWVO0pWd838GeiTeI3DUQJtwy6MKYPLuleg] have discussed remote (disaggregated) shuffle service at length, and plan to build a reference implementation to demonstrate the basic design and pave the way for a future production-grade remote shuffle service. There are already two pull requests that enhance the Spark shuffle metadata API to make it possible to implement a remote shuffle service ([PR 28616|https://github.com/apache/spark/pull/28616], [PR 28618|https://github.com/apache/spark/pull/28618]). Creating a remote shuffle service reference implementation will help validate those shuffle metadata APIs. was: People in [Spark Scalability & Reliability Sync Meeting|[https://docs.google.com/document/d/1T3y25dOaKWVO0pWd838GeiTeI3DUQJtwy6MKYPLuleg]] have discussed a lot about remote (disaggregated) shuffle service, and plan to do a reference implementation to help demonstrate some basic design and pave the way for a future production grade remote shuffle service. There are already two pull requests to enhance Spark shuffle metadata API to make it easy/possible to implement remote shuffle service ([PR 28616|https://github.com/apache/spark/pull/28616], [PR 28618|https://github.com/apache/spark/pull/28618]). Creating a remote shuffle service reference implementation will help to validate those shuffle metadata API.
> Create remote shuffle service reference implementation > -- > > Key: SPARK-31924 > URL: https://issues.apache.org/jira/browse/SPARK-31924 > Project: Spark > Issue Type: New Feature > Components: Shuffle > Affects Versions: 3.0.0 > Reporter: BoYang > Priority: Major > Fix For: 3.0.0 > > > People in the [Spark Scalability & Reliability Sync Meeting|https://docs.google.com/document/d/1T3y25dOaKWVO0pWd838GeiTeI3DUQJtwy6MKYPLuleg] have discussed remote (disaggregated) shuffle service at length, and plan to build a reference implementation to demonstrate the basic design and pave the way for a future production-grade remote shuffle service. > > There are already two pull requests that enhance the Spark shuffle metadata API to make it possible to implement a remote shuffle service ([PR 28616|https://github.com/apache/spark/pull/28616], [PR 28618|https://github.com/apache/spark/pull/28618]). Creating a remote shuffle service reference implementation will help validate those shuffle metadata APIs. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
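As context for what such a reference implementation involves, the core data path of a remote (disaggregated) shuffle can be sketched in a few lines. This is a hypothetical illustration, not the API from the linked PRs: map tasks push each reduce partition's bytes to a shared store keyed by (shuffleId, mapId, reduceId), and reduce tasks fetch every map's block for their partition from that store instead of from executor local disks.

```scala
// Hypothetical sketch of the core idea behind a remote shuffle service.
// All names here are invented for illustration; a TrieMap stands in for
// the remote storage backend (e.g. an object store or shuffle server).
import scala.collection.concurrent.TrieMap

final case class BlockKey(shuffleId: Int, mapId: Long, reduceId: Int)

class RemoteShuffleStore {
  private val blocks = TrieMap.empty[BlockKey, Array[Byte]]

  // Called by a map task for each reduce partition it produced.
  def writeBlock(key: BlockKey, data: Array[Byte]): Unit =
    blocks.put(key, data)

  // Called by a reduce task: gather every map's output for one
  // partition from the shared store, not from the executors.
  def readPartition(shuffleId: Int, reduceId: Int): Seq[Array[Byte]] =
    blocks.collect {
      case (k, v) if k.shuffleId == shuffleId && k.reduceId == reduceId => v
    }.toSeq
}
```

Because map output lives outside the executors, executors can be decommissioned without losing shuffle data, which is the main motivation behind a disaggregated shuffle service; the shuffle metadata API changes in the PRs above are what let Spark track such externally stored blocks.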
[jira] [Created] (SPARK-31924) Create remote shuffle service reference implementation
BoYang created SPARK-31924: -- Summary: Create remote shuffle service reference implementation Key: SPARK-31924 URL: https://issues.apache.org/jira/browse/SPARK-31924 Project: Spark Issue Type: New Feature Components: Shuffle Affects Versions: 3.0.0 Reporter: BoYang Fix For: 3.0.0 People in the [Spark Scalability & Reliability Sync Meeting|https://docs.google.com/document/d/1T3y25dOaKWVO0pWd838GeiTeI3DUQJtwy6MKYPLuleg] have discussed remote (disaggregated) shuffle service at length, and plan to build a reference implementation to demonstrate the basic design and pave the way for a future production-grade remote shuffle service. There are already two pull requests that enhance the Spark shuffle metadata API to make it possible to implement a remote shuffle service ([PR 28616|https://github.com/apache/spark/pull/28616], [PR 28618|https://github.com/apache/spark/pull/28618]). Creating a remote shuffle service reference implementation will help validate those shuffle metadata APIs.
[jira] [Resolved] (SPARK-30119) Support pagination for spark streaming tab
[ https://issues.apache.org/jira/browse/SPARK-30119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta resolved SPARK-30119. Fix Version/s: 3.1.0 Resolution: Fixed This issue is resolved in https://github.com/apache/spark/pull/28439 > Support pagination for spark streaming tab > --- > > Key: SPARK-30119 > URL: https://issues.apache.org/jira/browse/SPARK-30119 > Project: Spark > Issue Type: Improvement > Components: Web UI > Affects Versions: 3.1.0 > Reporter: jobit mathew > Assignee: Rakesh Raushan > Priority: Minor > Fix For: 3.1.0 > > > Support pagination for spark streaming tab
[jira] [Assigned] (SPARK-30119) Support pagination for spark streaming tab
[ https://issues.apache.org/jira/browse/SPARK-30119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta reassigned SPARK-30119: -- Assignee: Rakesh Raushan > Support pagination for spark streaming tab > --- > > Key: SPARK-30119 > URL: https://issues.apache.org/jira/browse/SPARK-30119 > Project: Spark > Issue Type: Improvement > Components: Web UI > Affects Versions: 3.1.0 > Reporter: jobit mathew > Assignee: Rakesh Raushan > Priority: Minor > > Support pagination for spark streaming tab
[jira] [Commented] (SPARK-31923) Event log cannot be generated when some internal accumulators use unexpected types
[ https://issues.apache.org/jira/browse/SPARK-31923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17127444#comment-17127444 ] Apache Spark commented on SPARK-31923: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/28744 > Event log cannot be generated when some internal accumulators use unexpected > types > -- > > Key: SPARK-31923 > URL: https://issues.apache.org/jira/browse/SPARK-31923 > Project: Spark > Issue Type: Bug > Components: Spark Core > Affects Versions: 2.4.6 > Reporter: Shixiong Zhu > Priority: Major > > A user may use internal accumulators by adding the "internal.metrics." prefix > to the accumulator name to hide sensitive information from the UI (accumulators > other than internal ones are shown in the Spark UI). > However, *org.apache.spark.util.JsonProtocol.accumValueToJson* assumes an > internal accumulator has only three possible types: int, long, and > java.util.List[(BlockId, BlockStatus)]. When an internal accumulator uses an > unexpected type, it crashes. > An event that contains such an accumulator is dropped from the event log because it cannot > be converted to JSON, which causes weird UI issues when rendering in the > Spark History Server. For example, if a `SparkListenerTaskEnd` event is dropped > because of this issue, the user will see the task as still running even though it > finished. > It's better to make *accumValueToJson* more robust.
> > How to reproduce it: > - Enable Spark event log > - Run the following command: > {code} > scala> val accu = sc.doubleAccumulator("internal.metrics.foo") > accu: org.apache.spark.util.DoubleAccumulator = DoubleAccumulator(id: 0, > name: Some(internal.metrics.foo), value: 0.0) > scala> sc.parallelize(1 to 1, 1).foreach { _ => accu.add(1.0) } > 20/06/06 16:11:27 ERROR AsyncEventQueue: Listener EventLoggingListener threw > an exception > java.lang.ClassCastException: java.lang.Double cannot be cast to > java.util.List > at > org.apache.spark.util.JsonProtocol$.accumValueToJson(JsonProtocol.scala:330) > at > org.apache.spark.util.JsonProtocol$$anonfun$accumulableInfoToJson$3.apply(JsonProtocol.scala:306) > at > org.apache.spark.util.JsonProtocol$$anonfun$accumulableInfoToJson$3.apply(JsonProtocol.scala:306) > at scala.Option.map(Option.scala:146) > at > org.apache.spark.util.JsonProtocol$.accumulableInfoToJson(JsonProtocol.scala:306) > at > org.apache.spark.util.JsonProtocol$$anonfun$accumulablesToJson$2.apply(JsonProtocol.scala:299) > at > org.apache.spark.util.JsonProtocol$$anonfun$accumulablesToJson$2.apply(JsonProtocol.scala:299) > at scala.collection.immutable.List.map(List.scala:284) > at > org.apache.spark.util.JsonProtocol$.accumulablesToJson(JsonProtocol.scala:299) > at > org.apache.spark.util.JsonProtocol$.taskInfoToJson(JsonProtocol.scala:291) > at > org.apache.spark.util.JsonProtocol$.taskEndToJson(JsonProtocol.scala:145) > at > org.apache.spark.util.JsonProtocol$.sparkEventToJson(JsonProtocol.scala:76) > at > org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:138) > at > org.apache.spark.scheduler.EventLoggingListener.onTaskEnd(EventLoggingListener.scala:158) > at > org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:45) > at > org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) > at > 
org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) > at > org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:91) > at > org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$super$postToAll(AsyncEventQueue.scala:92) > at > org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply$mcJ$sp(AsyncEventQueue.scala:92) > at > org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:87) > at > org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:87) > at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) > at > org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:87) > at > org.apache.spark.scheduler.AsyncEventQueue$$anon$1$$anonfun$run$1.apply$mcV$sp(AsyncEventQueue.scala:83) > at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1302) > at > org.apache.spark.scheduler.AsyncEventQueue$$anon$1.run(AsyncEventQueue.scala:82) > {code}
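The defensive serialization the report asks for can be sketched as follows. This is a hypothetical standalone function, not the actual JsonProtocol code: instead of assuming one of the three known internal types, it pattern-matches and falls back to a string representation for anything unexpected, so the listener never throws a ClassCastException.

```scala
// Hypothetical sketch (NOT the real JsonProtocol implementation) of a
// defensive accumulator-value serializer. The real accumValueToJson
// handles only Int, Long, and java.util.List[(BlockId, BlockStatus)]
// for internal accumulators and throws on anything else; the idea of
// the fix is to degrade gracefully instead.
def accumValueToJsonString(value: Any): String = value match {
  case i: Int  => i.toString                 // known internal type
  case l: Long => l.toString                 // known internal type
  case list: java.util.List[_] =>
    // The real code serializes (BlockId, BlockStatus) pairs; elided here.
    s"[${list.size} block updates]"
  case other =>
    // Fallback: quote the string form instead of crashing the
    // EventLoggingListener on, e.g., a Double accumulator.
    "\"" + other.toString + "\""
}
```

With this shape, a user-created `internal.metrics.foo` DoubleAccumulator like the one in the repro above would serialize as a quoted string rather than killing the event log.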
[jira] [Assigned] (SPARK-31923) Event log cannot be generated when some internal accumulators use unexpected types
[ https://issues.apache.org/jira/browse/SPARK-31923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31923: Assignee: Apache Spark
[jira] [Assigned] (SPARK-31923) Event log cannot be generated when some internal accumulators use unexpected types
[ https://issues.apache.org/jira/browse/SPARK-31923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31923: Assignee: (was: Apache Spark)
[jira] [Updated] (SPARK-31923) Event log cannot be generated when some internal accumulators use unexpected types
[ https://issues.apache.org/jira/browse/SPARK-31923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-31923: - Description: A user may use internal accumulators by adding the "internal.metrics." prefix to the accumulator name to hide sensitive information from UI (Accumulators except internal ones will be shown in Spark UI). However, *org.apache.spark.util.JsonProtocol.accumValueToJson* assumes an internal accumulator has only 3 possible types: int, long, and java.util.List[(BlockId, BlockStatus)]. When an internal accumulator uses an unexpected type, it will crash. An event log that contains such accumulator will be dropped because it cannot be converted to JSON, and it will cause weird UI issue when rendering in Spark History Server. For example, if `SparkListenerTaskEnd` is dropped because of this issue, the user will see the task is still running even if it was finished. It's better to make *accumValueToJson* more robust. How to reproduce it: - Enable Spark event log - Run the following command: {code} scala> val accu = sc.doubleAccumulator("internal.metrics.foo") accu: org.apache.spark.util.DoubleAccumulator = DoubleAccumulator(id: 0, name: Some(internal.metrics.foo), value: 0.0) scala> sc.parallelize(1 to 1, 1).foreach { _ => accu.add(1.0) } 20/06/06 16:11:27 ERROR AsyncEventQueue: Listener EventLoggingListener threw an exception java.lang.ClassCastException: java.lang.Double cannot be cast to java.util.List at org.apache.spark.util.JsonProtocol$.accumValueToJson(JsonProtocol.scala:330) at org.apache.spark.util.JsonProtocol$$anonfun$accumulableInfoToJson$3.apply(JsonProtocol.scala:306) at org.apache.spark.util.JsonProtocol$$anonfun$accumulableInfoToJson$3.apply(JsonProtocol.scala:306) at scala.Option.map(Option.scala:146) at org.apache.spark.util.JsonProtocol$.accumulableInfoToJson(JsonProtocol.scala:306) at org.apache.spark.util.JsonProtocol$$anonfun$accumulablesToJson$2.apply(JsonProtocol.scala:299) at 
org.apache.spark.util.JsonProtocol$$anonfun$accumulablesToJson$2.apply(JsonProtocol.scala:299) at scala.collection.immutable.List.map(List.scala:284) at org.apache.spark.util.JsonProtocol$.accumulablesToJson(JsonProtocol.scala:299) at org.apache.spark.util.JsonProtocol$.taskInfoToJson(JsonProtocol.scala:291) at org.apache.spark.util.JsonProtocol$.taskEndToJson(JsonProtocol.scala:145) at org.apache.spark.util.JsonProtocol$.sparkEventToJson(JsonProtocol.scala:76) at org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:138) at org.apache.spark.scheduler.EventLoggingListener.onTaskEnd(EventLoggingListener.scala:158) at org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:45) at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:91) at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$super$postToAll(AsyncEventQueue.scala:92) at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply$mcJ$sp(AsyncEventQueue.scala:92) at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:87) at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:87) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:87) at org.apache.spark.scheduler.AsyncEventQueue$$anon$1$$anonfun$run$1.apply$mcV$sp(AsyncEventQueue.scala:83) at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1302) at 
org.apache.spark.scheduler.AsyncEventQueue$$anon$1.run(AsyncEventQueue.scala:82) {code} was: A user may use internal accumulators by adding the "internal.metrics." prefix to the accumulator name to hide sensitive information from UI (Accumulators except internal ones will be shown in Spark UI). However, org.apache.spark.util.JsonProtocol.accumValueToJson assumes an internal accumulator has only 3 possible types: int, long, and java.util.List[(BlockId, BlockStatus)]. When an internal accumulator uses an unexpected type, it will crash. An event log that contains such accumulator will be dropped because it cannot be converted to JSON, and it will cause weird UI issue when rendering in Spark History Server. For example, if `SparkListenerTaskEnd` is dropped
[jira] [Updated] (SPARK-31923) Event log cannot be generated when some internal accumulators use unexpected types
[ https://issues.apache.org/jira/browse/SPARK-31923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shixiong Zhu updated SPARK-31923:
---------------------------------
    Description:

A user may use internal accumulators by adding the "internal.metrics." prefix to the accumulator name to hide sensitive information from the UI (accumulators except internal ones are shown in the Spark UI). However, org.apache.spark.util.JsonProtocol.accumValueToJson assumes an internal accumulator has only 3 possible types: int, long, and java.util.List[(BlockId, BlockStatus)]. When an internal accumulator uses an unexpected type, it will crash. An event that contains such an accumulator will be dropped because it cannot be converted to JSON, and it will cause weird UI issues when rendered in the Spark History Server. For example, if `SparkListenerTaskEnd` is dropped because of this issue, the user will see the task as still running even though it has finished. It's better to make accumValueToJson more robust.

How to reproduce it:
- Enable the Spark event log
- Run the following commands:

{code}
scala> val accu = sc.doubleAccumulator("internal.metrics.foo")
accu: org.apache.spark.util.DoubleAccumulator = DoubleAccumulator(id: 0, name: Some(internal.metrics.foo), value: 0.0)

scala> sc.parallelize(1 to 1, 1).foreach { _ => accu.add(1.0) }
20/06/06 16:11:27 ERROR AsyncEventQueue: Listener EventLoggingListener threw an exception
java.lang.ClassCastException: java.lang.Double cannot be cast to java.util.List
    at org.apache.spark.util.JsonProtocol$.accumValueToJson(JsonProtocol.scala:330)
    at org.apache.spark.util.JsonProtocol$$anonfun$accumulableInfoToJson$3.apply(JsonProtocol.scala:306)
    at org.apache.spark.util.JsonProtocol$$anonfun$accumulableInfoToJson$3.apply(JsonProtocol.scala:306)
    at scala.Option.map(Option.scala:146)
    at org.apache.spark.util.JsonProtocol$.accumulableInfoToJson(JsonProtocol.scala:306)
    at org.apache.spark.util.JsonProtocol$$anonfun$accumulablesToJson$2.apply(JsonProtocol.scala:299)
    at org.apache.spark.util.JsonProtocol$$anonfun$accumulablesToJson$2.apply(JsonProtocol.scala:299)
    at scala.collection.immutable.List.map(List.scala:284)
    at org.apache.spark.util.JsonProtocol$.accumulablesToJson(JsonProtocol.scala:299)
    at org.apache.spark.util.JsonProtocol$.taskInfoToJson(JsonProtocol.scala:291)
    at org.apache.spark.util.JsonProtocol$.taskEndToJson(JsonProtocol.scala:145)
    at org.apache.spark.util.JsonProtocol$.sparkEventToJson(JsonProtocol.scala:76)
    at org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:138)
    at org.apache.spark.scheduler.EventLoggingListener.onTaskEnd(EventLoggingListener.scala:158)
    at org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:45)
    at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
    at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
    at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:91)
    at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$super$postToAll(AsyncEventQueue.scala:92)
    at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply$mcJ$sp(AsyncEventQueue.scala:92)
    at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:87)
    at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:87)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
    at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:87)
    at org.apache.spark.scheduler.AsyncEventQueue$$anon$1$$anonfun$run$1.apply$mcV$sp(AsyncEventQueue.scala:83)
    at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1302)
    at org.apache.spark.scheduler.AsyncEventQueue$$anon$1.run(AsyncEventQueue.scala:82)
{code}

was:
A user may use internal accumulators by adding the "internal.metrics." prefix to the accumulator name to hide sensitive information from UI (Accumulators will be shown in Spark UI by default). However, org.apache.spark.util.JsonProtocol.accumValueToJson assumes an internal accumulator has only 3 possible types: int, long, and java.util.List[(BlockId, BlockStatus)]. When an internal accumulator uses an unexpected type, it will crash. An event log that contains such accumulator will be dropped because it cannot be converted to JSON, and it will cause weird UI issue when rendering in Spark History Server. For example, if `SparkListenerTaskEnd` is dropped because of this is
[jira] [Updated] (SPARK-31923) Event log cannot be generated when some internal accumulators use unexpected types
[ https://issues.apache.org/jira/browse/SPARK-31923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-31923: - Description: A user may use internal accumulators by adding the "internal.metrics." prefix to the accumulator name to hide sensitive information from UI (Accumulators will be shown in Spark UI by default). However, org.apache.spark.util.JsonProtocol.accumValueToJson assumes an internal accumulator has only 3 possible types: int, long, and java.util.List[(BlockId, BlockStatus)]. When an internal accumulator uses an unexpected type, it will crash. An event log that contains such accumulator will not be able to convert to json, and it will cause weird UI issue when rendering in Spark History Server. For example, if `SparkListenerTaskEnd` is dropped because of this issue, the user will see the task is still running even if it was finished. It's better to make accumValueToJson more robust. How to reproduce it: - Enable Spark event log - Run the following command: {code} scala> val accu = sc.doubleAccumulator("internal.metrics.foo") accu: org.apache.spark.util.DoubleAccumulator = DoubleAccumulator(id: 0, name: Some(internal.metrics.foo), value: 0.0) scala> sc.parallelize(1 to 1, 1).foreach { _ => accu.add(1.0) } 20/06/06 16:11:27 ERROR AsyncEventQueue: Listener EventLoggingListener threw an exception java.lang.ClassCastException: java.lang.Double cannot be cast to java.util.List at org.apache.spark.util.JsonProtocol$.accumValueToJson(JsonProtocol.scala:330) at org.apache.spark.util.JsonProtocol$$anonfun$accumulableInfoToJson$3.apply(JsonProtocol.scala:306) at org.apache.spark.util.JsonProtocol$$anonfun$accumulableInfoToJson$3.apply(JsonProtocol.scala:306) at scala.Option.map(Option.scala:146) at org.apache.spark.util.JsonProtocol$.accumulableInfoToJson(JsonProtocol.scala:306) at org.apache.spark.util.JsonProtocol$$anonfun$accumulablesToJson$2.apply(JsonProtocol.scala:299) at 
org.apache.spark.util.JsonProtocol$$anonfun$accumulablesToJson$2.apply(JsonProtocol.scala:299) at scala.collection.immutable.List.map(List.scala:284) at org.apache.spark.util.JsonProtocol$.accumulablesToJson(JsonProtocol.scala:299) at org.apache.spark.util.JsonProtocol$.taskInfoToJson(JsonProtocol.scala:291) at org.apache.spark.util.JsonProtocol$.taskEndToJson(JsonProtocol.scala:145) at org.apache.spark.util.JsonProtocol$.sparkEventToJson(JsonProtocol.scala:76) at org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:138) at org.apache.spark.scheduler.EventLoggingListener.onTaskEnd(EventLoggingListener.scala:158) at org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:45) at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37) at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:91) at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$super$postToAll(AsyncEventQueue.scala:92) at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply$mcJ$sp(AsyncEventQueue.scala:92) at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:87) at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:87) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58) at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:87) at org.apache.spark.scheduler.AsyncEventQueue$$anon$1$$anonfun$run$1.apply$mcV$sp(AsyncEventQueue.scala:83) at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1302) at 
org.apache.spark.scheduler.AsyncEventQueue$$anon$1.run(AsyncEventQueue.scala:82) {code} was: A user may use internal accumulators by adding the "internal.metrics." prefix to the accumulator name to hide sensitive information from UI (Accumulators will be shown in Spark UI by default). However, org.apache.spark.util.JsonProtocol.accumValueToJson assumes an internal accumulator has only 3 possible types: int, long, and java.util.List[(BlockId, BlockStatus)]. When an internal accumulator uses an unexpected type, it will crash. It's better to make accumValueToJson more robust. How to reproduce it: - Enable Spark event log - Run the following command: {code} scala> val accu = sc.doubleAccumulator("internal.metrics.foo") accu: org.apache.spark.util.DoubleAccumulator = DoubleAccumulator(id: 0, na
[jira] [Created] (SPARK-31923) Event log cannot be generated when some internal accumulators use unexpected types
Shixiong Zhu created SPARK-31923:
---------------------------------

             Summary: Event log cannot be generated when some internal accumulators use unexpected types
                 Key: SPARK-31923
                 URL: https://issues.apache.org/jira/browse/SPARK-31923
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.4.6
            Reporter: Shixiong Zhu

A user may use internal accumulators by adding the "internal.metrics." prefix to the accumulator name to hide sensitive information from UI (Accumulators will be shown in Spark UI by default). However, org.apache.spark.util.JsonProtocol.accumValueToJson assumes an internal accumulator has only 3 possible types: int, long, and java.util.List[(BlockId, BlockStatus)]. When an internal accumulator uses an unexpected type, it will crash. It's better to make accumValueToJson more robust.

How to reproduce it:
- Enable Spark event log
- Run the following command:

{code}
scala> val accu = sc.doubleAccumulator("internal.metrics.foo")
accu: org.apache.spark.util.DoubleAccumulator = DoubleAccumulator(id: 0, name: Some(internal.metrics.foo), value: 0.0)

scala> sc.parallelize(1 to 1, 1).foreach { _ => accu.add(1.0) }
20/06/06 16:11:27 ERROR AsyncEventQueue: Listener EventLoggingListener threw an exception
java.lang.ClassCastException: java.lang.Double cannot be cast to java.util.List
    at org.apache.spark.util.JsonProtocol$.accumValueToJson(JsonProtocol.scala:330)
    at org.apache.spark.util.JsonProtocol$$anonfun$accumulableInfoToJson$3.apply(JsonProtocol.scala:306)
    at org.apache.spark.util.JsonProtocol$$anonfun$accumulableInfoToJson$3.apply(JsonProtocol.scala:306)
    at scala.Option.map(Option.scala:146)
    at org.apache.spark.util.JsonProtocol$.accumulableInfoToJson(JsonProtocol.scala:306)
    at org.apache.spark.util.JsonProtocol$$anonfun$accumulablesToJson$2.apply(JsonProtocol.scala:299)
    at org.apache.spark.util.JsonProtocol$$anonfun$accumulablesToJson$2.apply(JsonProtocol.scala:299)
    at scala.collection.immutable.List.map(List.scala:284)
    at org.apache.spark.util.JsonProtocol$.accumulablesToJson(JsonProtocol.scala:299)
    at org.apache.spark.util.JsonProtocol$.taskInfoToJson(JsonProtocol.scala:291)
    at org.apache.spark.util.JsonProtocol$.taskEndToJson(JsonProtocol.scala:145)
    at org.apache.spark.util.JsonProtocol$.sparkEventToJson(JsonProtocol.scala:76)
    at org.apache.spark.scheduler.EventLoggingListener.logEvent(EventLoggingListener.scala:138)
    at org.apache.spark.scheduler.EventLoggingListener.onTaskEnd(EventLoggingListener.scala:158)
    at org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:45)
    at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
    at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
    at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:91)
    at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$super$postToAll(AsyncEventQueue.scala:92)
    at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply$mcJ$sp(AsyncEventQueue.scala:92)
    at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:87)
    at org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:87)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
    at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:87)
    at org.apache.spark.scheduler.AsyncEventQueue$$anon$1$$anonfun$run$1.apply$mcV$sp(AsyncEventQueue.scala:83)
    at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1302)
    at org.apache.spark.scheduler.AsyncEventQueue$$anon$1.run(AsyncEventQueue.scala:82)
{code}

--
This message was sent by Atlassian Jira (v8.3.4#803005)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
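The robustness the report asks for amounts to giving accumValueToJson a fallback branch instead of assuming one of three types. A minimal Python sketch of that idea (the helper name and the fall-back-to-string behavior are illustrative assumptions, not Spark's actual fix):

```python
import json

def accum_value_to_json(name, value):
    """Convert an accumulator value to a JSON-serializable form.

    Hypothetical sketch: "internal.metrics."-prefixed accumulators are
    expected to hold an int/long or a list of block updates; any other
    type falls back to its string representation instead of raising.
    """
    if isinstance(value, (int, list)):
        return value
    # Fallback: never crash the event log on an unexpected type.
    return str(value)

# A double-valued "internal" accumulator no longer breaks serialization.
payload = json.dumps(
    {"internal.metrics.foo": accum_value_to_json("internal.metrics.foo", 1.0)}
)
```

With this shape, a `SparkListenerTaskEnd` event carrying an unexpected accumulator type still serializes, so the event is not dropped from the log.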
[jira] [Created] (SPARK-31918) SparkR CRAN check gives a warning with R 4.0.0
Shivaram Venkataraman created SPARK-31918:
------------------------------------------

             Summary: SparkR CRAN check gives a warning with R 4.0.0
                 Key: SPARK-31918
                 URL: https://issues.apache.org/jira/browse/SPARK-31918
             Project: Spark
          Issue Type: Bug
          Components: SparkR
    Affects Versions: 2.4.6
            Reporter: Shivaram Venkataraman

When the SparkR package is run through a CRAN check (i.e. with something like R CMD check --as-cran ~/Downloads/SparkR_2.4.6.tar.gz), we rebuild the SparkR vignette as a part of the checks. However, this seems to be failing with R 4.0.0 on OSX -- both on my local machine and on CRAN: https://cran.r-project.org/web/checks/check_results_SparkR.html

cc [~felixcheung]
[jira] [Updated] (SPARK-31918) SparkR CRAN check gives a warning with R 4.0.0 on OSX
[ https://issues.apache.org/jira/browse/SPARK-31918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shivaram Venkataraman updated SPARK-31918:
------------------------------------------
    Summary: SparkR CRAN check gives a warning with R 4.0.0 on OSX  (was: SparkR CRAN check gives a warning with R 4.0.0)

> SparkR CRAN check gives a warning with R 4.0.0 on OSX
> -----------------------------------------------------
>
>                 Key: SPARK-31918
>                 URL: https://issues.apache.org/jira/browse/SPARK-31918
>             Project: Spark
>          Issue Type: Bug
>          Components: SparkR
>    Affects Versions: 2.4.6
>            Reporter: Shivaram Venkataraman
>            Priority: Major
>
> When the SparkR package is run through a CRAN check (i.e. with something like R CMD check --as-cran ~/Downloads/SparkR_2.4.6.tar.gz), we rebuild the SparkR vignette as a part of the checks.
> However this seems to be failing with R 4.0.0 on OSX -- both on my local machine and on CRAN:
> https://cran.r-project.org/web/checks/check_results_SparkR.html
> cc [~felixcheung]
[jira] [Created] (SPARK-31919) Push down more predicates through Join
Gengliang Wang created SPARK-31919:
-----------------------------------

             Summary: Push down more predicates through Join
                 Key: SPARK-31919
                 URL: https://issues.apache.org/jira/browse/SPARK-31919
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.1.0
            Reporter: Gengliang Wang
            Assignee: Gengliang Wang

Currently, in `PushPredicateThroughJoin`, if a condition under an `Or` operator can't be entirely pushed down, the whole predicate is thrown away. In fact, predicates under `Or` operators can be partially pushed down. For example, say `a` and `b` can be pushed into one of the joined tables while `c` can't: the predicate `a or (b and c)` can be converted to `(a or b) and (a or c)`, and we can still push down `(a or b)`. We can't push down a disjunctive predicate only when one of its children is not even partially convertible.
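The rewrite described above is just the distributive law of Boolean algebra, so its correctness can be checked exhaustively over all truth assignments. A small sketch (function names are illustrative, not Spark code):

```python
from itertools import product

def original(a, b, c):
    # The predicate as written in the query.
    return a or (b and c)

def rewritten(a, b, c):
    # Distributed form: (a or b) references only the pushable columns
    # and can move below the join; (a or c) stays above it.
    return (a or b) and (a or c)

# The two forms agree on every assignment, so pushing (a or b) is safe.
assert all(original(a, b, c) == rewritten(a, b, c)
           for a, b, c in product([False, True], repeat=3))
```

Because the conjuncts of the rewritten form are individually weaker conditions, pushing `(a or b)` alone can only filter rows that the full predicate would also reject.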
[jira] [Commented] (SPARK-28594) Allow event logs for running streaming apps to be rolled over
[ https://issues.apache.org/jira/browse/SPARK-28594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17127119#comment-17127119 ]

Shuai Lu commented on SPARK-28594:
----------------------------------
Hi [~kabhwan], are we planning to support this feature in Spark 2.4? It has been an issue with Spark 2.4 as well.

> Allow event logs for running streaming apps to be rolled over
> -------------------------------------------------------------
>
>                 Key: SPARK-28594
>                 URL: https://issues.apache.org/jira/browse/SPARK-28594
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.0.0
>            Reporter: Stephen Levett
>            Assignee: Jungtaek Lim
>            Priority: Major
>              Labels: releasenotes
>             Fix For: 3.0.0
>
> At all current Spark releases, when event logging is enabled for Spark Streaming apps the event logs grow massively. The files continue to grow until the application is stopped or killed.
> The Spark history server then has difficulty processing the files.
> https://issues.apache.org/jira/browse/SPARK-8617 addresses .inprogress files but not event log files that are still running.
> Identify a mechanism to set a "max file" size so that the file is rolled over when it reaches this size.
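The "max file size" rollover mechanism requested above can be sketched generically. This is not Spark's implementation (the fix shipped in 3.0.0 as rolling event logs); it is a minimal illustration of size-based rollover with hypothetical names:

```python
import os

class RollingWriter:
    """Minimal sketch of size-based log rollover (not Spark's code).

    Writes to <base>.0, <base>.1, ... and starts a new file once the
    current one would exceed max_bytes, so no single event log file
    grows without bound while the app keeps running.
    """
    def __init__(self, base, max_bytes):
        self.base = base
        self.max_bytes = max_bytes
        self.index = 0        # suffix of the current file
        self.written = 0      # bytes written to the current file

    def current_path(self):
        return f"{self.base}.{self.index}"

    def write(self, line):
        # Roll over before the current file would exceed the cap.
        if self.written > 0 and self.written + len(line) > self.max_bytes:
            self.index += 1
            self.written = 0
        with open(self.current_path(), "a") as f:
            f.write(line)
        self.written += len(line)
```

Old segments (`<base>.0`, `<base>.1`, ...) can then be compacted or deleted independently of the still-growing current segment, which is what keeps the history server's work bounded.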
[jira] [Updated] (SPARK-31922) TransportRequestHandler Error when exit spark-shell with local-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-31922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

wuyi updated SPARK-31922:
-------------------------
    Affects Version/s: 2.4.6

> TransportRequestHandler Error when exit spark-shell with local-cluster mode
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-31922
>                 URL: https://issues.apache.org/jira/browse/SPARK-31922
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.4.6, 3.0.0
>            Reporter: wuyi
>            Priority: Major
>
> There's always an error from TransportRequestHandler when exiting spark-shell under local-cluster mode:
>
> {code:java}
> 20/06/06 23:08:29 ERROR TransportRequestHandler: Error while invoking RpcHandler#receive() for one-way message.
> 20/06/06 23:08:29 ERROR TransportRequestHandler: Error while invoking RpcHandler#receive() for one-way message.
> org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already stopped.
>     at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:167)
>     at org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:150)
>     at org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:691)
>     at org.apache.spark.network.server.TransportRequestHandler.processOneWayMessage(TransportRequestHandler.java:253)
>     at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:111)
>     at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:140)
>     at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53)
>     at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
>     at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>     at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>     at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
>     at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
>     at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>     at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>     at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
>     at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
>     at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>     at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>     at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
>     at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:102)
>     at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>     at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>     at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
>     at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
>     at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>     at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>     at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
>     at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
>     at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714)
>     at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
>     at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)
>     at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
>     at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>     at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>     at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>     at java.lang.Thread.run(Thread.java:748)
> 20/06/06 23:08:29 ERROR TransportRequestHandler: Error while invoking RpcHandler#receive() for one-way message.
> org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already stopped.
>     at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:167
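The error above is a shutdown race: a one-way RPC message arrives after the RpcEnv has already stopped. A toy sketch of the pattern that tolerates such late messages instead of surfacing them as errors (the class and method names are hypothetical illustrations, not Spark's code):

```python
import threading

class RpcEnvStoppedError(Exception):
    """Stand-in for org.apache.spark.rpc.RpcEnvStoppedException."""

class Dispatcher:
    """Toy dispatcher illustrating the shutdown race in this report.

    post() raises once the env is stopped; post_quietly() downgrades
    that to a no-op, which is how a late one-way message during
    shutdown can be swallowed instead of logged as an ERROR.
    """
    def __init__(self):
        self._stopped = False
        self._lock = threading.Lock()
        self.delivered = []

    def stop(self):
        with self._lock:
            self._stopped = True

    def post(self, message):
        with self._lock:
            if self._stopped:
                raise RpcEnvStoppedError("RpcEnv already stopped.")
            self.delivered.append(message)

    def post_quietly(self, message):
        # Benign during shutdown: drop the message silently.
        try:
            self.post(message)
        except RpcEnvStoppedError:
            pass
```

The design question the issue raises is exactly which path one-way messages should take during teardown: the strict `post` (noisy ERROR logs on exit) or the lenient `post_quietly` (silent drop once stopped).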
[jira] [Created] (SPARK-31920) Failure in converting pandas DataFrames with columns via Arrow
Stephen Caraher created SPARK-31920:
------------------------------------

             Summary: Failure in converting pandas DataFrames with columns via Arrow
                 Key: SPARK-31920
                 URL: https://issues.apache.org/jira/browse/SPARK-31920
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 2.4.5, 3.0.0, 3.1.0
         Environment: pandas: 1.0.0 - 1.0.4
                      pyarrow: 0.15.1 - 0.17.1
            Reporter: Stephen Caraher

When calling {{createDataFrame}} on a pandas DataFrame in which any of the columns are backed by an array implementing {{\_\_arrow_array\_\_}} ({{StringArray}}, {{IntegerArray}}, etc), the conversion will fail.

With pyarrow >= 0.17.0, the following exception occurs:

{noformat}
Traceback (most recent call last):
  File "/Users/stephen/Documents/github/spark/python/pyspark/sql/tests/test_arrow.py", line 470, in test_createDataFrame_from_integer_extension_dtype
    df_from_integer_ext_dtype = self.spark.createDataFrame(pdf_integer_ext_dtype)
  File "/Users/stephen/Documents/github/spark/python/pyspark/sql/session.py", line 601, in createDataFrame
    data, schema, samplingRatio, verifySchema)
  File "/Users/stephen/Documents/github/spark/python/pyspark/sql/pandas/conversion.py", line 277, in createDataFrame
    return self._create_from_pandas_with_arrow(data, schema, timezone)
  File "/Users/stephen/Documents/github/spark/python/pyspark/sql/pandas/conversion.py", line 435, in _create_from_pandas_with_arrow
    jrdd = self._sc._serialize_to_jvm(arrow_data, ser, reader_func, create_RDD_server)
  File "/Users/stephen/Documents/github/spark/python/pyspark/context.py", line 570, in _serialize_to_jvm
    serializer.dump_stream(data, tempFile)
  File "/Users/stephen/Documents/github/spark/python/pyspark/sql/pandas/serializers.py", line 204, in dump_stream
    super(ArrowStreamPandasSerializer, self).dump_stream(batches, stream)
  File "/Users/stephen/Documents/github/spark/python/pyspark/sql/pandas/serializers.py", line 88, in dump_stream
    for batch in iterator:
  File "/Users/stephen/Documents/github/spark/python/pyspark/sql/pandas/serializers.py", line 203, in <genexpr>
    batches = (self._create_batch(series) for series in iterator)
  File "/Users/stephen/Documents/github/spark/python/pyspark/sql/pandas/serializers.py", line 194, in _create_batch
    arrs.append(create_array(s, t))
  File "/Users/stephen/Documents/github/spark/python/pyspark/sql/pandas/serializers.py", line 161, in create_array
    array = pa.Array.from_pandas(s, mask=mask, type=t, safe=self._safecheck)
  File "pyarrow/array.pxi", line 805, in pyarrow.lib.Array.from_pandas
  File "pyarrow/array.pxi", line 215, in pyarrow.lib.array
  File "pyarrow/array.pxi", line 104, in pyarrow.lib._handle_arrow_array_protocol
ValueError: Cannot specify a mask or a size when passing an object that is converted with the __arrow_array__ protocol.
{noformat}

With pyarrow < 0.17.0, the conversion will fail earlier in the process, during schema extraction:

{noformat}
  File "/Users/stephen/Documents/github/spark/python/pyspark/sql/tests/test_arrow.py", line 470, in test_createDataFrame_from_integer_extension_dtype
    df_from_integer_ext_dtype = self.spark.createDataFrame(pdf_integer_ext_dtype)
  File "/Users/stephen/Documents/github/spark/python/pyspark/sql/session.py", line 601, in createDataFrame
    data, schema, samplingRatio, verifySchema)
  File "/Users/stephen/Documents/github/spark/python/pyspark/sql/pandas/conversion.py", line 277, in createDataFrame
    return self._create_from_pandas_with_arrow(data, schema, timezone)
  File "/Users/stephen/Documents/github/spark/python/pyspark/sql/pandas/conversion.py", line 397, in _create_from_pandas_with_arrow
    arrow_schema = pa.Schema.from_pandas(pdf, preserve_index=False)
  File "pyarrow/types.pxi", line 1078, in pyarrow.lib.Schema.from_pandas
  File "/Users/stephen/opt/miniconda3/envs/spark-dev/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 519, in dataframe_to_types
    type_ = pa.lib._ndarray_to_arrow_type(values, type_)
  File "pyarrow/array.pxi", line 53, in pyarrow.lib._ndarray_to_arrow_type
  File "pyarrow/array.pxi", line 64, in pyarrow.lib._ndarray_to_type
  File "pyarrow/error.pxi", line 107, in pyarrow.lib.check_status
pyarrow.lib.ArrowTypeError: Did not pass numpy.dtype object
{noformat}
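The pyarrow >= 0.17.0 failure occurs because `pa.Array.from_pandas` is handed a `mask` for an object that converts itself via the `__arrow_array__` protocol, which pyarrow rejects. A minimal sketch of the branching a serializer could do, using a stand-in class so it runs without pyarrow (all names here are hypothetical illustrations, not PySpark's actual fix):

```python
class ProtocolBackedColumn:
    """Stand-in for a pandas extension array (e.g. IntegerArray) that
    implements the __arrow_array__ protocol."""
    def __init__(self, values):
        self.values = values

    def __arrow_array__(self, type=None):
        # A real extension array returns a pyarrow.Array here; a plain
        # list keeps this sketch dependency-free.
        return list(self.values)

def to_arrow(column, mask=None, type=None):
    """Hypothetical create_array helper: when the column defines
    __arrow_array__, delegate to the protocol and do NOT pass a mask,
    which is exactly what pyarrow >= 0.17 rejects (the ValueError in
    the traceback above)."""
    if hasattr(column, "__arrow_array__"):
        return column.__arrow_array__(type=type)
    # Fallback path for plain numpy-backed columns (sketched here as a
    # list comprehension that drops masked-out entries).
    return [v for i, v in enumerate(column) if mask is None or not mask[i]]
```

The design point is that extension arrays own their null handling, so the conversion layer must not impose an external mask on them.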
[jira] [Commented] (SPARK-29640) [K8S] Intermittent "java.net.UnknownHostException: kubernetes.default.svc" in Spark driver
[ https://issues.apache.org/jira/browse/SPARK-29640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17127234#comment-17127234 ] Prabhakar commented on SPARK-29640: --- Is there a way to explicitly configure the Kube API server Url? if so, specifying the complete DNS name e.g. kubernetes.default.svc.cluster.local might help > [K8S] Intermittent "java.net.UnknownHostException: kubernetes.default.svc" in > Spark driver > -- > > Key: SPARK-29640 > URL: https://issues.apache.org/jira/browse/SPARK-29640 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Spark Core >Affects Versions: 2.4.4 >Reporter: Andy Grove >Priority: Major > > We are running into intermittent DNS issues where the Spark driver fails to > resolve "kubernetes.default.svc" when trying to create executors. We are > running Spark 2.4.4 (with the patch for SPARK-28921) in cluster mode in EKS. > This happens approximately 10% of the time. > Here is the stack trace: > {code:java} > Exception in thread "main" org.apache.spark.SparkException: External > scheduler cannot be instantiated > at > org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2794) > at org.apache.spark.SparkContext.(SparkContext.scala:493) > at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2520) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:935) > at > org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:926) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:926) > at com.rms.execution.test.SparkPiTask$.main(SparkPiTask.scala:36) > at com.rms.execution.test.SparkPiTask.main(SparkPiTask.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) > at > org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845) > at > org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161) > at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184) > at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86) > at > org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Operation: > [get] for kind: [Pod] with name: > [wf-5-69674f15d0fc45-1571354060179-driver] in namespace: > [tenant-8-workflows] failed. > at > io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64) > at > io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:72) > at > io.fabric8.kubernetes.client.dsl.base.BaseOperation.getMandatory(BaseOperation.java:229) > at > io.fabric8.kubernetes.client.dsl.base.BaseOperation.get(BaseOperation.java:162) > at > org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator$$anonfun$1.apply(ExecutorPodsAllocator.scala:57) > at > org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator$$anonfun$1.apply(ExecutorPodsAllocator.scala:55) > at scala.Option.map(Option.scala:146) > at > org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.(ExecutorPodsAllocator.scala:55) > at > org.apache.spark.scheduler.cluster.k8s.KubernetesClusterManager.createSchedulerBackend(KubernetesClusterManager.scala:89) > at > 
org.apache.spark.SparkContext$.org$apache$spark$SparkContext$$createTaskScheduler(SparkContext.scala:2788) > ... 20 more > Caused by: java.net.UnknownHostException: kubernetes.default.svc: Try again > at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method) > at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:929) > at > java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1324) > at java.net.InetAddress.getAllByName0(InetAddress.java:1277) > at java.net.InetAddress.getAllByName(InetAddress.java:1193) > at java.net.InetAddress.getAllByName(InetAddress.java:1127) > at okhttp3.Dns$1.lookup(Dns.java:39) > at > okh
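The mitigation suggested in the comment above — using the fully qualified service name so resolution does not depend on the resolver's DNS search path — can be sketched as follows. This is a hedged illustration only: the helper name, default cluster domain, and port are assumptions for the example, not Spark configuration.

```python
# Hedged sketch (all names here are assumptions for illustration):
# build a fully qualified Kubernetes API server URL so the driver's
# resolver does not rely on search-path expansion of
# "kubernetes.default.svc", which is what intermittently fails above.
def qualified_master_url(service="kubernetes.default.svc",
                         cluster_domain="cluster.local",
                         port=443):
    # Append the cluster domain only if the name is not already qualified.
    if service.endswith(cluster_domain):
        host = service
    else:
        host = f"{service}.{cluster_domain}"
    return f"k8s://https://{host}:{port}"

print(qualified_master_url())
# k8s://https://kubernetes.default.svc.cluster.local:443
```

Whether Spark/fabric8 exposes a knob to override the in-cluster API server host depends on the client version; the sketch only shows the URL shape the comment proposes passing explicitly.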
[jira] [Updated] (SPARK-31920) Failure in converting pandas DataFrames with Arrow when columns implement __arrow_array__
[ https://issues.apache.org/jira/browse/SPARK-31920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephen Caraher updated SPARK-31920: Summary: Failure in converting pandas DataFrames with Arrow when columns implement __arrow_array__ (was: Failure in converting pandas DataFrames with columns via Arrow) > Failure in converting pandas DataFrames with Arrow when columns implement > __arrow_array__ > - > > Key: SPARK-31920 > URL: https://issues.apache.org/jira/browse/SPARK-31920 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.5, 3.0.0, 3.1.0 > Environment: pandas: 1.0.0 - 1.0.4 > pyarrow: 0.15.1 - 0.17.1 >Reporter: Stephen Caraher >Priority: Major > > When calling {{createDataFrame}} on a pandas DataFrame in which any of the > columns are backed by an array implementing {{\_\_arrow_array\_\_}} > ({{StringArray}}, {{IntegerArray}}, etc), the conversion will fail. > With pyarrow >= 0.17.0, the following exception occurs: > {noformat} > Traceback (most recent call last): > File > "/Users/stephen/Documents/github/spark/python/pyspark/sql/tests/test_arrow.py", > line 470, in test_createDataFrame_from_integer_extension_dtype > df_from_integer_ext_dtype = > self.spark.createDataFrame(pdf_integer_ext_dtype) > File "/Users/stephen/Documents/github/spark/python/pyspark/sql/session.py", > line 601, in createDataFrame > data, schema, samplingRatio, verifySchema) > File > "/Users/stephen/Documents/github/spark/python/pyspark/sql/pandas/conversion.py", > line 277, in createDataFrame > return self._create_from_pandas_with_arrow(data, schema, timezone) > File > "/Users/stephen/Documents/github/spark/python/pyspark/sql/pandas/conversion.py", > line 435, in _create_from_pandas_with_arrow > jrdd = self._sc._serialize_to_jvm(arrow_data, ser, reader_func, > create_RDD_server) > File "/Users/stephen/Documents/github/spark/python/pyspark/context.py", > line 570, in _serialize_to_jvm > serializer.dump_stream(data, tempFile) > File > 
"/Users/stephen/Documents/github/spark/python/pyspark/sql/pandas/serializers.py", > line 204, in dump_stream > super(ArrowStreamPandasSerializer, self).dump_stream(batches, stream) > File > "/Users/stephen/Documents/github/spark/python/pyspark/sql/pandas/serializers.py", > line 88, in dump_stream > for batch in iterator: > File > "/Users/stephen/Documents/github/spark/python/pyspark/sql/pandas/serializers.py", > line 203, in > batches = (self._create_batch(series) for series in iterator) > File > "/Users/stephen/Documents/github/spark/python/pyspark/sql/pandas/serializers.py", > line 194, in _create_batch > arrs.append(create_array(s, t)) > File > "/Users/stephen/Documents/github/spark/python/pyspark/sql/pandas/serializers.py", > line 161, in create_array > array = pa.Array.from_pandas(s, mask=mask, type=t, safe=self._safecheck) > File "pyarrow/array.pxi", line 805, in pyarrow.lib.Array.from_pandas > File "pyarrow/array.pxi", line 215, in pyarrow.lib.array > File "pyarrow/array.pxi", line 104, in > pyarrow.lib._handle_arrow_array_protocol > ValueError: Cannot specify a mask or a size when passing an object that is > converted with the __arrow_array__ protocol. 
> {noformat} > With pyarrow < 0.17.0, the conversion will fail earlier in the process, > during schema extraction: > {noformat} > File > "/Users/stephen/Documents/github/spark/python/pyspark/sql/tests/test_arrow.py", > line 470, in test_createDataFrame_from_integer_extension_dtype > df_from_integer_ext_dtype = > self.spark.createDataFrame(pdf_integer_ext_dtype) > File "/Users/stephen/Documents/github/spark/python/pyspark/sql/session.py", > line 601, in createDataFrame > data, schema, samplingRatio, verifySchema) > File > "/Users/stephen/Documents/github/spark/python/pyspark/sql/pandas/conversion.py", > line 277, in createDataFrame > return self._create_from_pandas_with_arrow(data, schema, timezone) > File > "/Users/stephen/Documents/github/spark/python/pyspark/sql/pandas/conversion.py", > line 397, in _create_from_pandas_with_arrow > arrow_schema = pa.Schema.from_pandas(pdf, preserve_index=False) > File "pyarrow/types.pxi", line 1078, in pyarrow.lib.Schema.from_pandas > File > "/Users/stephen/opt/miniconda3/envs/spark-dev/lib/python3.7/site-packages/pyarrow/pandas_compat.py", > line 519, in dataframe_to_types > type_ = pa.lib._ndarray_to_arrow_type(values, type_) > File "pyarrow/array.pxi", line 53, in pyarrow.lib._ndarray_to_arrow_type > File "pyarrow/array.pxi", line 64, in pyarrow.lib._ndarray_to_type > File "pyarrow/error.pxi", line 107, in pyarrow.lib.check_status > pyarrow.lib.ArrowTypeError: Did not pass numpy.dtyp
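The pyarrow >= 0.17.0 failure mode described above can be modeled in plain Python. This is a simplified toy model, not pyarrow's actual code: when a column object implements the `__arrow_array__` protocol, the conversion takes the protocol path and rejects an explicit `mask` argument, which Spark's serializer always supplies.

```python
# Toy model (assumption: simplified stand-in, NOT pyarrow's real
# implementation) of why pa.Array.from_pandas fails when the column
# implements the __arrow_array__ protocol and a mask is also passed.
class ExtensionColumn:
    """Stands in for a pandas extension array (StringArray, IntegerArray,
    etc.) that implements the __arrow_array__ protocol."""
    def __arrow_array__(self, type=None):
        return "converted-by-protocol"

def from_pandas(values, mask=None, type=None):
    # With pyarrow >= 0.17, the protocol path is taken first and an
    # explicit mask is rejected -- the ValueError seen in the traceback.
    if hasattr(values, "__arrow_array__"):
        if mask is not None:
            raise ValueError(
                "Cannot specify a mask or a size when passing an object "
                "that is converted with the __arrow_array__ protocol.")
        return values.__arrow_array__(type=type)
    return list(values)

try:
    from_pandas(ExtensionColumn(), mask=[False])
except ValueError as e:
    print("reproduced:", e)
```

The fix direction implied by the issue is for Spark to skip passing `mask` (or the protocol path) for such columns; the toy model only illustrates the conflict.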
[jira] [Created] (SPARK-31922) TransportRequestHandler Error when exit spark-shell with local-cluster mode
wuyi created SPARK-31922: Summary: TransportRequestHandler Error when exit spark-shell with local-cluster mode Key: SPARK-31922 URL: https://issues.apache.org/jira/browse/SPARK-31922 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.0 Reporter: wuyi There's always an error from TransportRequestHandler when exiting spark-shell under local-cluster mode: {code:java} 20/06/06 23:08:29 ERROR TransportRequestHandler: Error while invoking RpcHandler#receive() for one-way message.20/06/06 23:08:29 ERROR TransportRequestHandler: Error while invoking RpcHandler#receive() for one-way message.org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already stopped. at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:167) at org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:150) at org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:691) at org.apache.spark.network.server.TransportRequestHandler.processOneWayMessage(TransportRequestHandler.java:253) at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:111) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:140) at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53) at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:102) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650) at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) at 
io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) at java.lang.Thread.run(Thread.java:748)20/06/06 23:08:29 ERROR TransportRequestHandler: Error while invoking RpcHandler#receive() for one-way message.org.apache.spark.rpc.RpcEnvStoppedException: RpcEnv already stopped. at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:167) at org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:150) at org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:691) at org.apache.spark.network.server.TransportRequestHandler.processOneWayMessage(TransportRequestHandler.java:253) at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:111) at org.apache.spark.network.server.T
[jira] [Commented] (SPARK-31922) TransportRequestHandler Error when exit spark-shell with local-cluster mode
[ https://issues.apache.org/jira/browse/SPARK-31922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17127370#comment-17127370 ] wuyi commented on SPARK-31922: -- I am working on this. > TransportRequestHandler Error when exit spark-shell with local-cluster mode > --- > > Key: SPARK-31922 > URL: https://issues.apache.org/jira/browse/SPARK-31922 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: wuyi >Priority: Major
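The RpcEnvStoppedException in SPARK-31922 is a shutdown race: a one-way RPC message arrives after the shell's RpcEnv has already been stopped. A toy model of that race, purely illustrative and not Spark's Dispatcher code, looks like this:

```python
# Toy model (assumption: heavily simplified, NOT Spark's Dispatcher):
# the error above occurs when a one-way message is posted after stop().
class Dispatcher:
    def __init__(self):
        self.stopped = False
        self.inbox = []

    def stop(self):
        self.stopped = True

    def post_one_way_message(self, msg):
        # Spark raises RpcEnvStoppedException here; modeled as RuntimeError.
        if self.stopped:
            raise RuntimeError("RpcEnv already stopped.")
        self.inbox.append(msg)

d = Dispatcher()
d.post_one_way_message("worker-heartbeat")  # accepted while running
d.stop()                                    # spark-shell exits
try:
    d.post_one_way_message("late-message")  # in-flight RPC arrives late
except RuntimeError as e:
    print("logged as ERROR:", e)            # harmless but noisy
```

This suggests why the error is cosmetic: the message simply loses the race with shutdown, so suppressing or downgrading the log is the likely fix direction.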
[jira] [Resolved] (SPARK-31904) Char and varchar partition columns throw MetaException
[ https://issues.apache.org/jira/browse/SPARK-31904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-31904. -- Fix Version/s: 3.0.0 Assignee: Lantao Jin Resolution: Fixed Resolved by [https://github.com/apache/spark/pull/28724] > Char and varchar partition columns throw MetaException > -- > > Key: SPARK-31904 > URL: https://issues.apache.org/jira/browse/SPARK-31904 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Lantao Jin >Assignee: Lantao Jin >Priority: Major > Fix For: 3.0.0 > > > {code} > CREATE TABLE t1(a STRING, B VARCHAR(10), C CHAR(10)) STORED AS parquet; > CREATE TABLE t2 USING parquet PARTITIONED BY (b, c) AS SELECT * FROM t1; > SELECT * FROM t2 WHERE b = 'A'; > {code} > Above SQL throws MetaException > {quote} > Caused by: java.lang.reflect.InvocationTargetException > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:810) > ... 
114 more > Caused by: MetaException(message:Filtering is supported only on partition > keys of type string, or integral types) > at > org.apache.hadoop.hive.metastore.parser.ExpressionTree$FilterBuilder.setError(ExpressionTree.java:184) > at > org.apache.hadoop.hive.metastore.parser.ExpressionTree$LeafNode.getJdoFilterPushdownParam(ExpressionTree.java:439) > at > org.apache.hadoop.hive.metastore.parser.ExpressionTree$LeafNode.generateJDOFilterOverPartitions(ExpressionTree.java:356) > at > org.apache.hadoop.hive.metastore.parser.ExpressionTree$LeafNode.generateJDOFilter(ExpressionTree.java:278) > at > org.apache.hadoop.hive.metastore.parser.ExpressionTree.generateJDOFilterFragment(ExpressionTree.java:583) > at > org.apache.hadoop.hive.metastore.ObjectStore.makeQueryFilterString(ObjectStore.java:3315) > at > org.apache.hadoop.hive.metastore.ObjectStore.getPartitionsViaOrmFilter(ObjectStore.java:2768) > at > org.apache.hadoop.hive.metastore.ObjectStore.access$500(ObjectStore.java:182) > at > org.apache.hadoop.hive.metastore.ObjectStore$7.getJdoResult(ObjectStore.java:3248) > at > org.apache.hadoop.hive.metastore.ObjectStore$7.getJdoResult(ObjectStore.java:3232) > at > org.apache.hadoop.hive.metastore.ObjectStore$GetHelper.run(ObjectStore.java:2974) > at > org.apache.hadoop.hive.metastore.ObjectStore.getPartitionsByFilterInternal(ObjectStore.java:3250) > at > org.apache.hadoop.hive.metastore.ObjectStore.getPartitionsByFilter(ObjectStore.java:2906) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.hive.metastore.RawStoreProxy.invoke(RawStoreProxy.java:101) > at com.sun.proxy.$Proxy25.getPartitionsByFilter(Unknown Source) > at > 
org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_partitions_by_filter(HiveMetaStore.java:5093) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.hive.metastore.RetryingHMSHandler.invokeInternal(RetryingHMSHandler.java:148) > at > org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:107) > at com.sun.proxy.$Proxy26.get_partitions_by_filter(Unknown Source) > at > org.apache.hadoop.hive.metastore.HiveMetaStoreClient.listPartitionsByFilter(HiveMetaStoreClient.java:1232) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:173) > at com.sun.proxy.$Proxy27.listPartitionsByFilter(Unknown Source) > at > org.apache.hadoop.hiv
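The MetaException above states the constraint directly: the metastore's partition-filter pushdown only supports partition keys of string or integral types, so VARCHAR/CHAR partition columns (as in table t2's `b` and `c`) are rejected. A minimal sketch of that type check, a simplified model rather than Hive's actual ExpressionTree code:

```python
# Simplified model (assumption: not Hive's real implementation) of the
# metastore rule "Filtering is supported only on partition keys of type
# string, or integral types" that makes t2's varchar/char keys fail.
PUSHDOWN_TYPES = {"string", "tinyint", "smallint", "int", "bigint"}

def can_push_filter(partition_key_type: str) -> bool:
    return partition_key_type.lower() in PUSHDOWN_TYPES

# Partition columns from the repro: b VARCHAR(10), c CHAR(10)
for col, typ in [("b", "varchar(10)"), ("c", "char(10)")]:
    print(col, typ, "pushdown ok" if can_push_filter(typ) else "MetaException")
```

The fix in the linked PR works around this on the Spark side (e.g. by not pushing such predicates down as-is); the sketch only captures the metastore-side restriction.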
[jira] [Commented] (SPARK-31921) Wrong warning of "WARN Master: App app-xxx requires more resource than any of Workers could have."
[ https://issues.apache.org/jira/browse/SPARK-31921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17127358#comment-17127358 ] Apache Spark commented on SPARK-31921: -- User 'Ngone51' has created a pull request for this issue: https://github.com/apache/spark/pull/28742 > Wrong warning of "WARN Master: App app-xxx requires more resource than any of > Workers could have." > -- > > Key: SPARK-31921 > URL: https://issues.apache.org/jira/browse/SPARK-31921 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: wuyi >Priority: Major > > When starting spark-shell using local cluster mode, e.g. ./bin/spark-shell > --master "local-cluster[2, 1, 1024]", there will be a warning: > > {code:java} > 20/06/06 22:09:09 WARN Master: App app-20200606220908- requires more > resource than any of Workers could have. > {code} > which means the application can not get enough resources to launch at least > one executor. > But that's not true since we can successfully complete a job. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31920) Failure in converting pandas DataFrames with Arrow when columns implement __arrow_array__
[ https://issues.apache.org/jira/browse/SPARK-31920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31920: Assignee: Apache Spark > Failure in converting pandas DataFrames with Arrow when columns implement > __arrow_array__ > - > > Key: SPARK-31920 > URL: https://issues.apache.org/jira/browse/SPARK-31920 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.5, 3.0.0, 3.1.0 > Environment: pandas: 1.0.0 - 1.0.4 > pyarrow: 0.15.1 - 0.17.1 >Reporter: Stephen Caraher >Assignee: Apache Spark >Priority: Major
[jira] [Commented] (SPARK-31920) Failure in converting pandas DataFrames with Arrow when columns implement __arrow_array__
[ https://issues.apache.org/jira/browse/SPARK-31920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17127366#comment-17127366 ] Apache Spark commented on SPARK-31920: -- User 'moskvax' has created a pull request for this issue: https://github.com/apache/spark/pull/28743 > Failure in converting pandas DataFrames with Arrow when columns implement > __arrow_array__ > - > > Key: SPARK-31920 > URL: https://issues.apache.org/jira/browse/SPARK-31920 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.5, 3.0.0, 3.1.0 > Environment: pandas: 1.0.0 - 1.0.4 > pyarrow: 0.15.1 - 0.17.1 >Reporter: Stephen Caraher >Priority: Major
[jira] [Assigned] (SPARK-31920) Failure in converting pandas DataFrames with Arrow when columns implement __arrow_array__
[ https://issues.apache.org/jira/browse/SPARK-31920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31920: Assignee: (was: Apache Spark) > Failure in converting pandas DataFrames with Arrow when columns implement > __arrow_array__ > - > > Key: SPARK-31920 > URL: https://issues.apache.org/jira/browse/SPARK-31920 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.5, 3.0.0, 3.1.0 > Environment: pandas: 1.0.0 - 1.0.4 > pyarrow: 0.15.1 - 0.17.1 >Reporter: Stephen Caraher >Priority: Major
[jira] [Commented] (SPARK-31920) Failure in converting pandas DataFrames with Arrow when columns implement __arrow_array__
[ https://issues.apache.org/jira/browse/SPARK-31920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17127365#comment-17127365 ] Apache Spark commented on SPARK-31920: -- User 'moskvax' has created a pull request for this issue: https://github.com/apache/spark/pull/28743 > Failure in converting pandas DataFrames with Arrow when columns implement > __arrow_array__ > - > > Key: SPARK-31920 > URL: https://issues.apache.org/jira/browse/SPARK-31920 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.5, 3.0.0, 3.1.0 > Environment: pandas: 1.0.0 - 1.0.4 > pyarrow: 0.15.1 - 0.17.1 >Reporter: Stephen Caraher >Priority: Major > > When calling {{createDataFrame}} on a pandas DataFrame in which any of the > columns are backed by an array implementing {{\_\_arrow_array\_\_}} > ({{StringArray}}, {{IntegerArray}}, etc), the conversion will fail. > With pyarrow >= 0.17.0, the following exception occurs: > {noformat} > Traceback (most recent call last): > File > "/Users/stephen/Documents/github/spark/python/pyspark/sql/tests/test_arrow.py", > line 470, in test_createDataFrame_from_integer_extension_dtype > df_from_integer_ext_dtype = > self.spark.createDataFrame(pdf_integer_ext_dtype) > File "/Users/stephen/Documents/github/spark/python/pyspark/sql/session.py", > line 601, in createDataFrame > data, schema, samplingRatio, verifySchema) > File > "/Users/stephen/Documents/github/spark/python/pyspark/sql/pandas/conversion.py", > line 277, in createDataFrame > return self._create_from_pandas_with_arrow(data, schema, timezone) > File > "/Users/stephen/Documents/github/spark/python/pyspark/sql/pandas/conversion.py", > line 435, in _create_from_pandas_with_arrow > jrdd = self._sc._serialize_to_jvm(arrow_data, ser, reader_func, > create_RDD_server) > File "/Users/stephen/Documents/github/spark/python/pyspark/context.py", > line 570, in _serialize_to_jvm > serializer.dump_stream(data, tempFile) > File > 
"/Users/stephen/Documents/github/spark/python/pyspark/sql/pandas/serializers.py", > line 204, in dump_stream > super(ArrowStreamPandasSerializer, self).dump_stream(batches, stream) > File > "/Users/stephen/Documents/github/spark/python/pyspark/sql/pandas/serializers.py", > line 88, in dump_stream > for batch in iterator: > File > "/Users/stephen/Documents/github/spark/python/pyspark/sql/pandas/serializers.py", > line 203, in > batches = (self._create_batch(series) for series in iterator) > File > "/Users/stephen/Documents/github/spark/python/pyspark/sql/pandas/serializers.py", > line 194, in _create_batch > arrs.append(create_array(s, t)) > File > "/Users/stephen/Documents/github/spark/python/pyspark/sql/pandas/serializers.py", > line 161, in create_array > array = pa.Array.from_pandas(s, mask=mask, type=t, safe=self._safecheck) > File "pyarrow/array.pxi", line 805, in pyarrow.lib.Array.from_pandas > File "pyarrow/array.pxi", line 215, in pyarrow.lib.array > File "pyarrow/array.pxi", line 104, in > pyarrow.lib._handle_arrow_array_protocol > ValueError: Cannot specify a mask or a size when passing an object that is > converted with the __arrow_array__ protocol. 
> {noformat} > With pyarrow < 0.17.0, the conversion will fail earlier in the process, > during schema extraction: > {noformat} > File > "/Users/stephen/Documents/github/spark/python/pyspark/sql/tests/test_arrow.py", > line 470, in test_createDataFrame_from_integer_extension_dtype > df_from_integer_ext_dtype = > self.spark.createDataFrame(pdf_integer_ext_dtype) > File "/Users/stephen/Documents/github/spark/python/pyspark/sql/session.py", > line 601, in createDataFrame > data, schema, samplingRatio, verifySchema) > File > "/Users/stephen/Documents/github/spark/python/pyspark/sql/pandas/conversion.py", > line 277, in createDataFrame > return self._create_from_pandas_with_arrow(data, schema, timezone) > File > "/Users/stephen/Documents/github/spark/python/pyspark/sql/pandas/conversion.py", > line 397, in _create_from_pandas_with_arrow > arrow_schema = pa.Schema.from_pandas(pdf, preserve_index=False) > File "pyarrow/types.pxi", line 1078, in pyarrow.lib.Schema.from_pandas > File > "/Users/stephen/opt/miniconda3/envs/spark-dev/lib/python3.7/site-packages/pyarrow/pandas_compat.py", > line 519, in dataframe_to_types > type_ = pa.lib._ndarray_to_arrow_type(values, type_) > File "pyarrow/array.pxi", line 53, in pyarrow.lib._ndarray_to_arrow_type > File "pyarrow/array.pxi", line 64, in pyarrow.lib._ndarray_to_type > File "pyarrow/error.pxi", line 107, in pyarrow.lib.check_status > pyarrow.lib.ArrowTypeError: Did not pass numpy.dtype object > {noformat}
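The failure above hinges on pyarrow's `__arrow_array__` protocol: an object implementing it converts itself to Arrow, so the caller must not also pass a `mask` (or `type`) alongside it. The following self-contained sketch mimics that dispatch in plain Python; the names (`ExtensionColumn`, `convert_column`) are hypothetical stand-ins, not the actual code in pyarrow's `Array.from_pandas` or PySpark's `ArrowStreamPandasSerializer`.

```python
# Self-contained sketch of the dispatch problem behind this issue.
# The names here are hypothetical stand-ins; the real logic lives in
# pyarrow's Array.from_pandas and PySpark's ArrowStreamPandasSerializer.
# An object that implements __arrow_array__ converts itself to Arrow,
# so the caller must not also pass a mask alongside it.

class ExtensionColumn:
    """Stand-in for a pandas extension array such as IntegerArray."""

    def __init__(self, values):
        self.values = list(values)

    def __arrow_array__(self, type=None):
        # The object controls its own conversion, including null handling.
        return list(self.values)


def convert_column(column, mask=None):
    """Sketch of a converter that respects the __arrow_array__ protocol."""
    if hasattr(column, "__arrow_array__"):
        if mask is not None:
            # Mirrors the ValueError raised by pyarrow >= 0.17.0.
            raise ValueError(
                "Cannot specify a mask when passing an object that is "
                "converted with the __arrow_array__ protocol.")
        return column.__arrow_array__()
    # Plain sequences: apply the null mask ourselves (True means null).
    mask = mask or [False] * len(column)
    return [None if m else v for v, m in zip(column, mask)]


print(convert_column(ExtensionColumn([1, None, 3])))         # [1, None, 3]
print(convert_column([1, 2, 3], mask=[False, True, False]))  # [1, None, 3]
```

In this framing, the fix direction suggested by the traceback is to take the protocol branch for extension-backed columns instead of unconditionally passing a mask.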
[jira] [Assigned] (SPARK-31921) Wrong warning of "WARN Master: App app-xxx requires more resource than any of Workers could have."
[ https://issues.apache.org/jira/browse/SPARK-31921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31921: Assignee: (was: Apache Spark) > Wrong warning of "WARN Master: App app-xxx requires more resource than any of > Workers could have." > -- > > Key: SPARK-31921 > URL: https://issues.apache.org/jira/browse/SPARK-31921 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: wuyi >Priority: Major > > When starting spark-shell using local cluster mode, e.g. ./bin/spark-shell > --master "local-cluster[2, 1, 1024]", there will be a warning: > > {code:java} > 20/06/06 22:09:09 WARN Master: App app-20200606220908- requires more > resource than any of Workers could have. > {code} > which means the application can not get enough resources to launch at least > one executor. > But that's not true since we can successfully complete a job.
[jira] [Assigned] (SPARK-31921) Wrong warning of "WARN Master: App app-xxx requires more resource than any of Workers could have."
[ https://issues.apache.org/jira/browse/SPARK-31921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31921: Assignee: Apache Spark > Wrong warning of "WARN Master: App app-xxx requires more resource than any of > Workers could have." > -- > > Key: SPARK-31921 > URL: https://issues.apache.org/jira/browse/SPARK-31921 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: wuyi >Assignee: Apache Spark >Priority: Major > > When starting spark-shell using local cluster mode, e.g. ./bin/spark-shell > --master "local-cluster[2, 1, 1024]", there will be a warning: > > {code:java} > 20/06/06 22:09:09 WARN Master: App app-20200606220908- requires more > resource than any of Workers could have. > {code} > which means the application can not get enough resources to launch at least > one executor. > But that's not true since we can successfully complete a job.
[jira] [Commented] (SPARK-31921) Wrong warning of "WARN Master: App app-xxx requires more resource than any of Workers could have."
[ https://issues.apache.org/jira/browse/SPARK-31921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17127357#comment-17127357 ] Apache Spark commented on SPARK-31921: -- User 'Ngone51' has created a pull request for this issue: https://github.com/apache/spark/pull/28742 > Wrong warning of "WARN Master: App app-xxx requires more resource than any of > Workers could have." > -- > > Key: SPARK-31921 > URL: https://issues.apache.org/jira/browse/SPARK-31921 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: wuyi >Priority: Major > > When starting spark-shell using local cluster mode, e.g. ./bin/spark-shell > --master "local-cluster[2, 1, 1024]", there will be a warning: > > {code:java} > 20/06/06 22:09:09 WARN Master: App app-20200606220908- requires more > resource than any of Workers could have. > {code} > which means the application can not get enough resources to launch at least > one executor. > But that's not true since we can successfully complete a job.
[jira] [Updated] (SPARK-31921) Wrong warning of "WARN Master: App app-xxx requires more resource than any of Workers could have."
[ https://issues.apache.org/jira/browse/SPARK-31921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuyi updated SPARK-31921: - Description: When starting spark-shell using local cluster mode, e.g. ./bin/spark-shell --master "local-cluster[2, 1, 1024]", there will be a warning: {code:java} 20/06/06 22:09:09 WARN Master: App app-20200606220908- requires more resource than any of Workers could have. {code} which means the application can not get enough resources to launch at least one executor. But that's not true since we can successfully complete a job. was: When start spark-shell using local cluster mode, e.g. ./bin/spark-shell --master "local-cluster[2, 1, 1024]", there will be a warning: {code:java} 20/06/06 22:09:09 WARN Master: App app-20200606220908- requires more resource than any of Workers could have. {code} which means the application can not get enough resources to launch at least one executor. But that's not true since we can successfully complete a job. > Wrong warning of "WARN Master: App app-xxx requires more resource than any of > Workers could have." > -- > > Key: SPARK-31921 > URL: https://issues.apache.org/jira/browse/SPARK-31921 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: wuyi >Priority: Major > > When starting spark-shell using local cluster mode, e.g. ./bin/spark-shell > --master "local-cluster[2, 1, 1024]", there will be a warning: > > {code:java} > 20/06/06 22:09:09 WARN Master: App app-20200606220908- requires more > resource than any of Workers could have. > {code} > which means the application can not get enough resources to launch at least > one executor. > But that's not true since we can successfully complete a job.
[jira] [Created] (SPARK-31921) Wrong warning of "WARN Master: App app-xxx requires more resource than any of Workers could have."
wuyi created SPARK-31921: Summary: Wrong warning of "WARN Master: App app-xxx requires more resource than any of Workers could have." Key: SPARK-31921 URL: https://issues.apache.org/jira/browse/SPARK-31921 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.0 Reporter: wuyi When start spark-shell using local cluster mode, e.g. ./bin/spark-shell --master "local-cluster[2, 1, 1024]", there will be a warning: {code:java} 20/06/06 22:09:09 WARN Master: App app-20200606220908- requires more resource than any of Workers could have. {code} which means the application can not get enough resources to launch at least one executor. But that's not true since we can successfully complete a job.
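The spurious warning above comes from a per-worker feasibility check in the standalone Master: it warns when no single worker appears able to host even one executor of the requested size. The following is a simplified, hedged sketch of that kind of check in pure Python; the names (`WorkerInfo`, `can_launch_executor`) are hypothetical stand-ins, not the actual code in `org.apache.spark.deploy.master.Master`.

```python
# Simplified sketch of the per-worker feasibility check behind the warning.
# Names are hypothetical; the real logic lives in
# org.apache.spark.deploy.master.Master. The Master warns when no single
# worker could host even one executor of the requested size.

from dataclasses import dataclass


@dataclass
class WorkerInfo:
    cores: int
    memory_mb: int


def can_launch_executor(workers, exec_cores, exec_memory_mb):
    """True if at least one worker can host a single executor."""
    return any(w.cores >= exec_cores and w.memory_mb >= exec_memory_mb
               for w in workers)


# local-cluster[2, 1, 1024]: two workers with 1 core and 1024 MB each.
workers = [WorkerInfo(1, 1024), WorkerInfo(1, 1024)]

# Checked against the actual per-executor demand, the warning is spurious:
if can_launch_executor(workers, exec_cores=1, exec_memory_mb=1024):
    print("executors can be launched; no warning needed")
else:
    print("WARN Master: App requires more resource than any of Workers could have.")
```

Under this sketch, comparing against an inflated per-executor requirement (e.g. the app's total demand instead of one executor's share) is exactly what would make the warning fire even though executors launch fine.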
[jira] [Updated] (SPARK-31903) toPandas with Arrow enabled doesn't show metrics in Query UI.
[ https://issues.apache.org/jira/browse/SPARK-31903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-31903: - Fix Version/s: 2.4.7 > toPandas with Arrow enabled doesn't show metrics in Query UI. > - > > Key: SPARK-31903 > URL: https://issues.apache.org/jira/browse/SPARK-31903 > Project: Spark > Issue Type: Bug > Components: PySpark, R >Affects Versions: 2.4.5, 3.0.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > Fix For: 3.0.0, 2.4.7 > > Attachments: Screen Shot 2020-06-03 at 4.47.07 PM.png, Screen Shot > 2020-06-03 at 4.47.27 PM.png > > > When calling {{toPandas}}, usually Query UI shows each plan node's metric and > corresponding Stage ID and Task ID: > {code:java} > >>> df = spark.createDataFrame([(1, 10, 'abc'), (2, 20, 'def')], schema=['x', > >>> 'y', 'z']) > >>> df.toPandas() >x yz > 0 1 10 abc > 1 2 20 def > {code} > !Screen Shot 2020-06-03 at 4.47.07 PM.png! > but if Arrow execution is enabled, it shows only plan nodes and the duration > is not correct: > {code:java} > >>> spark.conf.set('spark.sql.execution.arrow.pyspark.enabled', True) > >>> df.toPandas() >x yz > 0 1 10 abc > 1 2 20 def{code} > > !Screen Shot 2020-06-03 at 4.47.27 PM.png!
[jira] [Assigned] (SPARK-31919) Push down more predicates through Join
[ https://issues.apache.org/jira/browse/SPARK-31919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31919: Assignee: Apache Spark (was: Gengliang Wang) > Push down more predicates through Join > -- > > Key: SPARK-31919 > URL: https://issues.apache.org/jira/browse/SPARK-31919 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Major > > Currently, in `PushPredicateThroughJoin`, if the condition predicate of the `Or` > operator can't be entirely pushed down, it is thrown away. > In fact, the predicates under `Or` operators can be partially pushed down. > For example, say `a` and `b` can be pushed into one of the joined > tables while `c` can't be pushed down; then the predicate > `a or (b and c)` > can be converted to > `(a or b) and (a or c)`, > and we can still push down `(a or b)`. > A disjunctive predicate cannot be pushed down, even partially, only when one > of its children is not partially convertible.
[jira] [Commented] (SPARK-31919) Push down more predicates through Join
[ https://issues.apache.org/jira/browse/SPARK-31919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17127263#comment-17127263 ] Apache Spark commented on SPARK-31919: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/28741 > Push down more predicates through Join > -- > > Key: SPARK-31919 > URL: https://issues.apache.org/jira/browse/SPARK-31919 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > Currently, in `PushPredicateThroughJoin`, if the condition predicate of the `Or` > operator can't be entirely pushed down, it is thrown away. > In fact, the predicates under `Or` operators can be partially pushed down. > For example, say `a` and `b` can be pushed into one of the joined > tables while `c` can't be pushed down; then the predicate > `a or (b and c)` > can be converted to > `(a or b) and (a or c)`, > and we can still push down `(a or b)`. > A disjunctive predicate cannot be pushed down, even partially, only when one > of its children is not partially convertible.
[jira] [Assigned] (SPARK-31919) Push down more predicates through Join
[ https://issues.apache.org/jira/browse/SPARK-31919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31919: Assignee: Gengliang Wang (was: Apache Spark) > Push down more predicates through Join > -- > > Key: SPARK-31919 > URL: https://issues.apache.org/jira/browse/SPARK-31919 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > > Currently, in `PushPredicateThroughJoin`, if the condition predicate of the `Or` > operator can't be entirely pushed down, it is thrown away. > In fact, the predicates under `Or` operators can be partially pushed down. > For example, say `a` and `b` can be pushed into one of the joined > tables while `c` can't be pushed down; then the predicate > `a or (b and c)` > can be converted to > `(a or b) and (a or c)`, > and we can still push down `(a or b)`. > A disjunctive predicate cannot be pushed down, even partially, only when one > of its children is not partially convertible.
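The partial pushdown described in SPARK-31919 rests on distributing Or over And: `a or (b and c)` is equivalent to `(a or b) and (a or c)`, after which any conjunct whose atoms all reference one side of the join can be pushed. The following self-contained sketch illustrates the idea; the representation (atoms as strings, disjuncts as tuples, a `SIDE` table) is hypothetical and much simpler than Catalyst's `PushPredicateThroughJoin`, which works on expression trees.

```python
# Sketch of partially pushing a disjunctive predicate through a join.
# The representation is hypothetical (atoms as strings, disjuncts as
# tuples); Catalyst's PushPredicateThroughJoin works on expression trees.

from itertools import product

# Which join side each atom references (assumed for illustration; "both"
# marks a predicate like `c` that cannot be pushed to either side alone).
SIDE = {"a": "left", "b": "left", "c": "both"}


def distribute_or(left_conjuncts, right_conjuncts):
    """(A1 and A2 ...) or (B1 and B2 ...)  ==>  and of all (Ai or Bj)."""
    return [(x, y) for x, y in product(left_conjuncts, right_conjuncts)]


def pushable(disjunct, target_side):
    # (x or y) can be pushed to one side only if every atom in it
    # references that side alone.
    return all(SIDE[atom] == target_side for atom in disjunct)


# a or (b and c), with a and b referencing only the left table:
conjuncts = distribute_or(["a"], ["b", "c"])        # [(a or b), (a or c)]
pushed = [d for d in conjuncts if pushable(d, "left")]
print(pushed)  # only (a or b) survives: [('a', 'b')]
```

The non-pushed remainder (`a or c`, here) must still be kept as a join filter, since the pushed conjunct is only a necessary condition, not an equivalent one.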