[ https://issues.apache.org/jira/browse/SPARK-20128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Imran Rashid updated SPARK-20128:
---------------------------------
    Description: 

One Jenkins run failed because the MetricsSystem was never killed after a failed test, which caused that test to hang and the whole run to time out: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75176

{noformat}
17/03/24 13:44:19.537 dag-scheduler-event-loop ERROR DAGSchedulerEventProcessLoop: DAGSchedulerEventProcessLoop failed; shutting down SparkContext
java.lang.ArrayIndexOutOfBoundsException: -1
	at org.apache.spark.MapOutputTrackerMaster$$anonfun$getEpochForMapOutput$1.apply(MapOutputTracker.scala:431)
	at org.apache.spark.MapOutputTrackerMaster$$anonfun$getEpochForMapOutput$1.apply(MapOutputTracker.scala:430)
	at scala.Option.flatMap(Option.scala:171)
	at org.apache.spark.MapOutputTrackerMaster.getEpochForMapOutput(MapOutputTracker.scala:430)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1298)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1731)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1689)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1678)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
17/03/24 13:44:19.540 dispatcher-event-loop-11 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
17/03/24 13:44:19.546 stop-spark-context INFO MemoryStore: MemoryStore cleared
17/03/24 13:44:19.546 stop-spark-context INFO BlockManager: BlockManager stopped
17/03/24 13:44:19.546 stop-spark-context INFO BlockManagerMaster: BlockManagerMaster stopped
17/03/24 13:44:19.546 dispatcher-event-loop-16 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
17/03/24 13:44:19.547 stop-spark-context INFO SparkContext: Successfully stopped SparkContext
17/03/24 14:02:19.934 metrics-console-reporter-1-thread-1 ERROR ScheduledReporter: RuntimeException thrown from ConsoleReporter#report. Exception was suppressed.
java.lang.NullPointerException
	at org.apache.spark.deploy.master.MasterSource$$anon$2.getValue(MasterSource.scala:35)
	at org.apache.spark.deploy.master.MasterSource$$anon$2.getValue(MasterSource.scala:34)
	at com.codahale.metrics.ConsoleReporter.printGauge(ConsoleReporter.java:239)
	...
{noformat}

Unfortunately I didn't save the entire test logs, but what happens is that the initial ArrayIndexOutOfBoundsException is a real bug, which causes the SparkContext to stop and the test to fail. However, the MetricsSystem somehow stays alive, and since it's not a daemon thread, the JVM just hangs, and every 20 minutes we get that NPE from within the metrics system as it tries to report.

I am totally perplexed at how this can happen; it looks like the metrics system should always be stopped by the time we see

{noformat}
17/03/24 13:44:19.547 stop-spark-context INFO SparkContext: Successfully stopped SparkContext
{noformat}

I don't think I've ever seen this in real Spark use, but whatever the cause, it doesn't look like something which is limited to tests.
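For illustration, here is a minimal, self-contained sketch of the failure mode described above, using plain JDK scheduling rather than Spark or Codahale internals (all names, the 20-minute period, and the thread name are just stand-ins taken from the log): a periodic reporter task on a non-daemon executor thread keeps the JVM alive after everything else is torn down, and each tick NPEs against already-cleared state.

{noformat}
import java.util.concurrent.{Executors, ThreadFactory, TimeUnit}

object MetricsHangDemo {
  def main(args: Array[String]): Unit = {
    // A non-daemon thread prevents JVM exit until the executor is shut
    // down -- this is the hang described in the report.
    val factory = new ThreadFactory {
      override def newThread(r: Runnable): Thread = {
        val t = new Thread(r, "metrics-console-reporter-1-thread-1")
        t.setDaemon(false)
        t
      }
    }
    val scheduler = Executors.newSingleThreadScheduledExecutor(factory)

    // Stand-in for the state that MasterSource's gauges sample.
    var masterState: String = "ALIVE"

    val reportTick = new Runnable {
      override def run(): Unit =
        // NPEs once masterState is null, like MasterSource.scala:35 above.
        try println(s"status=${masterState.toUpperCase}")
        // ScheduledReporter likewise suppresses the exception and keeps going.
        catch { case e: RuntimeException => e.printStackTrace() }
    }
    scheduler.scheduleAtFixedRate(reportTick, 0, 20, TimeUnit.MINUTES)

    // Simulate shutdown tearing down state without stopping the reporter.
    masterState = null
    // main() returns here, but the JVM stays up and logs an NPE every
    // 20 minutes; scheduler.shutdown() would have let it exit.
  }
}
{noformat}

If `scheduler.shutdown()` were called in the teardown path, the thread would die and the process would exit; the log above suggests SparkContext.stop() is expected to do the equivalent for the MetricsSystem here, but somehow did not.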
> MetricsSystem not always killed in SparkContext.stop()
> ------------------------------------------------------
>
>                 Key: SPARK-20128
>                 URL: https://issues.apache.org/jira/browse/SPARK-20128
>             Project: Spark
>          Issue Type: Test
>          Components: Spark Core, Tests
>    Affects Versions: 2.2.0
>            Reporter: Imran Rashid