[ https://issues.apache.org/jira/browse/SPARK-20128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Imran Rashid updated SPARK-20128:
---------------------------------
    Description: 

One Jenkins run failed because the MetricsSystem was never killed after a failed test, which caused that test to hang and the whole run to time out: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75176

{noformat}
17/03/24 13:44:19.537 dag-scheduler-event-loop ERROR DAGSchedulerEventProcessLoop: DAGSchedulerEventProcessLoop failed; shutting down SparkContext
java.lang.ArrayIndexOutOfBoundsException: -1
	at org.apache.spark.MapOutputTrackerMaster$$anonfun$getEpochForMapOutput$1.apply(MapOutputTracker.scala:431)
	at org.apache.spark.MapOutputTrackerMaster$$anonfun$getEpochForMapOutput$1.apply(MapOutputTracker.scala:430)
	at scala.Option.flatMap(Option.scala:171)
	at org.apache.spark.MapOutputTrackerMaster.getEpochForMapOutput(MapOutputTracker.scala:430)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1298)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1731)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1689)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1678)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
17/03/24 13:44:19.540 dispatcher-event-loop-11 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
17/03/24 13:44:19.546 stop-spark-context INFO MemoryStore: MemoryStore cleared
17/03/24 13:44:19.546 stop-spark-context INFO BlockManager: BlockManager stopped
17/03/24 13:44:19.546 stop-spark-context INFO BlockManagerMaster: BlockManagerMaster stopped
17/03/24 13:44:19.546 dispatcher-event-loop-16 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
17/03/24 13:44:19.547 stop-spark-context INFO SparkContext: Successfully stopped SparkContext
17/03/24 14:02:19.934 metrics-console-reporter-1-thread-1 ERROR ScheduledReporter: RuntimeException thrown from ConsoleReporter#report. Exception was suppressed.
java.lang.NullPointerException
	at org.apache.spark.deploy.master.MasterSource$$anon$2.getValue(MasterSource.scala:35)
	at org.apache.spark.deploy.master.MasterSource$$anon$2.getValue(MasterSource.scala:34)
	at com.codahale.metrics.ConsoleReporter.printGauge(ConsoleReporter.java:239)
	...
{noformat}

Unfortunately I didn't save the entire test logs, but what happens is that the initial ArrayIndexOutOfBoundsException is a real bug, which causes the SparkContext to stop and the test to fail. However, the MetricsSystem somehow stays alive, and since it's not a daemon thread, the JVM just hangs, and every 20 minutes we get that NPE from within the metrics system as it tries to report.

I am totally perplexed at how this can happen; it looks like the metrics system should always be stopped by the time we see

{noformat}
17/03/24 13:44:19.547 stop-spark-context INFO SparkContext: Successfully stopped SparkContext
{noformat}

I don't think I've ever seen this in real Spark use, but whatever the cause, it doesn't look like something which is limited to tests.
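For illustration, here is a minimal, self-contained sketch of the failure mode described above, using plain JDK scheduling rather than Spark or Codahale internals (all names, the 20-minute period, and the thread name are just stand-ins taken from the log): a periodic reporter task on a non-daemon executor thread keeps the JVM alive after everything else is torn down, and each tick NPEs against already-cleared state.

{noformat}
import java.util.concurrent.{Executors, ThreadFactory, TimeUnit}

object MetricsHangDemo {
  def main(args: Array[String]): Unit = {
    // A non-daemon thread prevents JVM exit until the executor is shut
    // down -- this is the hang described in the report.
    val factory = new ThreadFactory {
      override def newThread(r: Runnable): Thread = {
        val t = new Thread(r, "metrics-console-reporter-1-thread-1")
        t.setDaemon(false)
        t
      }
    }
    val scheduler = Executors.newSingleThreadScheduledExecutor(factory)

    // Stand-in for the state that MasterSource's gauges sample.
    var masterState: String = "ALIVE"

    val reportTick = new Runnable {
      override def run(): Unit =
        // NPEs once masterState is null, like MasterSource.scala:35 above.
        try println(s"status=${masterState.toUpperCase}")
        // ScheduledReporter likewise suppresses the exception and keeps going.
        catch { case e: RuntimeException => e.printStackTrace() }
    }
    scheduler.scheduleAtFixedRate(reportTick, 0, 20, TimeUnit.MINUTES)

    // Simulate shutdown tearing down state without stopping the reporter.
    masterState = null
    // main() returns here, but the JVM stays up and logs an NPE every
    // 20 minutes; scheduler.shutdown() would have let it exit.
  }
}
{noformat}

If `scheduler.shutdown()` were called in the teardown path, the thread would die and the process would exit; the log above suggests SparkContext.stop() is expected to do the equivalent for the MetricsSystem here, but somehow did not.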
> MetricsSystem not always killed in SparkContext.stop()
> ------------------------------------------------------
>
>                 Key: SPARK-20128
>                 URL: https://issues.apache.org/jira/browse/SPARK-20128
>             Project: Spark
>          Issue Type: Test
>          Components: Spark Core, Tests
>    Affects Versions: 2.2.0
>            Reporter: Imran Rashid