Sean R. Owen created SPARK-29795:
------------------------------------

             Summary: Possible 'leak' of Metrics with dropwizard metrics 4.x
                 Key: SPARK-29795
                 URL: https://issues.apache.org/jira/browse/SPARK-29795
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 3.0.0
            Reporter: Sean R. Owen
            Assignee: Sean R. Owen


This one's a little complex to explain.

SPARK-29674 updated dropwizard metrics to 4.x, for Spark 3 only. That appears 
to be fine, according to tests. We have not backported it to 2.4.x and do not 
intend to.

However, I'm separately working with a few people trying to backport this 
change to Spark 2.4.x. When that update is applied, tests fail readily with 
OutOfMemoryError, typically around ExternalShuffleServiceSuite in core. A heap 
dump analysis shows that MetricRegistry objects are retaining a gigabyte or 
more of memory.

The registry appears to be holding references to many large internal Spark 
objects, like BlockManager and Netty objects, via the closures we pass to 
Gauge objects. Although it looks odd, this may or may not be an issue; in 
normal usage, where a JVM hosts one SparkContext, this may be normal.
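
To illustrate the pattern (a made-up Source and metric name for illustration, 
not Spark's actual code): a metrics Source registers a Gauge whose closure 
captures a heavyweight object, so the MetricRegistry transitively retains that 
object for as long as the registry itself is reachable:

{code:scala}
import com.codahale.metrics.{Gauge, MetricRegistry}

// Hypothetical sketch of the Gauge-registration pattern. The anonymous
// Gauge closes over `blockManager`, so the registry -> gauge -> closure
// chain keeps the captured object alive as long as the registry lives.
class ExampleSource(registry: MetricRegistry, blockManager: AnyRef) {
  registry.register(MetricRegistry.name("example", "memUsed"), new Gauge[Long] {
    override def getValue: Long = blockManager.hashCode().toLong // stand-in for a real lookup
  })
}
{code}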

However, in tests where contexts are started and restarted repeatedly, it 
seems like this might 'leak' references to old context-related objects across 
runs via metrics. I don't have a clear theory on how yet (is SparkEnv shared, 
or is some reference held to it?), beyond the empirical evidence. It's also 
not clear why this apparently doesn't affect Spark 3, where tests work fine. 
It could be another fix in Spark 3 that happens to help here, or it could be 
that Spark 3 uses less memory and never hits the issue.
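
The failing pattern is roughly the following; this is a hypothetical repro 
sketch, not any actual suite:

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical repro: start and stop many contexts in one JVM, as test
// suites do. If metrics retain references to each old context's
// BlockManager and friends, heap usage grows with every iteration.
for (_ <- 1 to 100) {
  val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("metrics-leak-repro"))
  sc.parallelize(1 to 1000).count()
  sc.stop()
}
{code}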

Despite that uncertainty, I've found that simply clearing the registered 
metrics from MetricsSystem when it is stop()-ped seems to resolve the issue. At 
this point, Spark is shutting down and sinks have stopped, so there doesn't 
seem to be any harm in manually releasing all registered metrics and objects. I 
don't _think_ metrics are meant to be tracked across two instantiations of a 
SparkContext in the same JVM, but that's an open question.

That's the change I will propose in a PR.
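
Concretely, something like this, sketched against the existing structure of 
MetricsSystem.stop() (removeMatching and MetricFilter.ALL are standard 
dropwizard APIs):

{code:scala}
import com.codahale.metrics.MetricFilter

// Sketch of the proposed change inside org.apache.spark.metrics.MetricsSystem;
// `registry`, `sinks`, and `running` are the class's existing members.
def stop(): Unit = {
  if (running) {
    sinks.foreach(_.stop())
    // Proposed addition: after sinks have stopped, drop every registered
    // metric so the registry no longer pins gauges and the Spark internals
    // (BlockManager, Netty objects, etc.) their closures capture.
    registry.removeMatching(MetricFilter.ALL)
  } else {
    logWarning("Stopping a MetricsSystem that is not running")
  }
  running = false
}
{code}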

Why does this not happen in 2.4 + metrics 3.x? Unclear. We've not seen any 
test failures like this in 2.4, or reports of problems with metrics-related 
memory pressure. It could be a change in how 4.x behaves, tracks objects, or 
manages lifecycles.

The difference does not seem to be Scala 2.11 vs 2.12, by the way. 2.4 works 
fine on both without the 4.x update, and runs out of memory on both with it.

Why do this if it only affects 2.4 + metrics 4.x and we're not moving to 
metrics 4.x in 2.4? It could still be a smaller issue in Spark 3 that tests 
don't detect. It may help apps that, for various reasons, run multiple 
SparkContexts per JVM, like the test suites of many other projects. And it may 
just be good hygiene to manually release resources at shutdown.

Therefore I can't call this a bug per se; I'm filing it as an improvement.



