[jira] [Commented] (SPARK-12514) Spark MetricsSystem can fill disks/cause OOMs when using GangliaSink

Jonathan Kelly (JIRA) Wed, 03 Feb 2016 16:28:55 -0800

    [ https://issues.apache.org/jira/browse/SPARK-12514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15131439#comment-15131439 ]


Jonathan Kelly commented on SPARK-12514:
----------------------------------------

As of Spark 1.6.0, there don't seem to be *any* Spark metrics that are not 
prefixed by the YARN application ID, so "filter out application specific 
metrics" basically means "don't use Ganglia", right? I kid, but doesn't this 
make Spark+Ganglia integration pretty useless because Ganglia can't scale to 
the number of unique metrics that Spark is generating?

(I say "as of Spark 1.6.0" because before Spark 1.6.0 the DAGScheduler metrics 
were not prefixed by the YARN application ID. That appears to have been a bug, 
which was fixed in Spark 1.6.0 by 
https://issues.apache.org/jira/browse/SPARK-11828.)
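
To make the point concrete, here is a minimal Python sketch; it is not Spark or Ganglia code. The regular expression is an assumption inferred from the one metric name quoted in the issue description below, and the second name in the list is hypothetical. If every name carries an application ID prefix, filtering out application-specific metrics leaves nothing to report.

```python
import re

# Assumed pattern for a YARN application ID prefix, inferred from the single
# example name quoted in the issue (application_<timestamp>_<sequence>.).
APP_ID_PREFIX = re.compile(r"^application_\d+_\d+\.")


def is_application_specific(metric_name: str) -> bool:
    """Return True if the metric name starts with a YARN application ID."""
    return APP_ID_PREFIX.match(metric_name) is not None


def filter_application_specific(metric_names):
    """Keep only the names that are not prefixed by an application ID."""
    return [n for n in metric_names if not is_application_specific(n)]


names = [
    # Name quoted in the issue description.
    "application_1450753701508_0001.driver."
    "ExecutorAllocationManager.executors.numberAllExecutors",
    # Hypothetical second application reporting the same metric.
    "application_1450753701508_0002.driver."
    "ExecutorAllocationManager.executors.numberAllExecutors",
]

# Every name is application-specific, so nothing survives the filter.
remaining = filter_application_specific(names)
```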

> Spark MetricsSystem can fill disks/cause OOMs when using GangliaSink
> --------------------------------------------------------------------
>
>                 Key: SPARK-12514
>                 URL: https://issues.apache.org/jira/browse/SPARK-12514
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.5.2
>            Reporter: Aaron Tokhy
>            Priority: Minor
>
> The MetricsSystem implementation in Spark generates unique metric names for 
> each Spark application that is submitted (to a YARN cluster, for 
> example).  This can be problematic for certain metrics environments, like 
> Ganglia.
> This creates metric names that look like the following (for each submitted 
> application):
> application_1450753701508_0001.driver.ExecutorAllocationManager.executors.numberAllExecutors
>  
> On Spark clusters where thousands of applications are submitted, these 
> per-application metrics will eventually cause Ganglia daemons to reach their 
> memory limits (gmond) or to run out of disk space (gmetad), because some 
> existing metrics systems do not expect new metric names to be generated 
> during the lifetime of a cluster.
> Ganglia as a Spark metrics sink is one example of where the current 
> implementation can run into problems.  Each application introduces a new 
> set of metrics in gmond and gmetad.  This causes the gmond aggregator's 
> memory usage to bloat over time, and causes gmetad to create a new set of 
> RRD (round robin database) files for every application.  These files are 
> permanent, so they are never cleaned up.
> So the MetricsSystem may need to account for metrics sinks that have problems 
> with the introduction of new metrics, and buildRegistryName would have to 
> behave differently in this case.
> https://github.com/apache/spark/blob/d83c2f9f0b08d6d5d369d9fae04cdb15448e7f0d/core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala#L126
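
The growth described above can be sketched in a few lines of Python. This is a hypothetical model, not Spark's buildRegistryName: the function signature, the include_app_id flag, and the generated application IDs are assumptions for illustration only.

```python
def build_registry_name(app_id, executor_id, source_name, include_app_id=True):
    """Hypothetical model of name building; not Spark's actual implementation."""
    parts = [executor_id, source_name]
    if include_app_id:
        # Prefixing with the application ID makes the name unique per application.
        parts.insert(0, app_id)
    return ".".join(parts)


def unique_names(num_apps, include_app_id):
    """Collect the distinct names one metric gets across num_apps applications."""
    source = "ExecutorAllocationManager.executors.numberAllExecutors"
    return {
        build_registry_name(
            f"application_1450753701508_{i:04d}",  # made-up application IDs
            "driver",
            source,
            include_app_id,
        )
        for i in range(1, num_apps + 1)
    }


with_prefix = unique_names(1000, include_app_id=True)
without_prefix = unique_names(1000, include_app_id=False)
```

With the prefix, 1000 applications produce 1000 distinct names for a single metric; without it they all share one name. The trade-off is that a shared name would merge values from different applications into one series.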



