[ https://issues.apache.org/jira/browse/SPARK-9103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16181543#comment-16181543 ]
Imran Rashid commented on SPARK-9103:
-------------------------------------

Hi [~jerryshao], sorry it took me a little while to respond. I think having the info available in the metrics system is great, but I see two different types of shortcomings:

1) MetricsSystem / Graphite etc. are great, but it's really hard to correlate the timeline view you get with what is actually going on in your job. Did the tasks that took the longest also correlate with the tasks that used the most memory? Were there some phases of your application that had pressure on one part of memory (e.g., execution memory) and other phases that had pressure on another part (e.g., user memory)? What was the memory usage for tasks that failed? And for tasks that were slow? It seems *really* hard to answer those questions via Graphite: you have to do mental joins from task to executor to the Graphite view in the right time frame, the rollups aren't exactly right, and so on. (Or maybe there is some sophisticated way to do this in Graphite that I don't know about?) It also seems like something that should be baked into the Spark UI rather than require a 3rd-party tool. Perhaps it's overkill for the Spark UI to handle this (too big a load on the driver for large apps?), but I would like to see if something better is possible.

2) Even within the metrics system, I'm not sure we are capturing everything we need. The most obvious gap to me is the total process memory -- including off-heap memory that isn't part of the JVM-managed memory, whether it's memory managed by Spark or by a 3rd-party lib (e.g. Parquet). I have a feeling there are more things that would be useful to capture, though to be honest I haven't done a full audit of what is currently exposed in the metrics system.
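To illustrate the gap in point 2: JVM memory beans only report heap and JVM-tracked off-heap pools, while the OS-level resident set size (RSS) also includes native allocations made outside the JVM's accounting. A minimal Linux-only sketch of reading RSS from /proc (the `ProcessMemory` class and `residentSetBytes` method are illustrative names for this comment, not existing Spark APIs):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class ProcessMemory {

    // Parse VmRSS from /proc/self/status (Linux only). This is the total
    // resident memory of the whole process, including native/off-heap
    // allocations that JVM heap metrics cannot see. Returns -1 if the
    // field is unavailable (e.g. non-Linux platforms).
    static long residentSetBytes() {
        try {
            List<String> lines =
                    Files.readAllLines(Paths.get("/proc/self/status"));
            for (String line : lines) {
                if (line.startsWith("VmRSS:")) {
                    // Line looks like: "VmRSS:     123456 kB"
                    String[] parts = line.trim().split("\\s+");
                    return Long.parseLong(parts[1]) * 1024L; // kB -> bytes
                }
            }
        } catch (IOException e) {
            // fall through
        }
        return -1L;
    }

    public static void main(String[] args) {
        long heapUsed = Runtime.getRuntime().totalMemory()
                - Runtime.getRuntime().freeMemory();
        System.out.println("JVM heap used: " + heapUsed);
        System.out.println("Process RSS:   " + residentSetBytes());
    }
}
```

The difference between RSS and the JVM-reported figures is roughly the memory that is currently invisible to the metrics system -- which is the kind of number that could be registered as an additional gauge per executor.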
> Tracking spark's memory usage
> -----------------------------
>
>                 Key: SPARK-9103
>                 URL: https://issues.apache.org/jira/browse/SPARK-9103
>             Project: Spark
>          Issue Type: Umbrella
>          Components: Spark Core, Web UI
>            Reporter: Zhang, Liye
>         Attachments: Tracking Spark Memory Usage - Phase 1.pdf
>
> Currently Spark provides only a little memory usage information (RDD cache on the web UI) for the executors. Users have no idea of the memory consumption when they are running Spark applications with a lot of memory used in the executors. Especially when they encounter an OOM, it's really hard to know what the cause of the problem is. So it would be helpful to give out detailed memory consumption information for each part of Spark, so that users can clearly have a picture of where the memory is actually used.
> The memory usage info to expose should include, but is not limited to, shuffle, cache, network, serializer, etc.
> Users can optionally choose to enable this functionality, since it is mainly for debugging and tuning.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)