[ https://issues.apache.org/jira/browse/SPARK-21157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16061582#comment-16061582 ]
Jose Soltren commented on SPARK-21157:
--------------------------------------

Hello Marcelo - It's true, the design doc doesn't discuss flights very well. Let me give my thoughts on them here, and I'll propagate this back to the design doc at a later time.

First off, I spent a while trying to come up with a catchy name for "a period of time during which the number of stages in execution is constant", and arrived at "flight". If you have a better term, I would love to hear it. Let's stick with "flight" for now.

After giving it some thought, I don't think the end user needs to care about flights at all. Here's what I think the user does care about: seeing some metrics (min/max/mean/stdev) for how different types of memory are consumed during a particular stage.

I haven't worked out the details of the data store yet, but what I envision is a store of key-value pairs, where the key is the start time of a particular flight, and the values are the metrics associated with that flight along with the stages that were running for the duration of that flight. Then, for a particular stage, we would be able to query all of the flights during which the stage was active, get min/max/mean/stdev metrics for each of those flights, and aggregate them to get total metrics for that stage. These total metrics for the stage would be shown in the Stages UI. Of course, with this data store, you could also directly query statistics for a particular flight.

Note that there is no precise way to determine the memory used by a particular stage at a given time unless it was the only stage active in that flight. If memory usage per stage were constant, then we could possibly impute the memory usage of a single stage from all of its flight statistics. Since that is not feasible, the UI would make clear that these are total memory metrics for the executors while the stage was running, not metrics specific to that stage.
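To make the data-store idea concrete, here is a minimal Python sketch of a flight-keyed store and a per-stage aggregation query. The `Flight` type, the `stage_metrics` helper, and the sample values are all hypothetical illustrations, not part of Spark:

```python
# Sketch of a flight-keyed store: key = flight start time, value = the
# stages active in that flight plus the executor memory samples taken
# during it. All names and values here are illustrative only.
from dataclasses import dataclass
from statistics import mean, stdev

@dataclass
class Flight:
    start_time: int        # key: timestamp at which this flight began
    active_stages: set     # stage IDs running for the flight's duration
    memory_samples: list   # executor memory readings (bytes) in this flight

def stage_metrics(store, stage_id):
    """Aggregate min/max/mean/stdev over every flight in which the
    given stage was active - the 'total metrics' shown per stage."""
    samples = [s for f in store.values()
                 if stage_id in f.active_stages
                 for s in f.memory_samples]
    return {
        "min": min(samples),
        "max": max(samples),
        "mean": mean(samples),
        "stdev": stdev(samples) if len(samples) > 1 else 0.0,
    }

# Two consecutive flights: stage 1 spans both, stage 2 only the second.
store = {
    100: Flight(100, {1}, [512, 640]),
    200: Flight(200, {1, 2}, [768, 896]),
}
```

Querying `stage_metrics(store, 1)` folds in both flights, while `store[200]` gives the statistics for a single flight directly; note that stage 1's numbers also include memory used by stage 2 during their shared flight, which is exactly the imprecision described above.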
Even this should be enough for an end user to do some detective work and determine which stage is hogging memory. I glossed over some of these details since I thought they were well covered in SPARK-9103.

I hope this clarifies things somewhat. If not, please let me know how I can clarify this further. Cheers.

> Report Total Memory Used by Spark Executors
> -------------------------------------------
>
>                 Key: SPARK-21157
>                 URL: https://issues.apache.org/jira/browse/SPARK-21157
>             Project: Spark
>          Issue Type: Improvement
>          Components: Input/Output
>    Affects Versions: 2.1.1
>            Reporter: Jose Soltren
>         Attachments: TotalMemoryReportingDesignDoc.pdf
>
> Building on some of the core ideas of SPARK-9103, this JIRA proposes tracking
> total memory used by Spark executors, and a means of broadcasting,
> aggregating, and reporting memory usage data in the Spark UI.
> Here, "total memory used" refers to memory usage that is visible outside of
> Spark, to an external observer such as YARN, Mesos, or the operating system.
> The goal of this enhancement is to give Spark users more information about
> how Spark clusters are using memory. Total memory will include non-Spark JVM
> memory and all off-heap memory.
> Please consult the attached design document for further details.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
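As a rough illustration of "memory visible to an external observer", here is a sketch using Python's standard-library `resource` module to read the process's peak resident set size, the figure the operating system accounts against the process regardless of how it was allocated. The helper name is mine, not a Spark API:

```python
# Hedged sketch: read this process's peak RSS the way the OS sees it,
# via the stdlib resource module. get_peak_rss_kib is an illustrative
# helper name, not part of Spark.
import resource
import sys

def get_peak_rss_kib():
    """Return the peak resident set size of this process in KiB.

    ru_maxrss is reported in KiB on Linux but in bytes on macOS,
    so normalize to KiB.
    """
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if sys.platform == "darwin":
        peak //= 1024
    return peak
```

A per-executor reading like this, sampled once per flight and shipped back to the driver, is the kind of raw data the store sketched earlier would aggregate; it naturally covers non-Spark JVM memory and off-heap allocations, since the OS counts them all.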