[jira] [Commented] (SPARK-2064) web ui should not remove executors if they are dead
[ https://issues.apache.org/jira/browse/SPARK-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14020789#comment-14020789 ]

Mridul Muralidharan commented on SPARK-2064:

Depending on how long a job runs, this can cause an OOM on the master. In YARN (and Mesos?), an executor relaunched on the same node after a failure gets a different port, and so ends up as a different executor in the list.

web ui should not remove executors if they are dead
---
Key: SPARK-2064
URL: https://issues.apache.org/jira/browse/SPARK-2064
Project: Spark
Issue Type: Sub-task
Reporter: Reynold Xin

We should always show the list of executors that have ever been connected, and add a status column to mark them as dead if they have been disconnected.

--
This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2064) web ui should not remove executors if they are dead
[ https://issues.apache.org/jira/browse/SPARK-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14020897#comment-14020897 ]

Reynold Xin commented on SPARK-2064:

Is memory really an issue here? On a 1000-node cluster, let's say we need 1KB to track each executor (should be more than enough); then we need 1MB to track all of them. In less than 100MB, we can crash-restart all of them 100 times. If it really becomes a problem, perhaps we can clean up dead ones after a certain time period.
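The estimate in this comment can be checked with a few lines of arithmetic. This is only a sketch: the 1KB-per-executor figure, the 1000 nodes, and the 100 restarts are the comment's own assumptions rather than measurements, and the function name is hypothetical.

```python
# Back-of-the-envelope check of the estimate in the comment.
# All figures are assumptions taken from the comment, not measured values.
BYTES_PER_EXECUTOR = 1024   # "1KB to track each executor"
NODES = 1000                # "1000 node cluster"
RESTARTS = 100              # every executor crash-restarted 100 times


def tracking_bytes(nodes, restarts, bytes_per_executor=BYTES_PER_EXECUTOR):
    """Memory needed if every launched executor keeps one entry forever."""
    return nodes * restarts * bytes_per_executor


one_generation = tracking_bytes(NODES, 1)
hundred_generations = tracking_bytes(NODES, RESTARTS)

MIB = 1024 ** 2
print(f"one generation:      {one_generation / MIB:.2f} MiB")
print(f"hundred generations: {hundred_generations / MIB:.2f} MiB")
```

Under these assumptions the totals come to just under 1 MiB and just under 98 MiB, which matches the "less than 100MB" figure. The result scales linearly with the per-executor size, so a larger real-world entry would raise it proportionally.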
[jira] [Commented] (SPARK-2064) web ui should not remove executors if they are dead
[ https://issues.apache.org/jira/browse/SPARK-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14020936#comment-14020936 ]

Mridul Muralidharan commented on SPARK-2064:

It is 100 MB (or more) of memory which could be used elsewhere. In our clusters, for example, the number of workers can be very high while the containers can be quite ephemeral under load (and so there are a lot of container losses); on the other hand, memory per container is constrained to about 8 GB (lower when we account for overheads, etc.). So the amount of working memory in the master shrinks: we are finding that the UI and related code paths are among the portions that occupy a lot of memory in the OOM dumps of the master.
[jira] [Commented] (SPARK-2064) web ui should not remove executors if they are dead
[ https://issues.apache.org/jira/browse/SPARK-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14020976#comment-14020976 ]

Patrick Wendell commented on SPARK-2064:

I don't think OOM is an issue here, but I think this used to be the behavior, and users requested that we clean up the old executors because otherwise, for a long-running service, you get a really large list. Maybe we should have a timeout.
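One way the timeout idea could look is sketched below. This is not Spark code: the class `ExecutorTable`, its methods, and the `dead_ttl_s` parameter are all hypothetical names chosen for illustration. Dead executors stay visible with a DEAD status until they have been dead for longer than the timeout, after which they are dropped.

```python
import time


class ExecutorTable:
    """Hypothetical tracker: dead executors are kept for `dead_ttl_s` seconds."""

    def __init__(self, dead_ttl_s, clock=time.monotonic):
        self.dead_ttl_s = dead_ttl_s
        self.clock = clock      # injectable so the timeout can be tested
        self.alive = set()
        self.dead = {}          # executor id -> time it was marked dead

    def connect(self, executor_id):
        self.alive.add(executor_id)
        self.dead.pop(executor_id, None)

    def disconnect(self, executor_id):
        # Only executors that were known to be alive are recorded as dead.
        if executor_id in self.alive:
            self.alive.remove(executor_id)
            self.dead[executor_id] = self.clock()

    def prune(self):
        """Drop executors that have been dead for longer than the timeout."""
        cutoff = self.clock() - self.dead_ttl_s
        self.dead = {e: t for e, t in self.dead.items() if t >= cutoff}

    def rows(self):
        """Rows for a UI table: alive executors first, then recent dead ones."""
        self.prune()
        return ([(e, "ALIVE") for e in sorted(self.alive)]
                + [(e, "DEAD") for e in sorted(self.dead)])
```

Pruning on read keeps the sketch free of background threads; a long-running service that is rarely viewed might prefer to prune on `disconnect` as well, so the dead list cannot grow between page loads.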
[jira] [Commented] (SPARK-2064) web ui should not remove executors if they are dead
[ https://issues.apache.org/jira/browse/SPARK-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14021008#comment-14021008 ]

Mridul Muralidharan commented on SPARK-2064:

Unfortunately OOM is a very big issue for us, since the application master is a single point of failure when running in YARN, particularly when memory is constrained and vigorously enforced by the YARN containers (requiring higher overheads to be specified, reducing usable memory even further). Given this, and given the fair churn already for executor containers, I am hesitant about features which add to the memory footprint of the UI even further. The cumulative impact of the UI is nontrivial, as I mentioned before. This, for example, would require 1-8% of master memory when there is reasonable churn for long-running jobs (30 hours) on a reasonable number of executors (200-300).
[jira] [Commented] (SPARK-2064) web ui should not remove executors if they are dead
[ https://issues.apache.org/jira/browse/SPARK-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14021011#comment-14021011 ]

Mridul Muralidharan commented on SPARK-2064:

I am probably missing the intent behind this change. What is the expected use case it is supposed to help with?
[jira] [Commented] (SPARK-2064) web ui should not remove executors if they are dead
[ https://issues.apache.org/jira/browse/SPARK-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14021018#comment-14021018 ]

Reynold Xin commented on SPARK-2064:

One thing is that we can help identify executors that are dead, which is often important for debugging (finding out why they are dead: maybe disk space is full, making the system unresponsive, etc.). It is often also very useful information to have for spot instances on EC2, where executors might just die. If memory is the problem, we can cap the number of dead executors the UI tracks; alternatively, we can put the list of dead executors onto external storage (a SQLite database or even just a text file in the log directory).
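The first option in this comment, capping the number of dead executors the UI tracks, might be sketched as follows. The class name `CappedDeadExecutors`, the `max_dead` parameter, and the `reason` field are hypothetical and only illustrate the idea; nothing here is taken from the Spark code base.

```python
from collections import OrderedDict


class CappedDeadExecutors:
    """Hypothetical bounded record of dead executors, oldest evicted first."""

    def __init__(self, max_dead):
        if max_dead < 0:
            raise ValueError("max_dead must be non-negative")
        self.max_dead = max_dead
        self._dead = OrderedDict()   # executor id -> reason, in order of death

    def mark_dead(self, executor_id, reason):
        # Re-inserting moves an already-known id to the newest position.
        self._dead.pop(executor_id, None)
        self._dead[executor_id] = reason
        # Evict the oldest entries once the cap is exceeded.
        while len(self._dead) > self.max_dead:
            self._dead.popitem(last=False)

    def snapshot(self):
        """Dead executors from oldest to newest, for rendering in a table."""
        return list(self._dead.items())
```

With a cap, the memory used for dead executors is bounded by `max_dead` entries regardless of how long the job runs or how often containers are lost, at the cost of losing the oldest failures from the list.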