[jira] [Commented] (SPARK-2064) web ui should not remove executors if they are dead

2014-06-07 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14020789#comment-14020789
 ] 

Mridul Muralidharan commented on SPARK-2064:


Depending on how long a job runs, this can cause OOM on the master.
In YARN (and Mesos?), an executor on the same node gets a different port if it 
is relaunched after a failure, and so ends up as a different executor in the list.

 web ui should not remove executors if they are dead
 ---

 Key: SPARK-2064
 URL: https://issues.apache.org/jira/browse/SPARK-2064
 Project: Spark
 Issue Type: Sub-task
 Reporter: Reynold Xin

 We should always show the list of executors that have ever been connected, 
 and add a status column to mark them as dead if they have been disconnected.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2064) web ui should not remove executors if they are dead

2014-06-07 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14020897#comment-14020897
 ] 

Reynold Xin commented on SPARK-2064:


Is memory really an issue here?

On a 1000-node cluster, let's say we need 1KB to track each executor (should be 
more than enough); then we need 1MB to track all of them. In less than 100MB, 
we can crash and restart all of them 100 times.

If it really becomes a problem, perhaps we can clean up dead ones after a 
certain time period.
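
The estimate above can be checked with a few lines of arithmetic. This is only a sketch of the reasoning in the comment: the 1KB-per-executor figure is the comment's own guess, and one executor per node is an assumption added here for illustration. Nothing below is measured.

```python
# Back-of-envelope estimate of the memory needed to retain executor records.
# All inputs are assumptions taken from the comment, not measurements.
BYTES_PER_EXECUTOR = 1024  # assumed 1 KB per tracked executor
EXECUTORS = 1000           # assumed one executor per node on a 1000-node cluster
RESTARTS = 100             # every executor crashes and restarts 100 times


def retained_bytes(executors: int, generations: int, bytes_per_executor: int) -> int:
    """Bytes needed if every incarnation of every executor stays in the list."""
    return executors * generations * bytes_per_executor


MB = 1024 * 1024
one_generation_mb = retained_bytes(EXECUTORS, 1, BYTES_PER_EXECUTOR) / MB
all_generations_mb = retained_bytes(EXECUTORS, RESTARTS, BYTES_PER_EXECUTOR) / MB

print(f"one generation:  {one_generation_mb:.2f} MB")
print(f"{RESTARTS} generations: {all_generations_mb:.2f} MB")
```

Under these assumptions the total stays just under 100MB; a larger per-executor record or more churn scales the figure linearly.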



[jira] [Commented] (SPARK-2064) web ui should not remove executors if they are dead

2014-06-07 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14020936#comment-14020936
 ] 

Mridul Muralidharan commented on SPARK-2064:


It is 100 MB (or more) of memory which could be used elsewhere.
In our clusters, for example, the number of workers can be very high while the 
containers can be quite ephemeral when under load (and so there are a lot of 
container losses); on the other hand, memory per container is constrained to 
about 8 GB (lower when we account for overheads, etc.).

So the amount of working memory in the master shrinks: we are finding that the 
UI and related code paths are among the portions that seem to occupy a lot of 
memory in the OOM dumps of the master.



[jira] [Commented] (SPARK-2064) web ui should not remove executors if they are dead

2014-06-07 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14020976#comment-14020976
 ] 

Patrick Wendell commented on SPARK-2064:


I don't think OOM is an issue here, but I think this used to be the behavior, 
and users requested that we clean up the old executors because otherwise, for a 
long-running service, you get a really large list. Maybe we should have a 
timeout.
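
One way to picture the timeout idea is the sketch below. It is not Spark code: `ExecutorList`, `dead_ttl_seconds`, and the method names are hypothetical, chosen only to show dead executors staying visible with a status and being dropped once they have been dead longer than a configured period.

```python
import time
from typing import Dict, List, Optional, Tuple


class ExecutorList:
    """Hypothetical registry that keeps dead executors visible for a limited time."""

    def __init__(self, dead_ttl_seconds: float) -> None:
        self.dead_ttl_seconds = dead_ttl_seconds
        # executor id -> time of death, or None while the executor is alive
        self._died_at: Dict[str, Optional[float]] = {}

    def connected(self, executor_id: str) -> None:
        self._died_at[executor_id] = None

    def disconnected(self, executor_id: str, now: Optional[float] = None) -> None:
        # Mark the executor as dead instead of removing it from the list.
        if executor_id in self._died_at:
            self._died_at[executor_id] = time.time() if now is None else now

    def expire(self, now: Optional[float] = None) -> None:
        # Drop executors that have been dead for longer than the timeout.
        now = time.time() if now is None else now
        cutoff = now - self.dead_ttl_seconds
        self._died_at = {
            eid: died
            for eid, died in self._died_at.items()
            if died is None or died >= cutoff
        }

    def rows(self) -> List[Tuple[str, str]]:
        # One (id, status) row per executor, as a UI table might show it.
        return [(eid, "alive" if died is None else "dead")
                for eid, died in self._died_at.items()]


executors = ExecutorList(dead_ttl_seconds=3600)
executors.connected("exec-1")
executors.connected("exec-2")
executors.disconnected("exec-1", now=0.0)

executors.expire(now=1800.0)       # 30 minutes later: still within the timeout
after_30_min = executors.rows()
executors.expire(now=7200.0)       # 2 hours later: past the timeout
after_2_hours = executors.rows()
print(after_30_min)
print(after_2_hours)
```

The explicit `now` argument exists only so the example is deterministic; a real implementation would more likely run the expiry on a timer.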



[jira] [Commented] (SPARK-2064) web ui should not remove executors if they are dead

2014-06-07 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14021008#comment-14021008
 ] 

Mridul Muralidharan commented on SPARK-2064:


Unfortunately, OOM is a very big issue for us, since the application master is 
a single point of failure when running in YARN.
This is particularly true when memory is constrained and vigorously enforced by 
the YARN containers (which requires higher overheads to be specified, reducing 
usable memory even further).

Given this, and given the fair amount of churn already seen for executor 
containers, I am hesitant about features that add even more to the UI's memory 
footprint. The cumulative impact of the UI is nontrivial, as I mentioned 
before. This feature, for example, would require 1-8% of master memory when 
there is reasonable churn for long-running jobs (30 hours) on a reasonable 
number of executors (200-300).




[jira] [Commented] (SPARK-2064) web ui should not remove executors if they are dead

2014-06-07 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14021011#comment-14021011
 ] 

Mridul Muralidharan commented on SPARK-2064:


I am probably missing the intent behind this change.
What is the expected use case it is supposed to help with?



[jira] [Commented] (SPARK-2064) web ui should not remove executors if they are dead

2014-06-07 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14021018#comment-14021018
 ] 

Reynold Xin commented on SPARK-2064:


For one thing, it helps identify executors that are dead, which is often 
important for debugging (finding out why they died: maybe a full disk left the 
system unresponsive, etc.). It is often also very useful information to have 
for spot instances on EC2, where executors might just die.

If memory is the problem, we can cap the number of dead executors the UI 
tracks; alternatively, we can put the list of dead executors onto external 
storage (a SQLite database or even just a text file in the log directory).
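
The capping option could look roughly like the sketch below. It is an illustration only, not Spark's implementation: `CappedExecutorList`, `max_dead`, and the addresses are hypothetical. It relies on Python's `collections.deque` with `maxlen`, which discards the oldest entry once the bound is reached, so memory for dead executors stays constant however much churn there is.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple


class CappedExecutorList:
    """Hypothetical registry that remembers at most `max_dead` dead executors."""

    def __init__(self, max_dead: int) -> None:
        self._alive: Dict[str, str] = {}  # executor id -> host:port
        # A deque with maxlen silently evicts the oldest entry when full.
        self._dead: Deque[Tuple[str, str]] = deque(maxlen=max_dead)

    def connected(self, executor_id: str, address: str) -> None:
        self._alive[executor_id] = address

    def disconnected(self, executor_id: str) -> None:
        # Move the executor from the alive map to the bounded dead list.
        address = self._alive.pop(executor_id, None)
        if address is not None:
            self._dead.append((executor_id, address))

    def rows(self) -> List[Tuple[str, str, str]]:
        # Alive executors first, then the retained dead ones, oldest first.
        alive = [(eid, addr, "alive") for eid, addr in self._alive.items()]
        dead = [(eid, addr, "dead") for eid, addr in self._dead]
        return alive + dead


executors = CappedExecutorList(max_dead=2)
for i in range(1, 5):
    executors.connected(f"exec-{i}", f"node-{i}:7077")  # made-up addresses
for i in range(1, 4):
    executors.disconnected(f"exec-{i}")

table = executors.rows()
for row in table:
    print(row)
```

With a cap of two, the first executor to die ("exec-1") is no longer listed after the third death. Spilling evicted entries to a file or database, the other option mentioned above, would replace the silent eviction with a write.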
