[jira] [Commented] (SPARK-1453) Improve the way Spark on Yarn waits for executors before starting
[ https://issues.apache.org/jira/browse/SPARK-1453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968374#comment-13968374 ]

Thomas Graves commented on SPARK-1453:
--------------------------------------

Actually there are two timeouts: the one you mention, which is a max, and one in YarnClusterScheduler, which I think is basically always another 3 seconds.

Ideally I think this should be (b) by default, possibly with the option for (c), where (c) means "I want x%, but if I don't get it within a certain amount of time, go ahead and run, because I know my application will run OK with fewer resources (just not optimally)."

I don't see a reason to do (d). If you have submitted your application, then you want something to run; if it exits, then you have wasted all that time waiting. I would rather the user just kill it if they have SLAs that tight, or they should get their own queue or reconfigure their queue. I'd be OK with adding the option for (d) if some power users want it, but I think (b) is the best default behavior for most normal users. If possible, we should also tell the user why it's waiting.

> Improve the way Spark on Yarn waits for executors before starting
> -----------------------------------------------------------------
>
>                 Key: SPARK-1453
>                 URL: https://issues.apache.org/jira/browse/SPARK-1453
>             Project: Spark
>          Issue Type: Improvement
>          Components: YARN
>    Affects Versions: 1.0.0
>            Reporter: Thomas Graves
>            Assignee: Thomas Graves
>
> Currently Spark on Yarn just delays a few seconds between when the Spark context is initialized and when it allows the job to start. On a busy Hadoop cluster it might take longer to get the requested number of executors. At the very least we could make this timeout a configurable value; it's currently hardcoded to 3 seconds. Better yet would be to allow the user to give a minimum number of executors to wait for, but that looks much more complex.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
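The "(b) by default, with (c) as an option" behavior described above amounts to a simple poll loop: block until some minimum fraction of the requested executors has registered, or until a deadline passes, then start anyway. A minimal sketch of that idea (the function name, `registered_executors` callback, and parameters are hypothetical illustrations, not Spark's actual API or configuration keys):

```python
import time

def wait_for_executors(registered_executors, requested, min_fraction,
                       max_wait_s, poll_interval_s=0.2):
    """Option (c): wait until at least min_fraction of the requested
    executors have registered, or until max_wait_s elapses, then let
    the job proceed with whatever it has.

    registered_executors: zero-arg callable returning the current count
    (hypothetical stand-in for asking the scheduler backend).
    """
    needed = int(requested * min_fraction)
    deadline = time.monotonic() + max_wait_s
    while time.monotonic() < deadline:
        if registered_executors() >= needed:
            return True   # minimum reached: start the job now
        time.sleep(poll_interval_s)
    return False          # timed out: start anyway, just sub-optimally
```

With `min_fraction=1.0` this degenerates to option (a) (wait for everything, bounded by the timeout); the return value lets the caller log why the job is starting, which addresses the "tell the user why it's waiting" point.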
[jira] [Commented] (SPARK-1453) Improve the way Spark on Yarn waits for executors before starting
[ https://issues.apache.org/jira/browse/SPARK-1453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968390#comment-13968390 ]

Mridul Muralidharan commented on SPARK-1453:
--------------------------------------------

(d) becomes relevant in the case of headless/cron'ed jobs. If the job is user-initiated, then I agree, the user would typically kill and restart the job.
[jira] [Commented] (SPARK-1453) Improve the way Spark on Yarn waits for executors before starting
[ https://issues.apache.org/jira/browse/SPARK-1453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13967193#comment-13967193 ]

Mridul Muralidharan commented on SPARK-1453:
--------------------------------------------

The timeout gets hit only when we don't get the requested executors, right? So it is more like a max timeout (controlled by the number of times we loop, IIRC).

The reason for keeping it stupid was simply that we have no guarantees of the number of containers that might be available to Spark in a busy cluster: at times it might not be practically possible to get even a fraction of the requested nodes (either due to a busy cluster or a lack of resources, hence an infinite wait). Ideally, I should have exposed the number of containers allocated, so that user code could at least use it as an SPI and decide how to proceed in more complex cases. Missed out on this one.

I am not sure which use cases make sense:

a) Wait for X seconds or until the requested containers are allocated.
b) Wait until a minimum of Y containers is allocated (out of X requested).
c) (b) combined with (a), that is, a minimum container count with a timeout on it.
d) (c) with an exit if the minimum containers are not allocated.

(d) is something I keep hitting (if I don't get my required minimum of nodes and the job proceeds, I usually end up bringing down those nodes :-( ).
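The four options (a)-(d) enumerated in the thread differ only in the target count and in what happens on timeout, so they could plausibly be folded into one wait routine. A hedged sketch under that assumption (all names and parameters are invented for illustration; Spark's real scheduler code is Scala and does not expose this function):

```python
import time

class InsufficientExecutorsError(RuntimeError):
    """Raised under option (d) when the minimum is not met in time."""

def wait_policy(registered, requested, min_count=None,
                max_wait_s=3.0, exit_on_shortfall=False,
                poll_interval_s=0.2):
    """One loop covering options (a)-(d) from the comments above.

    (a) min_count=None: wait up to max_wait_s for all requested
        containers.
    (b)/(c) min_count=Y: wait until Y of the requested containers are
        allocated, bounded by max_wait_s.
    (d) exit_on_shortfall=True: fail instead of proceeding short-handed.

    registered: zero-arg callable returning the current container count.
    Returns the count seen when the job is allowed to start.
    """
    target = requested if min_count is None else min_count
    deadline = time.monotonic() + max_wait_s
    while time.monotonic() < deadline:
        if registered() >= target:
            return registered()
        time.sleep(poll_interval_s)
    if exit_on_shortfall:
        raise InsufficientExecutorsError(
            f"only {registered()} of {target} containers allocated")
    return registered()  # proceed anyway with fewer containers
```

Returning the observed count also covers Mridul's point about exposing the number of containers allocated, so user code can decide how to proceed in more complex cases.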