[ https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16289614#comment-16289614 ]
Thomas Graves commented on SPARK-22683:
---------------------------------------

So the issue brought up here seems to be resource waste vs. run time. I agree that resource waste is an important thing; there are lots of jobs that have very short tasks. It might be nice to update the description to go into more detail on the issue you are seeing rather than talking about your proposed solution.

You mention that fast tasks end up getting run on a small number of executors. That can actually be beneficial to resource usage, right? There is no need to get more executors if the tasks run fast enough on a small number; the downside is that if we have requested more from YARN and are getting them, they are wasted for that startup/shutdown period. The other thing you mention is that this affects shuffle later. You can ask Spark to wait longer for a minimal number of executors to help with the shuffle issue (and also set spark.dynamicAllocation.initialExecutors if it's an early stage), which again adversely affects resource usage. That is another balance point, though, and I think it goes against what you originally asked, unless you actually put in another config that tries to enforce spreading those tasks.

I agree with Sean on the point that this would be a hard thing to optimize for all jobs, long running vs. short. The fact that you are asking for 5+ cores per executor will naturally waste more resources when the executor isn't being used; that is inherent, and until we get to resizing executors (quickly), it will always be an issue. But if we can find something that by default works better for the majority of workloads, it makes sense to improve. As with any config, though, how do I know what to set tasksPerSlot to? It requires configuration, and it could affect performance.

The reason I was told dynamic allocation ramps up exponentially is to allow the quick tasks to run on existing executors before asking for more. You are essentially saying this isn't working well enough. But is it not working well enough because we are doing the exponential ask? What if we asked for everything up front like MR does? I see you made one comment about an executor that started, didn't get many tasks run on it, and then idle timed out, so that might not help here; but the question is whether YARN would give you more containers immediately if you asked for them all first. Were your benchmarks done on a busy cluster or an empty cluster? How fast was your container allocation? Did you hit other user limits, etc.? How many executors you get, and how quickly they are allocated, will be affected by those things.

Above you say: "When running with 6 tasks per executor slot, our Spark jobs consume in average 30% less vcorehours than the MR jobs, this setting being valid for different workload sizes." Was this with this patch applied, or without? In https://issues.apache.org/jira/browse/SPARK-22683?focusedCommentId=16286032&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16286032, for "WallTimeGain wrt MR (%)", do positive numbers mean the jobs ran faster than MR? Why is running with 6 or 8 tasks per slot slower? Is it shuffle issues, mistuned GC, or just unknown overhead?
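For readers following the config discussion above: the settings mentioned are standard Spark dynamic-allocation properties. A minimal Scala sketch of how one might pin a floor of executors to spread early stages across more hosts, as suggested above (the property names are real Spark settings; the values are purely illustrative, not recommendations):

    import org.apache.spark.SparkConf

    // Illustrative values only; tuning depends entirely on the workload.
    val conf = new SparkConf()
      .set("spark.dynamicAllocation.enabled", "true")
      // Floor of executors so that early stages (and later shuffles) are
      // spread across more than a handful of hosts:
      .set("spark.dynamicAllocation.initialExecutors", "20")
      .set("spark.dynamicAllocation.minExecutors", "10")
      // The exponential ramp-up discussed above: requests start after the
      // backlog timeout and keep growing on each sustained-backlog interval.
      .set("spark.dynamicAllocation.schedulerBacklogTimeout", "1s")
      .set("spark.dynamicAllocation.sustainedSchedulerBacklogTimeout", "1s")
      // Idle executors are released after this timeout (the startup/shutdown
      // waste mentioned above):
      .set("spark.dynamicAllocation.executorIdleTimeout", "60s")

As the comment notes, raising the floor trades resource usage for spread, which is the opposite direction from what the proposal is optimizing for.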
> DynamicAllocation wastes resources by allocating containers that will barely
> be used
> -------------------------------------------------------------------------------------
>
>                 Key: SPARK-22683
>                 URL: https://issues.apache.org/jira/browse/SPARK-22683
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.1.0, 2.2.0
>            Reporter: Julien Cuquemelle
>              Labels: pull-request-available
>
> Let's say an executor has spark.executor.cores / spark.task.cpus taskSlots.
> The current dynamic allocation policy allocates enough executors to have each
> taskSlot execute a single task, which minimizes latency but wastes resources
> when tasks are small relative to the executor allocation and idling overhead.
> By adding a tasksPerExecutorSlot setting, it becomes possible to specify how
> many tasks a single slot should ideally execute, to mitigate the overhead of
> executor allocation.
>
> PR: https://github.com/apache/spark/pull/19881
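For reference, the sizing arithmetic implied by the description reads roughly as below. This is a minimal sketch assuming the proposed tasksPerExecutorSlot knob, not the actual implementation, which is in the PR linked above:

    // Sketch of the executor-target arithmetic implied by the proposal.
    def targetExecutors(pendingTasks: Int,
                        executorCores: Int, // spark.executor.cores
                        taskCpus: Int,      // spark.task.cpus
                        tasksPerSlot: Int   // proposed tasksPerExecutorSlot
                       ): Int = {
      val taskSlots = executorCores / taskCpus
      // The current policy is effectively tasksPerSlot = 1: one slot per
      // pending task. With tasksPerSlot = N, each slot is expected to run
      // N tasks back to back, so N times fewer executors get requested.
      math.ceil(pendingTasks.toDouble / (taskSlots * tasksPerSlot)).toInt
    }

    // e.g. 600 pending tasks, 6 cores per executor, 1 cpu per task:
    //   tasksPerSlot = 1  =>  100 executors (current behaviour)
    //   tasksPerSlot = 6  =>  ceil(600 / 36) = 17 executors
    targetExecutors(600, 6, 1, 6) // 17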