[ https://issues.apache.org/jira/browse/SPARK-25889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16669384#comment-16669384 ]
DB Tsai commented on SPARK-25889:
---------------------------------
The problem sounds valid, and the solution could work. Since job scheduling is not my area of expertise, pinging [~vanzin] for expert input. Thanks for the write-up.

> Dynamic allocation load-aware ramp up
> -------------------------------------
>
>                 Key: SPARK-25889
>                 URL: https://issues.apache.org/jira/browse/SPARK-25889
>             Project: Spark
>          Issue Type: New Feature
>          Components: Scheduler, YARN
>    Affects Versions: 2.3.2
>            Reporter: Adam Kennedy
>            Priority: Major
>
> The time-based exponential ramp-up behavior for dynamic allocation is naive
> and destructive, making it very difficult to run very large jobs.
> On a large (64,000 core) YARN cluster with a high number of input partitions
> (200,000+), the default dynamic allocation approach of requesting containers
> in waves, doubling exponentially once per second, results in 50% of the
> entire cluster being requested in the final one-second wave.
> This can easily overwhelm RPC processing, or cause expensive executor startup
> steps to break systems. With the interval so short, many additional
> containers may be requested beyond what is actually needed, and then complete
> very little work before sitting around waiting to be deallocated.
> Lengthening the time between these fixed doublings has only limited impact.
> Setting the doubling interval to once per minute would result in a very slow
> ramp-up, at the end of which we still face large, potentially crippling waves
> of executor startup.
> An alternative approach to spooling up large jobs appears to be needed, one
> that is still relatively simple but more adaptable to different cluster sizes
> and differing cluster and job performance.
> I would like to propose a few different approaches based around the general
> idea of controlling outstanding requests for new containers based on the
> number of executors that are currently running, for some definition of
> "running".
> One example might be to limit requests to one new executor for every existing
> executor that currently has an active task, or some ratio of that, to allow
> for more or less aggressive spool-up. A lower ratio would let us approximate
> something like Fibonacci ramp-up; a higher ratio of, say, 2x would spool up
> quickly, but still in line with the rate at which broadcast blocks can be
> distributed to new members.
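
A minimal Scala sketch of the ratio-based idea described above, assuming hypothetical names (LoadAwareRampUp, targetOutstanding, busyExecutors, rampUpRatio) that are not existing Spark APIs or configuration keys; it only illustrates capping new container requests by the number of executors that currently hold active tasks, rather than doubling on a timer.

// Hypothetical sketch of a load-aware ramp-up policy; illustrative only.
object LoadAwareRampUp {

  // Cap the total outstanding executors (running + newly requested) using
  // the number of executors that currently hold at least one active task,
  // scaled by a configurable ratio.
  def targetOutstanding(
      runningExecutors: Int,   // executors currently registered
      busyExecutors: Int,      // executors currently running at least one task
      maxNeeded: Int,          // executors needed to cover all pending tasks
      rampUpRatio: Double): Int = {
    // Always allow at least one new request so a cold start can make progress.
    val allowedNew = math.max(1, (busyExecutors * rampUpRatio).toInt)
    math.min(maxNeeded, runningExecutors + allowedNew)
  }
}

// Example: with 100 busy executors out of 120 running and a ratio of 1.0,
// at most 100 new containers are requested in this round (target = 220),
// instead of doubling the outstanding total every second regardless of load.
// LoadAwareRampUp.targetOutstanding(120, 100, 200000, 1.0)  // => 220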