[ https://issues.apache.org/jira/browse/SPARK-21084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16048817#comment-16048817 ]

Sean Owen commented on SPARK-21084:
-----------------------------------

These are really resource manager concerns. For example, you can (and should) 
limit the amount of resources a user or group can consume from YARN if you need 
to prevent one user from taking all of the resources. It's even possible to let 
people use more than their limit as long as nobody else needs the capacity.
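As a sketch of what that looks like with the YARN CapacityScheduler (queue 
names and percentages below are made up for illustration), a 
capacity-scheduler.xml fragment caps a queue's guaranteed share while still 
letting it borrow idle capacity:

```xml
<!-- capacity-scheduler.xml: illustrative queue limits (queue names hypothetical) -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>notebooks,batch</value>
</property>
<property>
  <!-- guaranteed share of the cluster for the notebooks queue -->
  <name>yarn.scheduler.capacity.root.notebooks.capacity</name>
  <value>40</value>
</property>
<property>
  <!-- elasticity: the queue may grow to 80% while the batch queue is idle -->
  <name>yarn.scheduler.capacity.root.notebooks.maximum-capacity</name>
  <value>80</value>
</property>
<property>
  <!-- a single user may take up to 2x their fair share within the queue -->
  <name>yarn.scheduler.capacity.root.notebooks.user-limit-factor</name>
  <value>2</value>
</property>
```

With preemption enabled on the scheduler, capacity borrowed above the 
guaranteed share is reclaimed when another queue needs it.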

Request latency -- I assume you mean waiting for the executor count to ramp up. 
You can mitigate this by requesting more executors up front, i.e. by increasing 
the initial executor count.
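Concretely, the relevant dynamic allocation settings in spark-defaults.conf 
would look something like this (values are illustrative, not recommendations):

```properties
# spark-defaults.conf -- illustrative values, tune for your workload
spark.dynamicAllocation.enabled                  true
spark.dynamicAllocation.minExecutors             2
spark.dynamicAllocation.initialExecutors         8
spark.dynamicAllocation.maxExecutors             40
# how long a backlog of pending tasks must exist before more executors are requested
spark.dynamicAllocation.schedulerBacklogTimeout  1s
```

A higher initialExecutors means short jobs start with capacity already in hand 
instead of waiting out the ramp-up.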

Issues #3 and #4 aren't really problems if you're enforcing per-user limits. If 
a user is entitled to resources, they are entitled to them, and it's correct 
that their executors stay alive, if desired, to hold on to cached data. If 
they're not entitled, those executors will already be preempted when someone 
else needs them.
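For what it's worth, the trade-off between holding on to cached data and 
releasing idle executors is already tunable in Spark; a sketch with 
illustrative values:

```properties
# spark-defaults.conf -- illustrative values
# idle executors with no cached blocks are released after 60s
spark.dynamicAllocation.executorIdleTimeout        60s
# executors holding cached data are kept longer (the default is infinity,
# which is what makes them appear to be held indefinitely)
spark.dynamicAllocation.cachedExecutorIdleTimeout  30min
```

Setting cachedExecutorIdleTimeout to a finite value bounds how long cached 
data pins an executor, at the cost of recomputation when it expires.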

Issue #4 is in conflict with #3, right? You're saying you want executors with 
cached data to be preempted, but also that losing cached data is a problem. 
Yes, these are the effects, but those are also the concerns being balanced.

Before going any further: what specifically are you proposing that isn't 
already covered by Spark or the resource managers? Nothing here yet sounds 
like that.

> Improvements to dynamic allocation for notebook use cases
> ---------------------------------------------------------
>
>                 Key: SPARK-21084
>                 URL: https://issues.apache.org/jira/browse/SPARK-21084
>             Project: Spark
>          Issue Type: Umbrella
>          Components: Block Manager, Scheduler, Spark Core, YARN
>    Affects Versions: 2.2.0, 2.3.0
>            Reporter: Frederick Reiss
>
> One important application of Spark is to support many notebook users with a 
> single YARN or Spark Standalone cluster.  We at IBM have seen this 
> requirement across multiple deployments of Spark: on-premises and private 
> cloud deployments at our clients, as well as on the IBM cloud.  The scenario 
> goes something like this: "Every morning at 9am, 500 analysts log into their 
> computers and start running Spark notebooks intermittently for the next 8 
> hours." I'm sure that many other members of the community are interested in 
> making similar scenarios work.
>     
> Dynamic allocation is supposed to support these kinds of use cases by 
> shifting cluster resources towards users who are currently executing scalable 
> code.  In our own testing, we have encountered a number of issues with using 
> the current implementation of dynamic allocation for this purpose:
> *Issue #1: Starvation.* A Spark job acquires all available containers, 
> preventing other jobs or applications from starting.
> *Issue #2: Request latency.* Jobs that would normally finish in less than 30 
> seconds take 2-4x longer than normal with dynamic allocation.
> *Issue #3: Unfair resource allocation due to cached data.* Applications that 
> have cached RDD partitions hold onto executors indefinitely, denying those 
> resources to other applications.
> *Issue #4: Loss of cached data leads to thrashing.*  Applications repeatedly 
> lose partitions of cached RDDs because the underlying executors are removed; 
> the applications then need to rerun expensive computations.
>     
> This umbrella JIRA covers efforts to address these issues by making 
> enhancements to Spark.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
