Re: [DISCUSS] CIP-17 Interruption Aware Slot Selection

Aravind Patnam Tue, 13 May 2025 14:47:09 -0700

Hi Nicholas,

Thanks for the feedback and questions! Let me answer your questions below.


> 1. Why is the lack of Flink replication support the motivation for
disruption aware slot selection? Do you mean that disruption aware slot
selection helps to reduce recompution costs for Flink without replication
support?
Yes, this would significantly reduce recomputation costs. This is
especially in an environment like ours internally, where we experience
constant interruptions throughout the day to a large set of workers. By
proactively only using workers that won't get interrupted, we can minimize
this recomputation cost and reduce the task failures.

> 2. Can you provide a complete definition of the /updateInterruptionNotice
interface? Meanwhile, could you also provide the definition of
corresponding CLI interface?
Yes, sure - I have updated the CIP google doc to now include the interface
<https://docs.google.com/document/d/16Lj4KadSb6ypaXTg5tJB0QvaXG8vTLtyoj7V4umTZqw/edit?pli=1&tab=t.0#heading=h.96cl46gz9o8l>
now. In summary, it would be a PUT api, which would take in a List of
*UpdateInterruptionNoticeRequests.
*This object's fields would simply be the worker hostname along with the
next known interruption timestamp in milliseconds. Every time this API is
called, it will reset the workers not included in the request back to no
nextKnownDisruption automatically.

> 3. How is the performance of disruption aware slot selection? Which
scenario could users use disruption aware slot selection?
Internally, we have enabled this feature and we see great improvements to
our task failures. You can see the results here
<https://docs.google.com/document/d/16Lj4KadSb6ypaXTg5tJB0QvaXG8vTLtyoj7V4umTZqw/edit?pli=1&tab=t.0#bookmark=id.asvbrvv0c9q6>
--
we roughly see *20x decrease for Spark task failures related to
interruptions*. This is because we have a high rate of interruptions
internally. Other users with similar environments would greatly benefit
from this feature, as they can enable this to prioritize workers that will
not get interrupted.

> 4. How could the cluster administrator determine
workersWithLateInterruptions and workersWithEarlyInterruptions? BTW, how
does the administrator evaluate the threshold percentile based on the range
of interruption timestamps?
This sort of depends on how the interruptions are scheduled and how much
advance notice is present, along with the application characteristics in
the cluster. We plan on distinguishing the workers as follows in the code:


   -

   prioritizedWorkers = workersWithNoInterruptions +
   workersWithLateInterruptions
   -

   deprioritizedWorkers = workersWithEarlyInterruptions


The calculation is:

   - workersWithLateInterruptions = percentageThreshold *
   sorted(totalWorkers)
   - workersWithEarlyInterruptions = totalWorkers -
   workersWithLateInterruptions


This means that the higher the percentage threshold, the higher the size of
workersWithLateInterruptions, which means the higher the number of
prioritizedWorkers.

For example, if you have a larger advance notice (such as 1 week or so), it
makes sense to set this percentage threshold higher (e.g. 90%). If you have
a smaller advanced notice window, it makes sense to set it something lower
(e.g. 50%).

Consider this example where there are 10 workers with interruption
timestamps t5-t14, and current time t0:

w1

w2

w3

w4

w5

w6

w7

w8

w9

w10

t5

t6

t7

t8

t9

t10

t11

t12

t13

t14

If we set percentageThreshold to 50%, then the following would occur:

   - w1 - w5 would be considered workersWithEarlyInterruptions
   - w6 - w10 would be considered workersWithLateInterruptions

Given current time is only t0 and is far away from t5, this means we are
probably wasting workers w1 - w5 since they have a long time to be
disrupted anyways. Hence, in this case it would make more sense to set a
higher percentage threshold such as 90%. Then it would be like this:

   - w1 would be considered workersWithEarlyInterruptions
   - w2 - w10 would be considered workersWithLateInterruptions

This is better for the cluster given a larger advanced notice.

However, if there is a smaller advanced notice, consider this example:

w1

w2

w3

w4

w5

w6

w7

w8

w9

w10

t1

t2

t3

t4

t5

t6

t7

t8

t9

t10

If we set percentageThreshold to 90%, then the following would occur:

   - w1 would be considered workersWithEarlyInterruptions
   - w2 - w10 would be considered workersWithLateInterruptions

Given current time is only t0 and is close to t1 or t2, medium/long running
jobs would probably incur failures when w2-w10 go down. Hence in this case,
it might be better to set to a lower percentage, such as 50%. Then it would
be like this:

   - w1-5 would be considered workersWithEarlyInterruptions
   - w6 - w10 would be considered workersWithLateInterruptions

This is better for the cluster given a smaller advanced notice.

Overall, the cluster administrator should take into account the
interruption advanced notice time, along with the average/median job
runtime in the cluster before setting this value.
We can add more documentation on how to set this value once we start
implementing it.

Hope this answers all your questions!

Thanks,
Aravind


On Tue, May 13, 2025 at 9:40 AM Nicholas Jiang <[email protected]>
wrote:

> Hi Aravind,
>
> Thanks for driving the proposal of interruption aware slot selection. I
> have some comments for this proposal:
>
> 1. Why is the lack of Flink replication support the motivation for
> disruption aware slot selection? Do you mean that disruption aware slot
> selection helps to reduce recompution costs for Flink without replication
> support?
>
> 2. Can you provide a complete definition of the /updateInterruptionNotice
> interface? Meanwhile, could you also provide the definition of
> corresponding CLI interface?
>
> 3. How is the performance of disruption aware slot selection? Which
> scenario could users use disruption aware slot selection?
>
> 4. How could the cluster administrator determine
> workersWithLateInterruptions and workersWithEarlyInterruptions? BTW, how
> does the administrator evaluate the threshold percentile based on the range
> of interruption timestamps?
>
> Regards,
> Nicholas Jiang
>
> On 2025/05/02 07:40:15 Aravind Patnam wrote:
> > Hi Celeborn community,
> >
> > I have written up CIP-17: Interruption Aware Slot Selection
> > <
> https://docs.google.com/document/d/16Lj4KadSb6ypaXTg5tJB0QvaXG8vTLtyoj7V4umTZqw/edit?usp=sharing
> >.
> > Please review and let me know if there are any comments or questions.
> >
> > This is a feature we have introduced internally, given our heavy volume
> of
> > interruptions. We have seen substantial decrease in task failures in both
> > Flink and Spark jobs, and think the community would also benefit from
> this
> > :)
> >
> > Looking forward to getting feedback from the community.
> >
> > Thanks,
> > Aravind
> >
> >  CIP 17: Interruption Aware Slot Selection
> > <
> https://drive.google.com/open?id=16Lj4KadSb6ypaXTg5tJB0QvaXG8vTLtyoj7V4umTZqw
> >
> >
>


-- 
Aravind K. Patnam

Re: [DISCUSS] CIP-17 Interruption Aware Slot Selection

Reply via email to