[ https://issues.apache.org/jira/browse/AURORA-1096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304513#comment-14304513 ]
Maxim Khutornenko commented on AURORA-1096:
-------------------------------------------
Right, hence the second part. If we apply the failure settings to every
instance, we may penalize large services by not allowing higher failure
tolerances. Also, if {{rollback_on_failure}} is True, should we account for
it in the cap as well? Perhaps letting the update proceed with a warning like
"Your update may fail due to exceeding the allowed event cap" would be a
better alternative to rejecting it outright.
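
To make the "warn instead of reject" idea concrete, here is a rough sketch of what such a check might look like. Everything in it is an assumption for discussion: the cap value, the number of events per instance attempt, the doubling for rollback, and the names {{UpdateSettings}}, {{estimate_worst_case_events}} and {{cap_warning}} are hypothetical and not part of the scheduler.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical constants, chosen only for illustration.
EVENTS_PER_ATTEMPT = 2   # assumed: one "started" and one "finished" event per attempt
EVENT_CAP = 10_000       # assumed per-job lifetime cap on stored events


@dataclass
class UpdateSettings:
    instance_count: int
    max_per_shard_failures: int
    rollback_on_failure: bool = True


def estimate_worst_case_events(settings: UpdateSettings) -> int:
    """Crude upper bound on instance events a flapping update could produce."""
    # Each instance may be attempted once plus once per tolerated failure.
    attempts = settings.instance_count * (settings.max_per_shard_failures + 1)
    events = attempts * EVENTS_PER_ATTEMPT
    if settings.rollback_on_failure:
        # Assumption: a rollback may repeat the same amount of work.
        events *= 2
    return events


def cap_warning(settings: UpdateSettings, cap: int = EVENT_CAP) -> Optional[str]:
    """Return a warning instead of rejecting the update when the bound exceeds the cap."""
    if estimate_worst_case_events(settings) > cap:
        return "Your update may fail due to exceeding the allowed event cap"
    return None
```

With these assumed numbers, a 1000-instance update with {{max_per_shard_failures}} of 4 and rollback enabled would trigger the warning, while a 10-instance update with no tolerated failures would not.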
> Scheduler updater should limit the number of job/instance events
> ----------------------------------------------------------------
>
> Key: AURORA-1096
> URL: https://issues.apache.org/jira/browse/AURORA-1096
> Project: Aurora
> Issue Type: Story
> Components: Scheduler
> Reporter: Maxim Khutornenko
>
> Large/flapping scheduler job updates may generate too many events in the
> update store. The update settings are fully controlled by the user and there
> is a potential for a misconfigured job update to completely overwhelm our
> in-memory DB storage with job update instance events.
> For example, a large flapping update with {{max_per_shard_failures}} and
> {{max_total_failures}} set to max INT can, when left unattended, quickly
> consume all available RAM and kill the scheduler. A manual cleanup of the
> scheduler log would then be needed to bring the scheduler back up.
> This can be especially relevant with the introduction of update heartbeats
> (AURORA-690), which can further exacerbate the problem (e.g. when
> {{blockIfNoPulseAfterMs}} is set too low relative to the external service
> pulse rate).
> We need to cap the max per-job lifetime count of {{JobUpdateEvent}} and
> {{JobInstanceUpdateEvent}} instances. A nice bonus would be providing a hint
> in the UI when the event sequence is cut off.
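
The capping behavior described above could be sketched roughly as follows. This is a minimal in-memory illustration, not the scheduler's storage code: the class name {{CappedEventLog}}, its methods, and the use of plain strings as events are all hypothetical, and {{JobUpdateEvent}} and {{JobInstanceUpdateEvent}} are treated as one undifferentiated stream for simplicity.

```python
from typing import Dict, List, Set


class CappedEventLog:
    """Stores at most max_events_per_job events per job and remembers truncation."""

    def __init__(self, max_events_per_job: int) -> None:
        self.max_events_per_job = max_events_per_job
        self._events: Dict[str, List[str]] = {}
        self._truncated: Set[str] = set()

    def record(self, job_key: str, event: str) -> bool:
        """Store the event; return False (and mark the job) once the cap is hit."""
        events = self._events.setdefault(job_key, [])
        if len(events) >= self.max_events_per_job:
            # Events are never deleted here, so the list length is the lifetime count.
            self._truncated.add(job_key)
            return False
        events.append(event)
        return True

    def events(self, job_key: str) -> List[str]:
        """Return a copy of the stored events for the job."""
        return list(self._events.get(job_key, []))

    def is_truncated(self, job_key: str) -> bool:
        """True if at least one event was dropped; the UI could show a hint."""
        return job_key in self._truncated
```

Dropping new events once the cap is reached (rather than evicting old ones) is only one possible policy; it keeps the beginning of the event sequence intact, and {{is_truncated}} gives the UI something to base a "sequence cut off" hint on.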