Hi,
I don't think the assumption for T2 is correct. For example, a
fetch failure (which needs recomputation in an earlier stage) for the
first task in a stage will cause the stage to be retried, and most of
the actual work will happen with `stage-attempt-number` > 0.
I believe the correct way is to do this calculation at the task level:
look at which tasks have run multiple times and calculate based on that. But
even then there are things like persist that mean the same task
might do different amounts of work, so it might not be clear which CPU
time should be counted for the "not wasted" task.
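As a rough, untested sketch of what I mean (the class name and the choice of
which run to keep are just for illustration), a listener could group observed
run times by (stageId, partition index) so re-executions of the same partition
are visible regardless of which stage attempt ran them:

import scala.collection.mutable
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

class TaskLevelWasteListener extends SparkListener {

  // All observed run times (ms) per logical task, keyed by (stageId, partition index).
  private val runTimes = mutable.Map.empty[(Int, Int), mutable.Buffer[Long]]

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = synchronized {
    val key = (taskEnd.stageId, taskEnd.taskInfo.index)
    runTimes.getOrElseUpdate(key, mutable.Buffer.empty[Long]) += taskEnd.taskInfo.duration
  }

  // Count one run per (stage, partition) as useful and the rest as waste.
  // Which run to keep is exactly the open question above (persisted data can
  // change how much work a re-run does); here the longest run is kept.
  def wastedFraction: Double = synchronized {
    val all  = runTimes.values.map(_.sum.toDouble).sum
    val kept = runTimes.values.map(_.max.toDouble).sum
    if (all == 0.0) 0.0 else (all - kept) / all
  }
}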
/ Emil
On 14/10/2022 14:54, Faiz Halde wrote:
Hello,
We run our Spark workloads on spot instances and we would like to quantify the
impact of spot interruptions on our workloads. We are proposing the
following metric, but would like your opinions on it.
We are leveraging Spark's Event Listener and performing the following
calculation:
T = task
T1 = sum(T.execution-time) for all T where T.status=failed and
T.stage-attempt-number = 0
T2 = sum(T.execution-time) for all T where T.stage-attempt-number > 0
Tall = sum(T.execution-time)
Retry% = (T1 + T2) / Tall
The assumptions are that:
T1 – if a stage is executing for the first time, then only the tasks that
failed were waste
T2 – every task executed for a stage with stage-attempt-number > 0 is a
retry, since the stage had succeeded previously
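A rough sketch of how this could be wired into a SparkListener (class and
method names here are illustrative, and task duration is used as a stand-in
for "execution-time"):

import java.util.concurrent.atomic.LongAdder
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

class RetryPercentListener extends SparkListener {

  private val t1   = new LongAdder // failed tasks in stage attempt 0
  private val t2   = new LongAdder // every task in a stage attempt > 0
  private val tAll = new LongAdder // all task time

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val d = taskEnd.taskInfo.duration
    tAll.add(d)
    if (taskEnd.stageAttemptId == 0 && taskEnd.taskInfo.failed) t1.add(d)
    if (taskEnd.stageAttemptId > 0) t2.add(d)
  }

  // Retry% = (T1 + T2) / Tall
  def retryPercent: Double = {
    val total = tAll.sum.toDouble
    if (total == 0.0) 0.0 else (t1.sum + t2.sum) / total
  }
}

It would be registered with sparkContext.addSparkListener(new RetryPercentListener()).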