Hi,

I don't think the assumption for T2 is correct. For example, if there is a fetch failure (which needs recomputation in an earlier stage) for the first task in a stage, that will cause the stage to be retried, and most of the actual work will happen with `stage-attempt-number` > 0.

I believe the correct way is to do this calculation at the task level: look at which tasks have run multiple times and calculate based on that. But even then, there are things like persist that mean the same task might do different amounts of work, so it might not be clear which CPU time should be counted for the "not wasted" run. A rough sketch of what I mean is below (the listener name, the use of executorRunTime as the time measure, and keeping the longest run per (stageId, partition) as the "not wasted" one are just illustrative choices, not a complete solution):
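```scala
import scala.collection.mutable
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Rough sketch: group run times by (stageId, partition index) and treat every
// run beyond the one we keep (here: the longest) as wasted. As noted above,
// with persist the re-runs may do different amounts of work, so this is only
// an approximation.
class TaskLevelWasteListener extends SparkListener {
  private val runsPerTask = mutable.Map.empty[(Int, Int), List[Long]]

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val key = (taskEnd.stageId, taskEnd.taskInfo.index)
    // taskMetrics can be null for some failed tasks
    val runTime = Option(taskEnd.taskMetrics).map(_.executorRunTime).getOrElse(0L)
    runsPerTask.synchronized {
      runsPerTask(key) = runTime :: runsPerTask.getOrElse(key, Nil)
    }
  }

  // Total time minus one kept run per task (here: the longest run).
  def wastedTime: Long = runsPerTask.synchronized {
    runsPerTask.values.map(runs => runs.sum - runs.max).sum
  }

  def totalTime: Long = runsPerTask.synchronized {
    runsPerTask.values.map(_.sum).sum
  }
}
```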

/ Emil

On 14/10/2022 14:54, Faiz Halde wrote:
Hello,

We run our Spark workloads on spot instances and would like to quantify the impact of spot interruptions on them. We are proposing the following metric, but would like your opinions on it.

We are leveraging Spark's event listener and performing the following calculation:

T = task

T1 = sum(T.execution-time) for all T where T.status=failed and T.stage-attempt-number = 0

T2 = sum(T.execution-time) for all T where T.stage-attempt-number > 0

Tall = sum(T.execution-time)

Retry% = (T1 + T2) / Tall

The assumptions are:

T1 – if a stage is executing for the first time, then only the tasks that failed were wasted work.

T2 – every task executed for a stage with stage-attempt-number > 0 is a retry, since the stage was already executed previously.
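For reference, a minimal sketch of such a listener (the class name and the use of executorRunTime as the "execution time" are illustrative assumptions):

```scala
import org.apache.spark.Success
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// Illustrative listener for the proposed Retry% metric; "execution time" is
// taken to be TaskMetrics.executorRunTime here, which is an assumption.
class RetryMetricListener extends SparkListener {
  private var t1   = 0L // time of failed tasks in a stage's first attempt
  private var t2   = 0L // time of all tasks in later stage attempts
  private var tAll = 0L // time of all tasks

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = synchronized {
    // taskMetrics can be null for some failed tasks
    val runTime = Option(taskEnd.taskMetrics).map(_.executorRunTime).getOrElse(0L)
    tAll += runTime
    val failed = taskEnd.reason != Success
    if (taskEnd.stageAttemptId == 0 && failed) t1 += runTime
    if (taskEnd.stageAttemptId > 0) t2 += runTime
  }

  def retryPercent: Double = synchronized {
    if (tAll == 0) 0.0 else (t1 + t2).toDouble / tAll
  }
}

// Registered on the driver, e.g.:
// spark.sparkContext.addSparkListener(new RetryMetricListener)
```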

