Hello,

Due to the way Spark implements shuffle, the loss of an executor sometimes results in the recomputation of the partitions that were lost.
We define a *partition* as the tuple (RDD-ids, partition id), where RDD-ids is a sequence of RDD ids.

In our system, we define the unit of work performed for a job X as:

    work = count of tasks executed to complete job X

We want to be able to separate the *goodput* from this metric, where goodput is defined as the number of tasks Spark would have had to compute to complete the job had there been zero failures in the cluster.

Using the event listener, would the following work (a rough sketch is in the P.S. below)?

1. Build a hashmap of type [(RDD-ids, partition id), int] with default value = 0.
2. For each task T, do hashmap[(T.RDD-ids, T.partition-id)] += 1.

The assumption here is that Spark will never recompute a *partition* twice when there are no failures. Is this assumption true? If it is, then for any entry, a value greater than 1 means that the partition identified by the tuple (RDD-ids, partition id) was recomputed because Spark thought the partition was "lost".

Given the above data structure, the recomputation cost would be:

    recomputation cost = 1 - (hashmap.size() / sum(hashmap.values()))

Thanks,
Faiz
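P.S. Here is a rough sketch (Scala, against the standard SparkListener API) of the kind of listener I have in mind. The class and field names (RecomputationListener, taskCounts, etc.) are just made up for illustration, and I am assuming that (sorted RDD ids of the stage, task partition index) is a reasonable stand-in for the (RDD-ids, partition id) tuple above. Thread-safety and speculative attempts are ignored for brevity.

import scala.collection.mutable
import org.apache.spark.scheduler.{SparkListener, SparkListenerStageSubmitted, SparkListenerTaskEnd}

class RecomputationListener extends SparkListener {
  // stage id -> the (sorted) ids of the RDDs that stage computes
  private val stageRddIds = mutable.Map[Int, Seq[Int]]()

  // (RDD-ids, partition index) -> number of finished tasks that computed it
  val taskCounts = mutable.Map[(Seq[Int], Int), Int]().withDefaultValue(0)

  override def onStageSubmitted(stageSubmitted: SparkListenerStageSubmitted): Unit = {
    val info = stageSubmitted.stageInfo
    stageRddIds(info.stageId) = info.rddInfos.map(_.id).sorted
  }

  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    // Count every finished task attempt against the partition it computed.
    val rddIds = stageRddIds.getOrElse(taskEnd.stageId, Seq.empty)
    taskCounts((rddIds, taskEnd.taskInfo.index)) += 1
  }

  // 1 - (distinct partitions / tasks executed), as described above.
  def recomputationCost: Double = {
    val total = taskCounts.values.sum
    if (total == 0) 0.0 else 1.0 - taskCounts.size.toDouble / total
  }
}

The idea would be to register it with spark.sparkContext.addSparkListener(new RecomputationListener) (or via the spark.extraListeners conf) and read taskCounts / recomputationCost after the job finishes.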