Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23088#discussion_r237164158

    --- Diff: core/src/main/scala/org/apache/spark/status/AppStatusStore.scala ---
    @@ -222,29 +223,20 @@ private[spark] class AppStatusStore(
         val indices = quantiles.map { q => math.min((q * count).toLong, count - 1) }

         def scanTasks(index: String)(fn: TaskDataWrapper => Long): IndexedSeq[Double] = {
    -      Utils.tryWithResource(
    -        store.view(classOf[TaskDataWrapper])
    -          .parent(stageKey)
    -          .index(index)
    -          .first(0L)
    -          .closeableIterator()
    -      ) { it =>
    -        var last = Double.NaN
    -        var currentIdx = -1L
    -        indices.map { idx =>
    -          if (idx == currentIdx) {
    -            last
    -          } else {
    -            val diff = idx - currentIdx
    -            currentIdx = idx
    -            if (it.skip(diff - 1)) {
    -              last = fn(it.next()).toDouble
    -              last
    -            } else {
    -              Double.NaN
    -            }
    -          }
    -        }.toIndexedSeq
    +      val quantileTasks = store.view(classOf[TaskDataWrapper])
    --- End diff --

    I see, so there's no real way to do this right without deserializing everything, because that's the only way to know for sure which tasks are successful. Leaving it as-is isn't so bad; if failed tasks are infrequent, then the quantiles are still about right.

    I can think of fudges like searching ahead from the current index for the next successful task and using that. For the common case that might make the quantiles a little better: iterate ahead, incrementing idx, and give up when hitting the next index.

    Hm, @vanzin, is that too complex vs. taking the hit and deserializing it all? Or just punt on this?
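For clarity, the "search ahead" fudge being proposed could look roughly like the sketch below. This is not the PR's code: `approxQuantiles` and its `(metricValue, succeeded)` input shape are hypothetical stand-ins for the real `TaskDataWrapper` store view, chosen so the index arithmetic (the same `(q * count)` formula as in the diff) is self-contained. When the task at a target index failed, it scans forward for the next successful task, giving up once it reaches the next target index.

```scala
// Hypothetical sketch of the proposed fudge, not Spark's implementation.
// Input: tasks sorted by the metric of interest, paired with a success flag.
object QuantileFudgeSketch {
  def approxQuantiles(
      tasks: IndexedSeq[(Long, Boolean)], // (metric value, task succeeded), sorted by metric
      quantiles: Seq[Double]): IndexedSeq[Double] = {
    val count = tasks.length
    // Same target-index computation as in scanTasks.
    val indices = quantiles.map { q => math.min((q * count).toLong, count - 1L) }
    indices.zipWithIndex.map { case (idx, i) =>
      // Upper bound for the forward scan: the next target index (or end of data),
      // so one quantile's scan never runs past the next quantile's position.
      val limit =
        math.max(idx + 1, if (i + 1 < indices.length) indices(i + 1) else count.toLong)
      // Search ahead from idx for the first successful task before `limit`.
      var j = idx
      while (j < limit && !tasks(j.toInt)._2) j += 1
      if (j < limit) tasks(j.toInt)._1.toDouble else Double.NaN
    }.toIndexedSeq
  }
}
```

If failed tasks are rare, the scan almost always stops at the first element, so the cost stays close to the existing skip-based approach; the trade-off is the extra index bookkeeping vs. just deserializing every task.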