Github user shahidki31 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23088#discussion_r237154542

--- Diff: core/src/main/scala/org/apache/spark/status/AppStatusStore.scala ---
@@ -222,29 +223,20 @@ private[spark] class AppStatusStore(
     val indices = quantiles.map { q => math.min((q * count).toLong, count - 1) }

     def scanTasks(index: String)(fn: TaskDataWrapper => Long): IndexedSeq[Double] = {
-      Utils.tryWithResource(
-        store.view(classOf[TaskDataWrapper])
-          .parent(stageKey)
-          .index(index)
-          .first(0L)
-          .closeableIterator()
-      ) { it =>
-        var last = Double.NaN
-        var currentIdx = -1L
-        indices.map { idx =>
-          if (idx == currentIdx) {
-            last
-          } else {
-            val diff = idx - currentIdx
-            currentIdx = idx
-            if (it.skip(diff - 1)) {
-              last = fn(it.next()).toDouble
-              last
-            } else {
-              Double.NaN
-            }
-          }
-        }.toIndexedSeq
+      val quantileTasks = store.view(classOf[TaskDataWrapper])
--- End diff --

@srowen Yes, everything is loaded in sorted order based on the index, and then we do the filtering. For the in-memory case this doesn't cause any issue, but for the disk store there is extra deserialization overhead. One possible solution could be, for the disk store case, to fetch the tasks only once and sort them based on the corresponding indices to compute the quantiles. If that solution seems too complicated, then we could instead tell the user that the summary metrics display the quantile summary of all the tasks, instead of only the completed ones. Correct me if I am wrong.
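As a side note on the index formula visible in the diff: once the task metrics are available in sorted order, each requested quantile maps to a single position via `math.min((q * count).toLong, count - 1)`. A minimal standalone sketch of that lookup (the helper name `quantileValues` is hypothetical, not from the PR; it assumes the metric values are already sorted ascending, as the store's indexed view guarantees):

```scala
object QuantileSketch {
  // Hypothetical helper mirroring the index computation in AppStatusStore:
  // for each quantile q, pick the element at min((q * count).toLong, count - 1)
  // from a metric sequence that is assumed to be sorted ascending.
  def quantileValues(sorted: IndexedSeq[Long], quantiles: Seq[Double]): IndexedSeq[Double] = {
    val count = sorted.size
    val indices = quantiles.map { q => math.min((q * count).toLong, count - 1L) }
    indices.map(i => sorted(i.toInt).toDouble).toIndexedSeq
  }
}
```

For example, with 5 sorted durations `[1, 2, 3, 4, 5]` and quantiles `0.0, 0.5, 1.0`, the indices are `0`, `2`, and `4`, giving `1.0`, `3.0`, `5.0`. The skip-based iteration in the removed code computes the same positions without materializing the whole sequence, which is what makes the disk-store deserialization cost of the replacement noticeable.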