Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23088#discussion_r236398316

    --- Diff: core/src/main/scala/org/apache/spark/status/AppStatusStore.scala ---
    @@ -222,29 +223,20 @@ private[spark] class AppStatusStore(
         val indices = quantiles.map { q => math.min((q * count).toLong, count - 1) }

         def scanTasks(index: String)(fn: TaskDataWrapper => Long): IndexedSeq[Double] = {
    -      Utils.tryWithResource(
    -        store.view(classOf[TaskDataWrapper])
    -          .parent(stageKey)
    -          .index(index)
    -          .first(0L)
    -          .closeableIterator()
    -      ) { it =>
    -        var last = Double.NaN
    -        var currentIdx = -1L
    -        indices.map { idx =>
    -          if (idx == currentIdx) {
    -            last
    -          } else {
    -            val diff = idx - currentIdx
    -            currentIdx = idx
    -            if (it.skip(diff - 1)) {
    -              last = fn(it.next()).toDouble
    -              last
    -            } else {
    -              Double.NaN
    -            }
    -          }
    -        }.toIndexedSeq
    +      val quantileTasks = store.view(classOf[TaskDataWrapper])
    --- End diff --

There's a comment at the top of this method that explains why `skip` was used. It avoids deserializing data that is not needed here, which can get quite expensive. IIRC there are about 26 metrics, and the updated code means deserializing `26 * numberOfTasks` task instances from the disk store. With large stages that can become really slow.

Try creating a large stage (e.g. `sc.parallelize(1 to 100000, 100000).count()`), loading the resulting event log through the history server, and checking how long it takes to load the stage page the first time. The goal is to make it reasonably fast (I think it's currently in the 4-5s range); too slow and the page isn't very usable.

If this change makes that too slow, perhaps loading all the successful tasks into memory might be an OK workaround. It sucks (there is a spike in memory usage), but it shouldn't be too bad (guessing ~256 bytes per `TaskDataWrapper` object, 100k tasks ~ 24 MB, which doesn't sound horrible).
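For illustration only, here is a minimal standalone Python model of the skip-based scan the diff removes (not Spark code; the function name and the in-memory list are stand-ins). The point it demonstrates: for each metric, only one element per distinct quantile index is "deserialized", rather than every element of the index.

```python
from itertools import islice

def scan_quantiles(sorted_values, quantiles):
    """Model of the skip-based quantile scan: decode one value per
    distinct quantile index instead of decoding every element.

    Returns (quantile values, number of elements actually decoded)."""
    count = len(sorted_values)
    # Same index formula as the Scala code: min((q * count).toLong, count - 1)
    indices = [min(int(q * count), count - 1) for q in quantiles]

    it = iter(sorted_values)   # stands in for the store's closeable iterator
    decoded = 0                # counts the "expensive" decodes
    last = float("nan")
    current = -1
    out = []
    for idx in indices:
        if idx == current:
            out.append(last)   # repeated index: reuse the last decoded value
            continue
        diff = idx - current
        current = idx
        # Skip diff - 1 elements without decoding them, then take exactly one.
        nxt = next(islice(it, diff - 1, diff), None)
        if nxt is None:
            out.append(float("nan"))
        else:
            decoded += 1       # the only element we pay to decode
            last = float(nxt)
            out.append(last)
    return out, decoded
```

With 100,000 values and five quantiles, `decoded` is 5 rather than 100,000, which is why dropping `skip` in favor of a full scan multiplies the deserialization cost.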