Github user vanzin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/23088#discussion_r236398316
  
    --- Diff: core/src/main/scala/org/apache/spark/status/AppStatusStore.scala ---
    @@ -222,29 +223,20 @@ private[spark] class AppStatusStore(
         val indices = quantiles.map { q => math.min((q * count).toLong, count - 1) }
     
        def scanTasks(index: String)(fn: TaskDataWrapper => Long): IndexedSeq[Double] = {
    -      Utils.tryWithResource(
    -        store.view(classOf[TaskDataWrapper])
    -          .parent(stageKey)
    -          .index(index)
    -          .first(0L)
    -          .closeableIterator()
    -      ) { it =>
    -        var last = Double.NaN
    -        var currentIdx = -1L
    -        indices.map { idx =>
    -          if (idx == currentIdx) {
    -            last
    -          } else {
    -            val diff = idx - currentIdx
    -            currentIdx = idx
    -            if (it.skip(diff - 1)) {
    -              last = fn(it.next()).toDouble
    -              last
    -            } else {
    -              Double.NaN
    -            }
    -          }
    -        }.toIndexedSeq
    +      val quantileTasks = store.view(classOf[TaskDataWrapper])
    --- End diff --
    
    There's a comment at the top of this method that explains why `skip` was 
used. It avoids deserialization of data that is not needed here, which can get 
quite expensive. IIRC there's about 26 metrics, and the updated code means 
deserializing `26 * numberOfTasks` task instances from the disk store. With 
large stages that can become really slow.
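    To make the point concrete, here is a minimal, self-contained sketch of the skip-based quantile scan the removed code performed. It uses a plain `Iterator` with a stand-in `skip` helper instead of the KVStore `CloseableIterator`, so the names here (`QuantileSkipSketch`, `scanQuantiles`, `skip`) are illustrative, not Spark APIs. The key property is that only the elements at the quantile positions are materialized; everything in between is skipped, which in the real code is what avoids deserializing every `TaskDataWrapper`:
    
    ```scala
    object QuantileSkipSketch {
      // Stand-in for CloseableIterator.skip: advance n elements, then report
      // whether a next element is still available.
      def skip(it: Iterator[Long], n: Long): Boolean = {
        var i = 0L
        while (i < n && it.hasNext) { it.next(); i += 1 }
        it.hasNext
      }
    
      // `sorted` plays the role of task metric values already ordered by the
      // store index; only quantiles.length elements are ever materialized.
      def scanQuantiles(sorted: Seq[Long], quantiles: Seq[Double]): IndexedSeq[Double] = {
        val count = sorted.size
        val indices = quantiles.map { q => math.min((q * count).toLong, count - 1L) }
        val it = sorted.iterator
        var last = Double.NaN
        var currentIdx = -1L
        indices.map { idx =>
          if (idx == currentIdx) {
            last
          } else {
            val diff = idx - currentIdx
            currentIdx = idx
            if (skip(it, diff - 1)) {
              last = it.next().toDouble
              last
            } else {
              Double.NaN
            }
          }
        }.toIndexedSeq
      }
    
      def main(args: Array[String]): Unit = {
        val values = 1L to 100L  // pretend these are sorted task durations
        println(scanQuantiles(values, Seq(0.0, 0.25, 0.5, 0.75, 1.0)))
        // Vector(1.0, 26.0, 51.0, 76.0, 100.0)
      }
    }
    ```
    
    Note the iterator is consumed in a single forward pass, which is why `indices` must be visited in increasing order.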
    
    Try creating a large stage (e.g. `sc.parallelize(1 to 100000, 100000).count()`), loading the resulting event log through the history server, and checking how long it takes to load the stage page the first time. The goal is to make it reasonably fast (I think it's currently in the 4-5s range). Too slow and it makes the page not very usable.
    
    If this makes that too slow, perhaps loading all the successful tasks into memory might be an ok workaround. It sucks (there is a spike in memory usage) but shouldn't be too bad (guessing 256 bytes per `TaskDataWrapper` object, 100k tasks ~ 24 MB, which doesn't sound horrible).
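    A quick back-of-the-envelope check of that estimate (the 256 bytes per `TaskDataWrapper` is a guess from the comment above, not a measured number):
    
    ```scala
    object MemEstimate {
      val bytesPerTask = 256L                        // assumed per-object footprint
      val tasks = 100000L
      val totalBytes = bytesPerTask * tasks          // 25,600,000 bytes
      val mib = totalBytes.toDouble / (1024 * 1024)  // ~24.4 MiB
    }
    ```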


---
