GitHub user shivaram commented on the issue:
https://github.com/apache/spark/pull/16261
With the help of @kayousterhout I ran a scheduling microbenchmark (Code
[1]) with 10000 tasks per stage on 20 m2.4xlarge machines on EC2 (160 cores).
From 10 trials, I measured the average time taken per stage.
Before this PR (baseline): 2526.81 ms
With this PR: 1741.99 ms
So overall we get a ~785 ms improvement (~31%) per stage in this case. To pinpoint
where the speedup comes from, I added a timer inside the
function `Task.serializeWithDependencies`[2].
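For reference, the instrumentation was just a wall-clock timer around the serialization call; a minimal sketch of that pattern (names here are illustrative, not the actual Spark internals):

```scala
// Hypothetical helper that times a block of code and reports the elapsed
// wall-clock time in milliseconds, returning the block's result unchanged.
def timed[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body
  val elapsedMs = (System.nanoTime() - start) / 1e6
  println(f"$label%s took $elapsedMs%.4f ms")
  result
}

// e.g. wrap the serialization step:
// val buffer = timed("serializeWithDependencies") { Task.serializeWithDependencies(...) }
```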
Avg. Time to serialize one task without this PR: 0.119954 ms
Avg. Time to serialize one task with this PR: 0.0556422 ms
Thus we save around 0.064 ms of serialization time per task; across the 10,000
tasks in a stage that is roughly 640 ms, which accounts for most of the ~785 ms
improvement.
[1] https://gist.github.com/shivaram/c84d18512fe8ba9c047e3d2b170b9f68
[2]
https://github.com/apache/spark/blob/172a52f5d31337d90155feb7072381e8d5712288/core/src/main/scala/org/apache/spark/scheduler/Task.scala#L224