GitHub user shivaram commented on the issue:
https://github.com/apache/spark/pull/16261
With the help of @kayousterhout I ran a scheduling microbenchmark (Code
[1]) with 10000 tasks per stage on 20 m2.4xlarge machines on EC2 (160 cores).
From 10 trials, I measured the average time taken per stage.
Before this PR (baseline): 2526.81 ms
With this PR: 1741.99 ms
So overall we get a ~785 ms improvement (~31%) per stage in this case. To pinpoint
where the speedup comes from, I added a timer inside the
function `Task.serializeWithDependencies`[2].
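For reference, the instrumentation was just a wall-clock timer around the serialization call; a minimal sketch of that pattern (names here are illustrative, not the actual Spark internals):

```scala
// Hypothetical helper that times a block of code and reports the elapsed
// wall-clock time in milliseconds, returning the block's result unchanged.
def timed[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body
  val elapsedMs = (System.nanoTime() - start) / 1e6
  println(f"$label%s took $elapsedMs%.4f ms")
  result
}

// e.g. wrap the serialization step:
// val buffer = timed("serializeWithDependencies") { Task.serializeWithDependencies(...) }
```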
Avg. Time to serialize one task without this PR: 0.119954 ms
Avg. Time to serialize one task with this PR: 0.0556422 ms
Thus we save around 0.064 ms of serialization time per task; across the 10,000
tasks in a stage that is roughly 640 ms, which accounts for most of the ~785 ms
improvement.
[1] https://gist.github.com/shivaram/c84d18512fe8ba9c047e3d2b170b9f68
[2]
https://github.com/apache/spark/blob/172a52f5d31337d90155feb7072381e8d5712288/core/src/main/scala/org/apache/spark/scheduler/Task.scala#L224