[ 
https://issues.apache.org/jira/browse/SPARK-18890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kay Ousterhout updated SPARK-18890:
-----------------------------------
    Description: 
 As part of benchmarking this change: 
https://github.com/apache/spark/pull/15505 and alternatives, [~shivaram] and I 
found that moving task serialization from TaskSetManager (which happens as part 
of the TaskSchedulerImpl's thread) to CoarseGranedSchedulerBackend leads to 
approximately a 10% reduction in job runtime for a job that counted 10,000 
partitions (that each had 1 int) using 20 machines.  Similar performance 
improvements were reported in the pull request linked above.  This would appear 
to be because the TaskSchedulerImpl thread is the bottleneck, so moving 
serialization to CGSB reduces runtime.  This change may *not* improve runtime 
(and could potentially worsen runtime) in scenarios where the CGSB thread is 
the bottleneck (e.g., if tasks are very large, so calling launch to send the 
tasks to the executor blocks on the network).

One benefit of implementing this change is that it makes it easier to 
parallelize the serialization of tasks (different tasks could be serialized by 
different threads).  Another benefit is that all of the serialization occurs in 
the same place (currently, the Task is serialized in TaskSetManager, and the 
TaskDescription is serialized in CGSB).

I'm not totally convinced we should fix this because it seems like there are 
better ways of reducing the serialization time (e.g., by re-using a single 
serialized object with the Task/jars/files and broadcasting it for each stage) 
but I wanted to open this JIRA to document the discussion.

cc [~witgo]



  was:
 As part of benchmarking this change: 
https://github.com/apache/spark/pull/15505 and alternatives, [~shivaram] and I 
found that moving task serialization from TaskSetManager (which happens as part 
of the TaskSchedulerImpl's thread) to CoarseGranedSchedulerBackend leads to 
approximately a 10% reduction in job runtime for a job that counted 10,000 
partitions (that each had 1 int) using 20 machines.  Similar performance 
improvements were reported in the pull request linked above.  This would appear 
to be because the TaskSchedulerImpl thread is the bottleneck, so moving 
serialization to CGSB reduces runtime.  This change may *not* improve runtime 
(and could potentially worsen runtime) in scenarios where the CGSB thread is 
the bottleneck (e.g., if tasks are very large, so calling launch to send the 
tasks to the executor blocks on the network).

One benefit of implementing this change is that it makes it easier to 
parallelize the serialization of tasks (different tasks could be serialized by 
different threads).  Another benefit is that all of the serialization occurs in 
the same place (currently, the Task is serialized in TaskSetManager, and the 
TaskDescription is serialized in CGSB).

I'm not totally convinced we should fix this because it seems like there are 
better ways of reducing the serialization time (e.g., by re-using the Task 
object within a stage) but I wanted to open this JIRA to document the 
discussion.

cc [~witgo]




> Do all task serialization in CoarseGrainedExecutorBackend thread (rather than 
> TaskSchedulerImpl)
> ------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-18890
>                 URL: https://issues.apache.org/jira/browse/SPARK-18890
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler
>    Affects Versions: 2.1.0
>            Reporter: Kay Ousterhout
>            Priority: Minor
>
>  As part of benchmarking this change: 
> https://github.com/apache/spark/pull/15505 and alternatives, [~shivaram] and 
> I found that moving task serialization from TaskSetManager (which happens as 
> part of the TaskSchedulerImpl's thread) to CoarseGranedSchedulerBackend leads 
> to approximately a 10% reduction in job runtime for a job that counted 10,000 
> partitions (that each had 1 int) using 20 machines.  Similar performance 
> improvements were reported in the pull request linked above.  This would 
> appear to be because the TaskSchedulerImpl thread is the bottleneck, so 
> moving serialization to CGSB reduces runtime.  This change may *not* improve 
> runtime (and could potentially worsen runtime) in scenarios where the CGSB 
> thread is the bottleneck (e.g., if tasks are very large, so calling launch to 
> send the tasks to the executor blocks on the network).
> One benefit of implementing this change is that it makes it easier to 
> parallelize the serialization of tasks (different tasks could be serialized 
> by different threads).  Another benefit is that all of the serialization 
> occurs in the same place (currently, the Task is serialized in 
> TaskSetManager, and the TaskDescription is serialized in CGSB).
> I'm not totally convinced we should fix this because it seems like there are 
> better ways of reducing the serialization time (e.g., by re-using a single 
> serialized object with the Task/jars/files and broadcasting it for each 
> stage) but I wanted to open this JIRA to document the discussion.
> cc [~witgo]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to