[jira] [Commented] (SPARK-19108) Broadcast all shared parts of tasks (to reduce task serialization time)

Imran Rashid (JIRA) Mon, 09 Jan 2017 08:13:18 -0800

    [ 
https://issues.apache.org/jira/browse/SPARK-19108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15812130#comment-15812130
 ]


Imran Rashid commented on SPARK-19108:
--------------------------------------

yeah, great point.  I was thinking two broadcasts just for minimal code changes 
but this would already be a big enough refactor it makes sense to pull them 
together.

> Broadcast all shared parts of tasks (to reduce task serialization time)
> -----------------------------------------------------------------------
>
>                 Key: SPARK-19108
>                 URL: https://issues.apache.org/jira/browse/SPARK-19108
>             Project: Spark
>          Issue Type: Improvement
>          Components: Scheduler
>            Reporter: Kay Ousterhout
>
> Expand the amount of information that's broadcasted for tasks, to avoid 
> serializing data per-task that should only be sent to each executor once for 
> the entire stage.
> Conceptually, this means we'd have new classes  specially for sending the 
> minimal necessary data to the executor, like:
> {code}
> /**
>   * metadata about the taskset needed by the executor for all tasks in this 
> taskset.  Subset of the
>   * full data kept on the driver to make it faster to serialize and send to 
> executors.
>   */
> class ExecutorTaskSetMeta(
>   val stageId: Int,
>   val stageAttemptId: Int,
>   val properties: Properties,
>   val addedFiles: Map[String, String],
>   val addedJars: Map[String, String]
>   // maybe task metrics here?
> )
> class ExecutorTaskData(
>   val partitionId: Int,
>   val attemptNumber: Int,
>   val taskId: Long,
>   val taskBinary: Broadcast[Array[Byte]],
>   val taskSetMeta: Broadcast[ExecutorTaskSetMeta]
> )
> {code}
> Then all the info you'd need to send to the executors would be a serialized 
> version of ExecutorTaskData.  Furthermore, given the simplicity of that 
> class, you could serialize manually, and then for each task you could just 
> modify the first two ints & one long directly in the byte buffer.  (You could 
> do the same trick for serialization even if ExecutorTaskSetMeta was not a 
> broadcast, but that will keep the msgs small as well.)
> There a bunch of details I'm skipping here: you'd also need to do some 
> special handling for the TaskMetrics; the way tasks get started in the 
> executor would change; you'd also need to refactor {{Task}} to let it get 
> reconstructed from this information (or add more to ExecutorTaskSetMeta); and 
> probably other details I'm overlooking now.
> (this is copied from SPARK-18890 and [~imranr]'s comment there; cc 
> [~shivaram])



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-19108) Broadcast all shared parts of tasks (to reduce task serialization time)

Reply via email to