[jira] [Commented] (SPARK-17931) taskScheduler has some unneeded serialization

Giambattista (JIRA) Wed, 01 Mar 2017 06:07:46 -0800

    [ 
https://issues.apache.org/jira/browse/SPARK-17931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15890226#comment-15890226
 ]


Giambattista commented on SPARK-17931:
--------------------------------------

I just wanted to report that after this change Spark fails to execute 
long SQL statements (in my case, long INSERT INTO TABLE statements).
The problem I was facing is described very well in this article: 
https://www.drillio.com/en/2009/java-encoded-string-too-long-64kb-limit/
Eventually, I was able to get them working again with the change below.

--- a/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala
+++ b/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala
@@ -86,7 +86,7 @@ private[spark] object TaskDescription {
     dataOut.writeInt(taskDescription.properties.size())
     taskDescription.properties.asScala.foreach { case (key, value) =>
       dataOut.writeUTF(key)
-      dataOut.writeUTF(value)
+      dataOut.writeUTF(value.substring(0, math.min(value.size, 65*1024/4)))
     }
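
For context, the 64 KB limit comes from java.io.DataOutputStream.writeUTF,
which stores the encoded length in two bytes and throws
UTFDataFormatException when the encoded string is longer than 65535 bytes.
Note that the substring in the patch above avoids the exception by
truncating the value to at most 16640 characters, so the receiving side
sees a shortened property. Below is a minimal sketch of one possible
alternative that keeps the full value, assuming the writer and the reader
are changed together: write a 4-byte length followed by the raw UTF-8
bytes. The names LongPropertyValues, writeUtfFails, writeLongString,
readLongString and roundTrip are hypothetical and are not part of Spark.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream, UTFDataFormatException}
import java.nio.charset.StandardCharsets

// Hypothetical helper object for illustration only; not Spark code.
object LongPropertyValues {

  // Returns true when writeUTF rejects the value because its encoded
  // form is longer than 65535 bytes.
  def writeUtfFails(value: String): Boolean = {
    val out = new DataOutputStream(new ByteArrayOutputStream())
    try {
      out.writeUTF(value)
      false
    } catch {
      case _: UTFDataFormatException => true
    }
  }

  // Writes a 4-byte length prefix followed by the UTF-8 bytes, so the
  // value is limited by Int.MaxValue bytes instead of 65535.
  def writeLongString(out: DataOutputStream, value: String): Unit = {
    val bytes = value.getBytes(StandardCharsets.UTF_8)
    out.writeInt(bytes.length)
    out.write(bytes)
  }

  // Reads a value written by writeLongString.
  def readLongString(in: DataInputStream): String = {
    val bytes = new Array[Byte](in.readInt())
    in.readFully(bytes)
    new String(bytes, StandardCharsets.UTF_8)
  }

  // Encodes and decodes a value in memory to show that nothing is lost.
  def roundTrip(value: String): String = {
    val buffer = new ByteArrayOutputStream()
    val out = new DataOutputStream(buffer)
    writeLongString(out, value)
    out.flush()
    readLongString(new DataInputStream(new ByteArrayInputStream(buffer.toByteArray)))
  }
}
```

A change like this would have to be mirrored wherever the properties are
decoded, because the two encodings are not compatible with each other.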



> taskScheduler has some unneeded serialization
> ---------------------------------------------
>
>                 Key: SPARK-17931
>                 URL: https://issues.apache.org/jira/browse/SPARK-17931
>             Project: Spark
>          Issue Type: Improvement
>          Components: Scheduler
>            Reporter: Guoqiang Li
>            Assignee: Kay Ousterhout
>             Fix For: 2.2.0
>
>
> In the existing code, there are three layers of serialization
> involved in sending a task from the scheduler to an executor:
> - A Task object is serialized
> - The Task object is copied to a byte buffer that also
> contains serialized information about any additional JARs,
> files, and Properties needed for the task to execute. This
> byte buffer is stored as the member variable serializedTask
> in the TaskDescription class.
> - The TaskDescription is serialized (in addition to the serialized
> task + JARs, the TaskDescription class contains the task ID and
> other metadata) and sent in a LaunchTask message.
> While it is necessary to have two layers of serialization, so that
> the JAR, file, and Property info can be deserialized prior to
> deserializing the Task object, the third layer of serialization is
> unnecessary (it is a result of SPARK-2521). We should
> eliminate a layer of serialization by moving the JARs, files, and Properties
> into the TaskDescription class.
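
The two-layer layout proposed in the description can be sketched roughly as
follows. This is an illustration under stated assumptions, not Spark's
actual TaskDescription: the class SketchTaskDescription, its fields, and
the encode/decode methods are hypothetical. The metadata (here only a task
ID and a JAR map) is written field by field, and the already-serialized
Task travels as an opaque byte array, so the metadata can be read without
deserializing the Task itself.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}

// Hypothetical, simplified stand-in for a task description; not Spark's class.
final case class SketchTaskDescription(
    taskId: Long,
    addedJars: Map[String, Long],  // JAR name -> timestamp
    serializedTask: Array[Byte])   // inner layer: the already-serialized Task

object SketchTaskDescription {

  // Outer layer: metadata first, then the opaque task bytes.
  def encode(desc: SketchTaskDescription): Array[Byte] = {
    val buffer = new ByteArrayOutputStream()
    val out = new DataOutputStream(buffer)
    out.writeLong(desc.taskId)
    out.writeInt(desc.addedJars.size)
    desc.addedJars.foreach { case (name, timestamp) =>
      // writeUTF is limited to 65535 encoded bytes; assumed fine for short names.
      out.writeUTF(name)
      out.writeLong(timestamp)
    }
    out.writeInt(desc.serializedTask.length)
    out.write(desc.serializedTask)
    out.flush()
    buffer.toByteArray
  }

  // Reads the metadata back and leaves the task bytes undecoded.
  def decode(bytes: Array[Byte]): SketchTaskDescription = {
    val in = new DataInputStream(new ByteArrayInputStream(bytes))
    val taskId = in.readLong()
    val jarCount = in.readInt()
    val jars = (0 until jarCount).map { _ =>
      val name = in.readUTF()
      name -> in.readLong()
    }.toMap
    val task = new Array[Byte](in.readInt())
    in.readFully(task)
    SketchTaskDescription(taskId, jars, task)
  }
}
```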



