[ 
https://issues.apache.org/jira/browse/SPARK-4550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14383280#comment-14383280
 ] 

Sandy Ryza commented on SPARK-4550:
-----------------------------------

Java serialization appears to write out the full class name the first time an 
object is written and then refer to it by an identifier afterwards:

{code}
scala> val baos = new ByteArrayOutputStream()
scala> val oos = new ObjectOutputStream(baos)
scala> oos.writeObject(new java.util.Date())
scala> oos.flush()

scala> baos.toString
res8: String = ��??sr??java.util.Datehj�?KYt????xpw????LY6: x 
scala> baos.toByteArray.length
res9: Int = 46

scala> oos.writeObject(new java.util.Date())
scala> oos.flush()

scala> baos.toString
res14: String = ��??sr??java.util.Datehj�?KYt????xpw????LY6: xsq?~??w????LY6�Dx 
scala> baos.toByteArray.length
res13: Int = 63

scala> oos.writeObject(new java.util.Date())
scala> oos.flush()

scala> baos.toString
res17: String = ��??sr??java.util.Datehj�?KYt????xpw????LY6: 
xsq?~??w????LY6�Dxsq?~??w????LY8?�x 
scala> baos.toByteArray.length
res18: Int = 80
{code}

There might be some fancy way to listen for the class name being written out 
and relocate that segment to the front of the stream.  However, this seems 
fairly and involved and bug-prone; my opinion is that isn't not worth it given 
that Java ser is already a severely performance-impaired option.   Another 
option of course would be to write the class name in front of every record, but 
this would bloat the serialized representation considerably.

> In sort-based shuffle, store map outputs in serialized form
> -----------------------------------------------------------
>
>                 Key: SPARK-4550
>                 URL: https://issues.apache.org/jira/browse/SPARK-4550
>             Project: Spark
>          Issue Type: Improvement
>          Components: Shuffle, Spark Core
>    Affects Versions: 1.2.0
>            Reporter: Sandy Ryza
>            Assignee: Sandy Ryza
>            Priority: Critical
>         Attachments: SPARK-4550-design-v1.pdf, kryo-flush-benchmark.scala
>
>
> One drawback with sort-based shuffle compared to hash-based shuffle is that 
> it ends up storing many more java objects in memory.  If Spark could store 
> map outputs in serialized form, it could
> * spill less often because the serialized form is more compact
> * reduce GC pressure
> This will only work when the serialized representations of objects are 
> independent from each other and occupy contiguous segments of memory.  E.g. 
> when Kryo reference tracking is left on, objects may contain pointers to 
> objects farther back in the stream, which means that the sort can't relocate 
> objects without corrupting them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to