[ https://issues.apache.org/jira/browse/SPARK-4550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14383280#comment-14383280 ]
Sandy Ryza commented on SPARK-4550: ----------------------------------- Java serialization appears to write out the full class name the first time an object is written and then refer to it by an identifier afterwards: {code} scala> val baos = new ByteArrayOutputStream() scala> val oos = new ObjectOutputStream(baos) scala> oos.writeObject(new java.util.Date()) scala> oos.flush() scala> baos.toString res8: String = ��??sr??java.util.Datehj�?KYt????xpw????LY6: x scala> baos.toByteArray.length res9: Int = 46 scala> oos.writeObject(new java.util.Date()) scala> oos.flush() scala> baos.toString res14: String = ��??sr??java.util.Datehj�?KYt????xpw????LY6: xsq?~??w????LY6�Dx scala> baos.toByteArray.length res13: Int = 63 scala> oos.writeObject(new java.util.Date()) scala> oos.flush() scala> baos.toString res17: String = ��??sr??java.util.Datehj�?KYt????xpw????LY6: xsq?~??w????LY6�Dxsq?~??w????LY8?�x scala> baos.toByteArray.length res18: Int = 80 {code} There might be some fancy way to listen for the class name being written out and relocate that segment to the front of the stream. However, this seems fairly and involved and bug-prone; my opinion is that isn't not worth it given that Java ser is already a severely performance-impaired option. Another option of course would be to write the class name in front of every record, but this would bloat the serialized representation considerably. > In sort-based shuffle, store map outputs in serialized form > ----------------------------------------------------------- > > Key: SPARK-4550 > URL: https://issues.apache.org/jira/browse/SPARK-4550 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core > Affects Versions: 1.2.0 > Reporter: Sandy Ryza > Assignee: Sandy Ryza > Priority: Critical > Attachments: SPARK-4550-design-v1.pdf, kryo-flush-benchmark.scala > > > One drawback with sort-based shuffle compared to hash-based shuffle is that > it ends up storing many more java objects in memory. If Spark could store > map outputs in serialized form, it could > * spill less often because the serialized form is more compact > * reduce GC pressure > This will only work when the serialized representations of objects are > independent from each other and occupy contiguous segments of memory. E.g. > when Kryo reference tracking is left on, objects may contain pointers to > objects farther back in the stream, which means that the sort can't relocate > objects without corrupting them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org