[ 
https://issues.apache.org/jira/browse/SPARK-4550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14221710#comment-14221710
 ] 

Sandy Ryza commented on SPARK-4550:
-----------------------------------

We don't, though it would allow us to be much more efficient in certain 
situations.

The way sort-based shuffle works right now, the map side sorts only by 
partition ID, so we can store that ID alongside each serialized record and 
never need to compare keys at all.
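
To make that concrete, here is a minimal sketch of sorting serialized records
by partition ID alone. It is not Spark's implementation: the class
PartitionSortSketch, its Entry type, and the method names are hypothetical,
and the records are treated as opaque byte arrays that some serializer has
already produced.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only; none of these names come from Spark.
final class PartitionSortSketch {
    /** Pointer to one serialized record: its partition ID and byte range. */
    static final class Entry {
        final int partitionId;
        final int offset;
        final int length;

        Entry(int partitionId, int offset, int length) {
            this.partitionId = partitionId;
            this.offset = offset;
            this.length = length;
        }
    }

    private final ByteArrayOutputStream data = new ByteArrayOutputStream();
    private final List<Entry> index = new ArrayList<>();

    /** Appends an already-serialized record and remembers where it landed. */
    void insert(int partitionId, byte[] serializedRecord) {
        index.add(new Entry(partitionId, data.size(), serializedRecord.length));
        data.write(serializedRecord, 0, serializedRecord.length);
    }

    /** Returns the records ordered by partition ID; no record is deserialized. */
    List<byte[]> sortedRecords() {
        List<Entry> sorted = new ArrayList<>(index);
        // Only the small index entries are compared and moved.
        sorted.sort(Comparator.comparingInt((Entry e) -> e.partitionId));
        byte[] bytes = data.toByteArray();
        List<byte[]> out = new ArrayList<>();
        for (Entry e : sorted) {
            out.add(Arrays.copyOfRange(bytes, e.offset, e.offset + e.length));
        }
        return out;
    }
}
```

The sort touches only the index entries, so the comparison cost does not
depend on the key type or on how expensive deserialization is.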

SPARK-2926 proposes sorting by keys on the map side.  For that, we'd need to 
deserialize keys before comparing them.  There might be situations where this 
is slower than not serializing them in the first place.  But even in those 
situations, we'd get more reliability by stressing GC less.  It would probably 
be good to define raw comparators, which compare the serialized bytes directly 
without deserializing, for common raw-comparable key types like ints and 
strings.
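
As a rough sketch of what such raw comparators could look like, the code
below compares keys straight from their serialized bytes. The encodings are
assumptions made for illustration (4-byte big-endian ints, UTF-8 strings with
a known byte length), not the format of any particular Spark serializer, and
the class and method names are hypothetical.

```java
// Hypothetical illustration only; the encodings are assumed, not Spark's.
final class RawComparatorsSketch {
    /** Compares two ints stored as 4-byte big-endian values. */
    static int compareInts(byte[] a, int aOff, byte[] b, int bOff) {
        return Integer.compare(readInt(a, aOff), readInt(b, bOff));
    }

    /** Decodes a big-endian int in place, without allocating an object. */
    private static int readInt(byte[] buf, int off) {
        return ((buf[off] & 0xFF) << 24)
             | ((buf[off + 1] & 0xFF) << 16)
             | ((buf[off + 2] & 0xFF) << 8)
             | (buf[off + 3] & 0xFF);
    }

    /** Compares two UTF-8 byte ranges as unsigned bytes, lexicographically. */
    static int compareUtf8(byte[] a, int aOff, int aLen,
                           byte[] b, int bOff, int bLen) {
        int n = Math.min(aLen, bLen);
        for (int i = 0; i < n; i++) {
            int x = a[aOff + i] & 0xFF;
            int y = b[bOff + i] & 0xFF;
            if (x != y) {
                return x - y;
            }
        }
        // A shorter range that is a prefix of the longer one sorts first.
        return aLen - bLen;
    }
}
```

One caveat: unsigned byte order on UTF-8 corresponds to Unicode code point
order, which can differ from String.compareTo for characters outside the
Basic Multilingual Plane.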

> In sort-based shuffle, store map outputs in serialized form
> -----------------------------------------------------------
>
>                 Key: SPARK-4550
>                 URL: https://issues.apache.org/jira/browse/SPARK-4550
>             Project: Spark
>          Issue Type: Improvement
>          Components: Shuffle, Spark Core
>    Affects Versions: 1.2.0
>            Reporter: Sandy Ryza
>
> One drawback with sort-based shuffle compared to hash-based shuffle is that 
> it ends up storing many more Java objects in memory.  If Spark could store 
> map outputs in serialized form, it could
> * spill less often because the serialized form is more compact
> * reduce GC pressure
> This will only work when the serialized representations of objects are 
> independent from each other and occupy contiguous segments of memory.  E.g. 
> when Kryo reference tracking is left on, objects may contain pointers to 
> objects farther back in the stream, which means that the sort can't relocate 
> objects without corrupting them.
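
The relocation problem described above can be shown with a toy stream. This
is not Kryo's wire format: the slot-based layout, the "@i" back-reference
marker, and the class BackReferenceSketch are all invented for illustration,
and the toy assumes no literal value starts with "@".

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy format; it only mimics the idea of reference tracking.
final class BackReferenceSketch {
    /**
     * Writes each value into one slot. A value seen earlier is written as
     * "@i", a pointer back to slot i where it first appeared.
     */
    static List<String> write(List<String> values) {
        Map<String, Integer> firstSeen = new HashMap<>();
        List<String> stream = new ArrayList<>();
        for (String v : values) {
            Integer earlier = firstSeen.get(v);
            if (earlier != null) {
                stream.add("@" + earlier);
            } else {
                firstSeen.put(v, stream.size());
                stream.add(v);
            }
        }
        return stream;
    }

    /** Reads one slot, following a back-reference if the slot holds one. */
    static String read(List<String> stream, int slot) {
        String s = stream.get(slot);
        if (s.startsWith("@")) {
            return stream.get(Integer.parseInt(s.substring(1)));
        }
        return s;
    }
}
```

Writing the values x, y, x yields a third slot that points back at slot 0. If
a sort then swaps slots 0 and 1, reading the third slot returns y instead of
x, which is the kind of corruption the description warns about.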


