[ 
https://issues.apache.org/jira/browse/SPARK-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16998752#comment-16998752
 ] 

Shirish commented on SPARK-1476:
--------------------------------

This is an old thread that I happened to land on. I am interested in the 
following point raised by [~mridulm80]. Did anyone ever get around to 
implementing a MultiOutputs-style map that does not need cache? If not, could 
someone point me to where to get started?

_"[~matei] Interesting that you should mention about splitting output of a 
map into multiple blocks._

_We are actually thinking about that in a different context - akin to 
MultiOutputs in hadoop or SPLIT in pig : without needing to cache the 
intermediate output; but directly emit values to different blocks/rdd's based 
on the output of a map or some such."_
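
To make the quoted idea concrete, here is a minimal plain-Java sketch of a single-pass split. It is not Spark API and not an existing implementation: the class `SinglePassSplit`, the helper `splitMap`, and the in-memory lists standing in for output blocks/RDDs are all hypothetical, chosen only to illustrate routing each mapped value to one of several outputs without a second pass over a cached intermediate result.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class SinglePassSplit {
    // Hypothetical helper (not part of Spark): applies mapFn to every input
    // record exactly once and appends the result to the output chosen by
    // routeFn. The lists are stand-ins for separate output blocks.
    public static <A, B> Map<String, List<B>> splitMap(
            Iterable<A> input,
            Function<A, B> mapFn,
            Function<B, String> routeFn) {
        Map<String, List<B>> outputs = new HashMap<>();
        for (A record : input) {
            B mapped = mapFn.apply(record);        // the map step runs once per record
            String target = routeFn.apply(mapped); // pick the output for this value
            outputs.computeIfAbsent(target, k -> new ArrayList<>()).add(mapped);
        }
        return outputs;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> out = SinglePassSplit.<Integer, Integer>splitMap(
                Arrays.asList(1, 2, 3, 4),
                x -> x * 10,
                y -> y >= 30 ? "large" : "small");
        System.out.println(out.get("small"));
        System.out.println(out.get("large"));
    }
}
```

The contrast is with the usual workaround of running one filter per output over the same mapped data, which either recomputes the map for every output or requires caching it.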

 

> 2GB limit in spark for blocks
> -----------------------------
>
>                 Key: SPARK-1476
>                 URL: https://issues.apache.org/jira/browse/SPARK-1476
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>         Environment: all
>            Reporter: Mridul Muralidharan
>            Priority: Critical
>         Attachments: 2g_fix_proposal.pdf
>
>
> The underlying abstraction for blocks in Spark is a ByteBuffer, which limits 
> the size of a block to 2GB.
> This has implications not just for managed blocks in use, but also for shuffle 
> blocks (memory-mapped blocks are limited to 2GB, even though the API allows 
> for a long), ser-deser via byte-array-backed output streams (SPARK-1391), etc.
> This is a severe limitation when Spark is used on non-trivial 
> datasets.
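
The limit described above comes from the JDK rather than from Spark itself, and can be seen in isolation. The following is a small sketch using only `java.nio`; the class name `BlockLimitDemo` and its helper methods are made up for illustration. It shows that a `ByteBuffer` capacity is an `int`, and that `FileChannel.map` accepts a `long` size but rejects any value above `Integer.MAX_VALUE`.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class BlockLimitDemo {
    // ByteBuffer.allocate takes an int and capacity() returns an int, so a
    // single buffer can never hold more than Integer.MAX_VALUE bytes.
    public static long maxBufferBytes() {
        return Integer.MAX_VALUE; // 2^31 - 1, just under 2 GiB
    }

    // Returns true if FileChannel.map refuses the requested size.
    // Intended for sizes above Integer.MAX_VALUE: the argument check fails
    // before any mapping is attempted, so an empty temp file is enough.
    public static boolean mapRejects(long size) throws IOException {
        Path tmp = Files.createTempFile("block-limit", ".bin");
        try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.READ)) {
            ch.map(FileChannel.MapMode.READ_ONLY, 0L, size);
            return false;
        } catch (IllegalArgumentException e) {
            return true; // size was outside the range map() accepts
        } finally {
            Files.deleteIfExists(tmp);
        }
    }

    public static void main(String[] args) throws IOException {
        ByteBuffer small = ByteBuffer.allocate(16);
        int capacity = small.capacity(); // int-typed by the API
        System.out.println("capacity = " + capacity);
        System.out.println("max bytes per buffer = " + maxBufferBytes());
        System.out.println("map rejects 2^31 bytes: "
                + mapRejects((long) Integer.MAX_VALUE + 1));
    }
}
```

Any block abstraction built on a single `ByteBuffer` inherits this ceiling, which is why the description lists managed blocks, shuffle blocks, and byte-array-backed streams together.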



