[ https://issues.apache.org/jira/browse/SPARK-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16998752#comment-16998752 ]
Shirish commented on SPARK-1476:
--------------------------------

This is an old chain that I happened to land on. I am interested in the following points mentioned by [~mridulm80]. Did anyone ever get around to implementing a MultiOutputs-style map without needing to use cache? If not, is there a pointer on how to get started?

_"[~matei] Interesting that you should mention splitting the output of a map into multiple blocks. We are actually thinking about that in a different context - akin to MultiOutputs in hadoop or SPLIT in pig: without needing to cache the intermediate output, but directly emit values to different blocks/rdd's based on the output of a map or some such."_

> 2GB limit in spark for blocks
> -----------------------------
>
>                 Key: SPARK-1476
>                 URL: https://issues.apache.org/jira/browse/SPARK-1476
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>         Environment: all
>            Reporter: Mridul Muralidharan
>            Priority: Critical
>         Attachments: 2g_fix_proposal.pdf
>
> The underlying abstraction for blocks in Spark is a ByteBuffer, which limits the size of a block to 2GB.
> This has implications not just for managed blocks in use, but also for shuffle blocks (memory-mapped blocks are limited to 2GB, even though the API allows for long offsets), ser/deser via byte-array-backed output streams (SPARK-1391), etc.
> This is a severe limitation for the use of Spark on non-trivial datasets.
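For context on the question above: the usual workaround for splitting one dataset into multiple outputs is to cache it and run one filter pass per output, while a MultiOutputs/SPLIT-style map routes each record in a single pass. The contrast can be sketched in plain Python (no Spark APIs; the function names here are hypothetical, purely to illustrate the two access patterns, assuming each record belongs to exactly one output):

```python
def split_single_pass(records, route):
    """MultiOutputs-style behaviour: route each record to exactly one
    output bucket in a single scan of the input."""
    buckets = {}
    for rec in records:
        buckets.setdefault(route(rec), []).append(rec)
    return buckets

def split_by_filters(records, keys, route):
    """The cache-and-filter workaround: materialize the input once
    (analogous to rdd.cache()), then run one filter scan per output."""
    cached = list(records)  # one extra materialization of the whole input
    return {k: [r for r in cached if route(r) == k] for k in keys}

route = lambda x: "even" if x % 2 == 0 else "odd"
a = split_single_pass(range(10), route)
b = split_by_filters(range(10), ["even", "odd"], route)
assert a == b  # same result; the difference is passes over the data
```

The single-pass version scans the input once instead of once per output, which is exactly why the comment asks to avoid the cache-based approach for large intermediates.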