[ https://issues.apache.org/jira/browse/SPARK-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13967947#comment-13967947 ]
Matei Zaharia commented on SPARK-1476:
--------------------------------------

Hey Mridul, the one thing I'd add as an alternative is whether we could have splitting happen at a higher level than the block manager. For example, maybe a map task is allowed to create 2 output blocks for a given reducer, or maybe a cached RDD partition gets stored as 2 blocks. This might be slightly easier to implement than replacing all instances of ByteBuffers. But I agree that this should be addressed somehow, since 2 GB will become more and more limiting over time. Anyway, I'd love to see a more detailed design. I think even the replace-ByteBuffers approach you proposed can be made to work with Tachyon.

> 2GB limit in spark for blocks
> -----------------------------
>
>                 Key: SPARK-1476
>                 URL: https://issues.apache.org/jira/browse/SPARK-1476
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>         Environment: all
>            Reporter: Mridul Muralidharan
>            Priority: Critical
>             Fix For: 1.1.0
>
>
> The underlying abstraction for blocks in Spark is a ByteBuffer, which limits
> the size of a block to 2GB.
> This has implications not just for managed blocks in use, but also for shuffle
> blocks (memory-mapped blocks are limited to 2GB, even though the API accepts a
> long), ser/deser via byte-array-backed output streams (SPARK-1391), etc.
> This is a severe limitation for using Spark on non-trivial datasets.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
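The higher-level splitting the comment proposes could be sketched roughly as below: a logical payload larger than some configurable block size is stored as several physical blocks, each small enough to fit in one ByteBuffer (whose capacity is an `int`, hence the 2 GB ceiling). This is a hypothetical illustration, not Spark code — the name `splitIntoBlocks` and the tiny 64-byte cap are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

public class BlockSplitter {
    // Hypothetical helper: store one logical payload as several physical
    // blocks, each at most maxBlockSize bytes. In the scheme described in
    // the comment, this would happen above the block manager, e.g. when a
    // map task writes output for a reducer or a cached RDD partition is
    // persisted, so no single block ever exceeds a ByteBuffer's int limit.
    static List<byte[]> splitIntoBlocks(byte[] payload, int maxBlockSize) {
        List<byte[]> blocks = new ArrayList<>();
        for (int off = 0; off < payload.length; off += maxBlockSize) {
            int len = Math.min(maxBlockSize, payload.length - off);
            byte[] block = new byte[len];
            System.arraycopy(payload, off, block, 0, len);
            blocks.add(block);
        }
        return blocks;
    }

    public static void main(String[] args) {
        byte[] partition = new byte[150]; // stands in for a >2 GB partition
        List<byte[]> blocks = splitIntoBlocks(partition, 64);
        // 150 bytes with a 64-byte cap -> blocks of 64, 64, and 22 bytes.
        System.out.println(blocks.size() + " blocks, last has "
                + blocks.get(blocks.size() - 1).length + " bytes");
    }
}
```

The reader downstream of such a split never needs a buffer bigger than `maxBlockSize`; the cost is that block lookups must map one logical block id to several physical ones.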