[ https://issues.apache.org/jira/browse/SPARK-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14327394#comment-14327394 ]
Imran Rashid commented on SPARK-1476:
-------------------------------------

Based on discussion on the dev list, [~mridulm80] isn't actively working on this. I'd like to start on it, with the following very minimal goals:

1. Make it *possible* for blocks to be bigger than 2GB
2. Maintain performance on smaller blocks

That is, I'm not going to try to do anything fancy to optimize performance of the large blocks. To that end, my plan is to:

1. Create a {{LargeByteBuffer}} interface, which just has the same methods we use on {{ByteBuffer}}
2. Have one implementation that wraps a single {{ByteBuffer}}, and another that wraps a completely static set of {{ByteBuffer}}s (e.g., if you map a 3 GB file, it will immediately map it to two {{ByteBuffer}}s; nothing fancy like mapping only the first half of the file until the second is needed)
3. Change {{ByteBuffer}} to {{LargeByteBuffer}} in {{ShuffleBlockManager}} and {{BlockStore}}

I see that about a year back there was a lot of discussion on this, and some alternate proposals. I'd like to push forward with a POC to try to move the discussion along again. I know there was some discussion about how important this is, and whether or not we want to support it. IMO this is a big limitation that causes a lot of frustration for users; we really need a solution for it.

> 2GB limit in spark for blocks
> -----------------------------
>
>                 Key: SPARK-1476
>                 URL: https://issues.apache.org/jira/browse/SPARK-1476
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>         Environment: all
>            Reporter: Mridul Muralidharan
>            Assignee: Mridul Muralidharan
>            Priority: Critical
>        Attachments: 2g_fix_proposal.pdf
>
>
> The underlying abstraction for blocks in Spark is a ByteBuffer, which limits
> the size of a block to 2GB.
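A minimal sketch of what steps 1 and 2 of the plan above could look like. Only the name {{LargeByteBuffer}} comes from the proposal; the method signatures, the {{WrappedLargeByteBuffer}} class, and the chunk-lookup strategy are all hypothetical, just to illustrate long-addressable reads over a static set of {{ByteBuffer}}s:

```java
import java.nio.ByteBuffer;

// Hypothetical sketch of the proposed LargeByteBuffer: long-addressable
// reads backed by one or more standard ByteBuffers.
interface LargeByteBuffer {
    long size();          // total capacity; may exceed Integer.MAX_VALUE
    byte get(long index); // absolute read across the underlying chunks
}

// Wraps a completely static array of ByteBuffers; a 3 GB file would be
// mapped up front into two chunks, for example.
class WrappedLargeByteBuffer implements LargeByteBuffer {
    private final ByteBuffer[] chunks;
    private final long[] chunkStart; // absolute offset where each chunk begins
    private final long totalSize;

    WrappedLargeByteBuffer(ByteBuffer... chunks) {
        this.chunks = chunks;
        this.chunkStart = new long[chunks.length];
        long offset = 0;
        for (int i = 0; i < chunks.length; i++) {
            chunkStart[i] = offset;
            offset += chunks[i].capacity();
        }
        this.totalSize = offset;
    }

    @Override public long size() { return totalSize; }

    @Override public byte get(long index) {
        // Binary search for the chunk containing this absolute index.
        int lo = 0, hi = chunks.length - 1;
        while (lo < hi) {
            int mid = (lo + hi + 1) >>> 1;
            if (chunkStart[mid] <= index) lo = mid; else hi = mid - 1;
        }
        return chunks[lo].get((int) (index - chunkStart[lo]));
    }
}
```

A single-{{ByteBuffer}} implementation would be the same idea with the chunk search collapsed to a plain int cast, so small blocks keep their current fast path.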
> This has implications not just for managed blocks in use, but also for shuffle
> blocks (memory-mapped blocks are limited to 2 GB, even though the API accepts
> a long), ser/deser via byte-array-backed output streams (SPARK-1391), etc.
> This is a severe limitation for using Spark on non-trivial datasets.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
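The memory-mapping limit mentioned in the description can be seen directly in standard Java NIO: {{FileChannel.map}} takes a {{long}} size, but because it returns a single {{MappedByteBuffer}} (whose positions are ints), any size over {{Integer.MAX_VALUE}} is rejected up front. A small demonstration (the class and method names here are just for the example):

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MapLimit {
    // Returns true if mapping `size` bytes is rejected before any I/O happens.
    static boolean mapIsRejected(long size) throws IOException {
        Path tmp = Files.createTempFile("map-limit", ".bin");
        try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.READ)) {
            // The size parameter is declared long, but a single
            // MappedByteBuffer cannot address more than Integer.MAX_VALUE
            // bytes, so larger sizes throw IllegalArgumentException.
            ch.map(FileChannel.MapMode.READ_ONLY, 0, size);
            return false;
        } catch (IllegalArgumentException e) {
            return true;
        } finally {
            Files.delete(tmp);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(mapIsRejected(3L * 1024 * 1024 * 1024)); // 3 GB
    }
}
```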