[ https://issues.apache.org/jira/browse/SPARK-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14327394#comment-14327394 ]

Imran Rashid commented on SPARK-1476:
-------------------------------------

Based on the discussion on the dev list, [~mridulm80] isn't actively working on 
this. I'd like to start on it with the following very minimal goals:

1. Make it *possible* for blocks to be bigger than 2GB
2. Maintain performance on smaller blocks

That is, I'm not going to try to do anything fancy to optimize performance for 
large blocks.  To that end, my plan is to:

1. create a {{LargeByteBuffer}} interface, which just exposes the same methods we 
already use on {{ByteBuffer}}
2. have one implementation that wraps a single {{ByteBuffer}}, and another that 
wraps a completely static set of {{ByteBuffer}}s (e.g., if you map a 3 GB file, 
it is immediately mapped into 2 {{ByteBuffer}}s; nothing fancy like lazily 
mapping the second half of the file only when it is needed); a rough sketch of 
both follows this list
3. change {{ByteBuffer}} to {{LargeByteBuffer}} in {{ShuffleBlockManager}} and 
{{BlockStore}}
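
To make that concrete, here is a very rough sketch of what the interface and the 
two implementations could look like (the class names and the method subset are 
just illustrative, not a proposed API):

{code}
import java.nio.ByteBuffer

// Sketch only: exposes just a few of the ByteBuffer methods we rely on,
// but with Long positions/sizes so a buffer can exceed 2 GB.
trait LargeByteBuffer {
  def size: Long
  def position: Long
  def get(): Byte
  def remaining: Long = size - position
}

// The common case: a single ByteBuffer under 2 GB, no extra indirection.
class WrappedLargeByteBuffer(buf: ByteBuffer) extends LargeByteBuffer {
  def size: Long = buf.capacity().toLong
  def position: Long = buf.position().toLong
  def get(): Byte = buf.get()
}

// A completely static set of ByteBuffers, e.g. the 2 chunks of a 3 GB mapped file.
class ChunkedLargeByteBuffer(chunks: Array[ByteBuffer]) extends LargeByteBuffer {
  private var current = 0

  def size: Long = chunks.map(_.capacity().toLong).sum

  def position: Long =
    chunks.take(current).map(_.capacity().toLong).sum + chunks(current).position()

  def get(): Byte = {
    // move on to the next chunk once the current one is exhausted
    while (!chunks(current).hasRemaining && current < chunks.length - 1) {
      current += 1
    }
    chunks(current).get()
  }
}
{code}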

I see that about a year ago there was a lot of discussion on this, along with 
some alternate proposals.  I'd like to push forward with a POC to try to move 
the discussion along again.  I know there was some debate about how important 
this is, and whether or not we want to support it.  IMO this is a big limitation 
that causes a lot of frustration for users; we really need a solution for it.

> 2GB limit in spark for blocks
> -----------------------------
>
>                 Key: SPARK-1476
>                 URL: https://issues.apache.org/jira/browse/SPARK-1476
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>         Environment: all
>            Reporter: Mridul Muralidharan
>            Assignee: Mridul Muralidharan
>            Priority: Critical
>         Attachments: 2g_fix_proposal.pdf
>
>
> The underlying abstraction for blocks in Spark is a ByteBuffer, which limits 
> the size of a block to 2 GB.
> This has implications not just for managed blocks in use, but also for shuffle 
> blocks (memory-mapped blocks are limited to 2 GB, even though the API takes a 
> long), ser/deser via byte-array-backed output streams (SPARK-1391), etc.
> This is a severe limitation when using Spark on non-trivial datasets.
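
For context, the memory-mapping limit mentioned above shows up like this in 
practice (the path and size here are purely illustrative):

{code}
import java.nio.channels.FileChannel
import java.nio.file.{Paths, StandardOpenOption}

object MapLimitDemo {
  def main(args: Array[String]): Unit = {
    // FileChannel.map takes the size as a Long, but the result is a
    // MappedByteBuffer, so any size above Integer.MAX_VALUE is rejected
    // with an IllegalArgumentException.
    val channel = FileChannel.open(Paths.get("/tmp/some-3gb-file"), StandardOpenOption.READ)
    val threeGB = 3L * 1024 * 1024 * 1024
    val mapped = channel.map(FileChannel.MapMode.READ_ONLY, 0, threeGB) // fails here
    channel.close()
  }
}
{code}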


