[ https://issues.apache.org/jira/browse/SPARK-6190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14353940#comment-14353940 ]

Reynold Xin edited comment on SPARK-6190 at 3/10/15 12:14 AM:
--------------------------------------------------------------

Hi [~imranr],

As I said earlier, I would advise against attacking the network transfer 
problem at this point. We don't hear complaints about the 2G limit from users 
that often, and the volume of complaints about the various issues probably 
drops by an order of magnitude at each step, in the following order:
- caching a 2g block
- fetching a 2g non-shuffle block
- fetching a 2g shuffle block
- uploading a 2g block

I think it'd make sense to solve the caching 2g limit first. It is important to 
think about the network part, but I would not try to address it here. It is 
much more complicated to deal with: transferring very large data in one shot 
brings all sorts of resource management problems (e.g. large transfers 
blocking small ones, memory management, allocation...).

For caching, I can think of two ways to do this. The first approach, as 
proposed in this ticket, is to have a large byte buffer abstraction that 
encapsulates multiple smaller buffers. The second approach is to assume the 
block manager can only handle blocks < 2g, and then have the upper layers (e.g. 
CacheManager) handle the chunking and reassembly. It is not yet clear to me 
which one is better. While the first approach provides a better, clearer 
abstraction, the second approach would be less intrusive and allow us to cache 
partial blocks. Do you have any thoughts on this?
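To make the first approach concrete, here is a rough sketch of a read-only 
large buffer backed by multiple sub-2GB chunks and indexed by a long offset. 
This is purely hypothetical (the class and method names are illustrative, not 
the API proposed in the attached design):

```java
import java.nio.ByteBuffer;

// Hypothetical sketch of the first approach: a read-only "large" buffer
// that encapsulates multiple smaller ByteBuffers and exposes long indexing.
class ChunkedByteBuffer {
    private final ByteBuffer[] chunks;   // each chunk is < 2GB
    private final long[] chunkStart;     // global start offset of each chunk
    private final long totalSize;

    ChunkedByteBuffer(ByteBuffer[] chunks) {
        this.chunks = chunks;
        this.chunkStart = new long[chunks.length];
        long size = 0;
        for (int i = 0; i < chunks.length; i++) {
            chunkStart[i] = size;
            size += chunks[i].remaining();
        }
        this.totalSize = size;
    }

    long size() { return totalSize; }

    // Read one byte at a global long offset by locating the owning chunk.
    byte get(long pos) {
        int i = chunks.length - 1;
        while (chunkStart[i] > pos) i--;  // linear scan; binary search in practice
        ByteBuffer c = chunks[i];
        return c.get(c.position() + (int) (pos - chunkStart[i]));
    }
}
```

Note that even in this sketch the read path is straightforward, while the 
question of how such a buffer gets constructed and filled is exactly what the 
current read-only design leaves open.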

Now for the large buffer abstraction here -- I'm confused. The proposed design 
is read-only. How do we even create a buffer?



was (Author: rxin):
Hi [~imranr],

As I said earlier, I would advise against attacking the network transfer 
problem at this point. We don't hear complaints about the 2G limit from users 
that often, and the volume of complaints about the various issues probably 
drops by an order of magnitude at each step, in the following order:
- caching a 2g block
- fetching a 2g non-shuffle block
- fetching a 2g shuffle block
- uploading a 2g block

I think it'd make sense to solve the caching 2g limit first. It is important to 
think about the network part, but I would not try to address it here. It is 
much more complicated to deal with: transferring very large data in one shot 
brings all sorts of resource management problems (e.g. large transfers 
blocking small ones, memory management, allocation...).

For caching, I can think of two ways to do this. First is to have a large byte 
buffer abstraction that encapsulates multiple smaller buffers, as proposed 
here. Another is to assume the block manager can only handle blocks < 2g, and 
then have the upper layers handle the chunking and reassembly. It is not yet 
clear to me which one is better. While the first approach provides a better, 
clearer abstraction, the second approach would allow us to cache partial 
blocks. Do you have any thoughts on this?

Now for the large buffer abstraction here -- I'm confused. The proposed design 
is read-only. How do we even create a buffer?


> create LargeByteBuffer abstraction for eliminating 2GB limit on blocks
> ----------------------------------------------------------------------
>
>                 Key: SPARK-6190
>                 URL: https://issues.apache.org/jira/browse/SPARK-6190
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Spark Core
>            Reporter: Imran Rashid
>            Assignee: Imran Rashid
>         Attachments: LargeByteBuffer.pdf
>
>
> A key component in eliminating the 2GB limit on blocks is creating a proper 
> abstraction for storing more than 2GB.  Currently Spark is limited by its 
> reliance on nio ByteBuffer and netty ByteBuf, both of which are limited to 
> 2GB.  This task will introduce the new abstraction and the relevant 
> implementation and utilities, without affecting the existing implementation 
> at all.
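For context on where the 2GB ceiling in the description comes from: 
java.nio.ByteBuffer sizes and indexes with int, so its capacity tops out at 
Integer.MAX_VALUE bytes and there is no allocate(long) overload. A quick 
illustration (the class name here is just for the example):

```java
import java.nio.ByteBuffer;

public class TwoGbLimit {
    public static void main(String[] args) {
        // ByteBuffer capacity is an int, so ~2GB is the hard ceiling.
        ByteBuffer small = ByteBuffer.allocate(16);
        System.out.println(small.capacity());  // prints 16

        // A 3GB size cannot even be expressed as an int argument: casting
        // the long overflows to a negative value, which allocate() rejects.
        int overflowed = (int) (3L * 1024 * 1024 * 1024);  // negative after cast
        try {
            ByteBuffer.allocate(overflowed);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: negative capacity");
        }
    }
}
```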



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
