[ https://issues.apache.org/jira/browse/SPARK-6190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14353940#comment-14353940 ]
Reynold Xin edited comment on SPARK-6190 at 3/10/15 12:14 AM:
--------------------------------------------------------------

Hi [~imranr],

As I said earlier, I would advise against attacking the network transfer problem at this point. We don't hear complaints about the 2G limit that often from users, and reports of the various issues probably drop by an order of magnitude at each step in the following order:

- caching a 2g block
- fetching a 2g non-shuffle block
- fetching a 2g shuffle block
- uploading a 2g block

I think it'd make sense to solve the caching 2g limit first. It is important to think about the network part, but I would not try to address it here. It is much more complicated to deal with: transferring very large data in one shot brings all sorts of complicated resource management problems (e.g. large transfers blocking small ones, memory management, allocation...).

For caching, I can think of two ways to do this. The first approach, as proposed in this ticket, is to have a large byte buffer abstraction that encapsulates multiple, smaller buffers. The second approach is to assume the block manager can only handle blocks < 2g, and then have the upper layers (e.g. CacheManager) handle the chunking and reassembly. It is not yet clear to me which one is better. While the first approach provides a better, clearer abstraction, the second approach would be less intrusive and would allow us to cache partial blocks. Do you have any thoughts on this?

Now for the large buffer abstraction here -- I'm confused. The proposed design is read-only. How do we even create a buffer?
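To make the first approach concrete, here is a minimal, hypothetical sketch of a read-only wrapper over several java.nio.ByteBuffers whose combined logical size can exceed Integer.MAX_VALUE. The class name, fields, and methods below are illustrative assumptions, not the design in the attached LargeByteBuffer.pdf or anything in Spark itself:

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: a read-only view over multiple underlying
// ByteBuffers, so the total logical size is a long rather than an int.
// Construction wraps existing buffers, which is one possible answer to
// "how do we even create a (read-only) buffer?".
class LargeByteBuffer {
    private final ByteBuffer[] chunks;
    private final long totalSize;
    private int currentChunk = 0;

    LargeByteBuffer(ByteBuffer[] chunks) {
        this.chunks = chunks;
        long size = 0;
        for (ByteBuffer b : chunks) {
            size += b.remaining();
        }
        this.totalSize = size;
    }

    /** Total logical size across all chunks; may exceed Integer.MAX_VALUE. */
    long size() {
        return totalSize;
    }

    /** Reads the next byte, crossing chunk boundaries transparently.
     *  Throws ArrayIndexOutOfBoundsException if nothing remains. */
    byte get() {
        while (!chunks[currentChunk].hasRemaining()) {
            currentChunk++;
        }
        return chunks[currentChunk].get();
    }

    boolean hasRemaining() {
        for (int i = currentChunk; i < chunks.length; i++) {
            if (chunks[i].hasRemaining()) {
                return true;
            }
        }
        return false;
    }
}
```

The second approach would instead keep each chunk as an ordinary sub-2g block in the block manager and put the splitting/reassembly logic in a layer like CacheManager, which is why it is less intrusive and naturally permits caching only some of the chunks.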
> create LargeByteBuffer abstraction for eliminating 2GB limit on blocks
> ----------------------------------------------------------------------
>
> Key: SPARK-6190
> URL: https://issues.apache.org/jira/browse/SPARK-6190
> Project: Spark
> Issue Type: Sub-task
> Components: Spark Core
> Reporter: Imran Rashid
> Assignee: Imran Rashid
> Attachments: LargeByteBuffer.pdf
>
> A key component in eliminating the 2GB limit on blocks is creating a proper
> abstraction for storing more than 2GB. Currently spark is limited by a
> reliance on nio ByteBuffer and netty ByteBuf, both of which are limited at
> 2GB.
> This task will introduce the new abstraction and the relevant
> implementation and utilities, without affecting the existing implementation
> at all.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org