[jira] [Comment Edited] (SPARK-1476) 2GB limit in spark for blocks
[ https://issues.apache.org/jira/browse/SPARK-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16998752#comment-16998752 ]

Shirish edited comment on SPARK-1476 at 12/18/19 2:38 AM:
--
This is an old chain that I happened to land on. I am interested in the following points mentioned by [~mridulm80]. Did anyone ever get around to implementing a MultiOutputs map without needing to use cache? If not, can anyone give me a pointer on how to get started?

"[~matei] Interesting that you should mention splitting the output of a map into multiple blocks. We are actually thinking about that in a different context - akin to MultiOutputs in hadoop or SPLIT in pig: without needing to cache the intermediate output, but directly emit values to different blocks/rdd's based on the output of a map or some such."

> 2GB limit in spark for blocks
> -
>
> Key: SPARK-1476
> URL: https://issues.apache.org/jira/browse/SPARK-1476
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Environment: all
> Reporter: Mridul Muralidharan
> Priority: Critical
> Attachments: 2g_fix_proposal.pdf
>
> The underlying abstraction for blocks in Spark is a ByteBuffer, which limits
> the size of a block to 2GB.
> This has implications not just for managed blocks in use, but also for shuffle
> blocks (memory-mapped blocks are limited to 2GB, even though the API allows
> for long), ser/deser via byte-array-backed output streams (SPARK-1391), etc.
> This is a severe limitation for use of Spark on non-trivial datasets.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
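For context on why the ceiling is exactly 2GB: java.nio.ByteBuffer capacities are int-sized, and FileChannel.map declares its size parameter as a long but still rejects any region longer than Integer.MAX_VALUE. A minimal sketch of the memory-mapping limit mentioned in the description (the temp-file name is illustrative; no 2GB of data is actually touched, since the call fails its precondition check before any I/O):

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;

public class MmapLimit {
    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("mmap-demo", ".bin");
        tmp.deleteOnExit();

        // One byte past the cap: ByteBuffer.allocate(int) could never hold this,
        // and map() enforces the same bound despite taking a long.
        long twoGigPlusOne = Integer.MAX_VALUE + 1L;

        try (RandomAccessFile raf = new RandomAccessFile(tmp, "rw");
             FileChannel ch = raf.getChannel()) {
            ch.map(FileChannel.MapMode.READ_WRITE, 0, twoGigPlusOne);
            System.out.println("mapped");
        } catch (IllegalArgumentException e) {
            // Rejected up front: the size precondition (<= Integer.MAX_VALUE) fails.
            System.out.println("map rejected: " + e.getMessage());
        }
    }
}
```

This is the shape of the failure hit by shuffle blocks: the API signature looks long-capable, but the underlying buffer abstraction is not.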
[jira] [Comment Edited] (SPARK-1476) 2GB limit in spark for blocks
[ https://issues.apache.org/jira/browse/SPARK-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14099333#comment-14099333 ]

Reynold Xin edited comment on SPARK-1476 at 8/15/14 11:12 PM:
--
Let's work together to get something in for 1.2 or 1.3. At the very least, I would like to have a buffer abstraction layer that can support this in the future.
[jira] [Comment Edited] (SPARK-1476) 2GB limit in spark for blocks
[ https://issues.apache.org/jira/browse/SPARK-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968974#comment-13968974 ]

Patrick Wendell edited comment on SPARK-1476 at 4/14/14 11:05 PM:
--
Okay, sounds good - a POC like that would be really helpful. We've run some very large shuffles recently and couldn't isolate any problems except for SPARK-1239, which should only need a small change. If there are smaller fixes you run into, then by all means submit them directly. If it requires large architectural changes, I'd recommend writing a design doc before submitting a pull request, because people will want to discuss the overall approach. I.e., we should avoid being "break-fix" and think about the long-term design implications of changes.
[jira] [Comment Edited] (SPARK-1476) 2GB limit in spark for blocks
[ https://issues.apache.org/jira/browse/SPARK-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13967911#comment-13967911 ]

Patrick Wendell edited comment on SPARK-1476 at 4/13/14 7:01 PM:
-
[~mrid...@yahoo-inc.com] I think the proposed change would benefit from a design doc that explains exactly which cases we want to fix and what trade-offs we are willing to make in terms of complexity. Agreed that there is definitely room for improvement in the out-of-the-box behavior here.

Right now the limits as I understand them are: (a) the shuffle output from one mapper to one reducer cannot be more than 2GB; (b) partitions of an RDD cannot exceed 2GB. I see (a) as the bigger of the two issues. It would be helpful to have specific examples of workloads where this causes a problem, the associated data sizes, etc. For instance, say I want to do a 1-terabyte shuffle. Right now the number of (mappers * reducers) needs to be > ~1000 for this to work (e.g. 100 mappers and 10 reducers), assuming uniform partitioning. That doesn't seem too crazy an assumption, but if you have skew this would be a much bigger problem. Would it be possible to improve (a) but not (b) with a much simpler design? I'm not sure (maybe they reduce to the same problem), but that's something a design doc could help flesh out.

Popping up a bit - I think our goal should be to handle reasonable workloads, not to be 100% compliant with the semantics of Hadoop MapReduce. After all, in-memory RDDs are not even a concept in MapReduce. And keep in mind that MapReduce became so bloated/complex a project that it is no longer possible to make substantial changes to it today. That's something we definitely want to avoid, by keeping Spark internals as simple as possible.
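The back-of-envelope shuffle math above can be made concrete. Assuming perfectly uniform partitioning, the product (mappers * reducers) only has to be large enough to push each map-to-reduce output block under the ~2GB cap; the numbers below are illustrative, matching the 1TB example:

```java
public class ShufflePartitionMath {
    public static void main(String[] args) {
        long totalShuffleBytes = 1L << 40;  // 1 TB shuffle, as in the example above
        long blockCap = Integer.MAX_VALUE;  // ~2GB cap per map->reduce output block

        // Minimum mapper*reducer product so that, under perfectly uniform
        // partitioning, no single shuffle block exceeds the cap.
        long minPairs = (totalShuffleBytes + blockCap - 1) / blockCap;
        System.out.println("min mapper*reducer pairs: " + minPairs);

        // 100 mappers x 10 reducers = 1000 pairs comfortably clears the bound...
        int mappers = 100, reducers = 10;
        long perBlock = totalShuffleBytes / ((long) mappers * reducers);
        System.out.println("uniform bytes per block: " + perBlock);
        // ...but skew concentrates bytes into a few blocks and voids the
        // uniformity assumption entirely, which is the caveat raised above.
    }
}
```

The uniform case needs only ~513 pairs for 1TB, so the quoted ~1000 leaves roughly 2x headroom; skew eats that headroom quickly.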
[jira] [Comment Edited] (SPARK-1476) 2GB limit in spark for blocks
[ https://issues.apache.org/jira/browse/SPARK-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13967854#comment-13967854 ]

Mridul Muralidharan edited comment on SPARK-1476 at 4/13/14 2:45 PM:
-
There are multiple issues at play here:

a) If a block goes beyond 2G, everything fails - this is the case for shuffle, cached, and 'normal' blocks, irrespective of storage level and/or other config. In a lot of cases, particularly when the data generated 'increases' after a map/flatMap/etc, the user has no control over the size increase in the output block for a given input block (think iterations over datasets).
b) Increasing the number of partitions is not always an option (even for the subset of usecases where it can be done): 1) it has an impact on the number of intermediate files created while doing a shuffle (ulimit constraints, IO performance issues, etc); 2) it does not help when there is skew anyway.
c) The compression codec, serializer used, etc have an impact.
d) 2G is an extremely low limit to have on modern hardware: this is a particularly severe limitation when we have nodes running with 32G to 64G of RAM and TBs of disk space available for Spark.

To address specific points raised above:

A) [~pwendell] MapReduce jobs don't fail when the block size of files increases - it might be inefficient, but it still runs (not that I know of any case where it actually does; theoretically I guess it can become inefficient). So the analogy does not apply. Also, 2G is not really an unlimited increase in block size - and in MR, the output of a map can easily go a couple of orders above 2G, whether it is followed by a reduce or not.

B) [~matei] In the specific cases where it was failing, the users were not caching the data but going directly to shuffle. There was no skew from what I see: just that the data size per key is high, and there are a lot of keys too, btw (as iterations increase and nnz increases). Note that it was an impl detail that the data was not being cached - it could have been. Additionally, compression and/or serialization also apply implicitly in this case, since it was impacting shuffle - the 2G limit was observed on both the map and reduce side (in two different jobs).

In general, our effort is to make Spark a drop-in replacement for most usecases which are currently being done via MR/Pig/etc. Limitations of this sort make it difficult to position Spark as a credible alternative.

The current approach we are exploring is to remove all direct references to ByteBuffer from Spark (except for the ConnectionManager, etc. parts), and rely on a BlockData or similar datastructure which encapsulates the data corresponding to a block. By default a single ByteBuffer should suffice, but in case it does not, the class will automatically take care of splitting across buffers. Similarly, all references to byte-array-backed streams will need to be replaced with a wrapper stream which multiplexes over multiple byte array streams. The performance impact for all 'normal' usecases should be minimal, while allowing Spark to be used in cases where the 2G limit is being hit.

The only unknown here is Tachyon integration, where the interface is a ByteBuffer - and I am not knowledgeable enough to comment on what the issues there would be.
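The "BlockData or similar datastructure" proposed above can be illustrated with a minimal chunked-buffer sketch. The class name and API here are hypothetical, not Spark's actual implementation (Spark core did eventually gain a similar ChunkedByteBuffer abstraction); the point is that a logical block spanning several ByteBuffers can report a long size, unconstrained by any single buffer's int capacity:

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a block abstraction spanning multiple ByteBuffers,
// so that one logical block may exceed the 2GB limit of any single buffer.
public class BlockData {
    private final int chunkSize;
    private final List<ByteBuffer> chunks = new ArrayList<>();
    private long size = 0;  // long, not int: the whole point of the design

    public BlockData(int chunkSize) { this.chunkSize = chunkSize; }

    // Append data, transparently spilling into new chunks as each fills up.
    public void write(byte[] data) {
        int off = 0;
        while (off < data.length) {
            ByteBuffer last = chunks.isEmpty() ? null : chunks.get(chunks.size() - 1);
            if (last == null || !last.hasRemaining()) {
                last = ByteBuffer.allocate(chunkSize);
                chunks.add(last);
            }
            int n = Math.min(last.remaining(), data.length - off);
            last.put(data, off, n);
            off += n;
            size += n;
        }
    }

    public long size() { return size; }
    public int chunkCount() { return chunks.size(); }

    public static void main(String[] args) {
        BlockData block = new BlockData(4);  // tiny 4-byte chunks for the demo
        block.write("hello world".getBytes());
        System.out.println(block.size() + " bytes in " + block.chunkCount() + " chunks");
    }
}
```

Callers that today hold one ByteBuffer would instead hold one of these, and the single-buffer case stays a single allocation, which is where the "minimal impact for normal usecases" claim comes from.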
[jira] [Comment Edited] (SPARK-1476) 2GB limit in spark for blocks
[ https://issues.apache.org/jira/browse/SPARK-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13967720#comment-13967720 ]

Patrick Wendell edited comment on SPARK-1476 at 4/13/14 3:41 AM:
-
This says it's a "severe limitation" - but why not just use more, smaller blocks? I think Spark's design assumes in various places that blocks are not extremely large. Think of it like, e.g., the HDFS block size... it can't be arbitrarily large. The answer here might be to use multiple blocks in the case of something like a shuffle, where the size can get really large.