Sandy, can you clarify how it won't cause OOM? Is it in any way related to memory being allocated outside the heap, in native space? The reason I ask is that I have a use case to store shuffle data in HDFS. Since there is no notion of memory-mapped files there, I need to store it as a byte buffer. I want to make sure this will not cause OOM when the file size is large.
--
Kannan

On Tue, Apr 14, 2015 at 9:07 AM, Sandy Ryza <sandy.r...@cloudera.com> wrote:

> Hi Kannan,
>
> Both in MapReduce and Spark, the amount of shuffle data a task produces
> can exceed the task's memory without risk of OOM.
>
> -Sandy
>
> On Tue, Apr 14, 2015 at 6:47 AM, Imran Rashid <iras...@cloudera.com> wrote:
>
>> That limit doesn't have anything to do with the amount of available
>> memory. It's just a tuning parameter: one version is more efficient for
>> smaller files, the other is better for bigger files. I suppose the
>> comment is a little better in FileSegmentManagedBuffer:
>>
>> https://github.com/apache/spark/blob/master/network/common/src/main/java/org/apache/spark/network/buffer/FileSegmentManagedBuffer.java#L62
>>
>> On Tue, Apr 14, 2015 at 12:01 AM, Kannan Rajah <kra...@maprtech.com> wrote:
>>
>> > DiskStore.getBytes uses memory-mapped files if the length is more than
>> > a configured limit. This code path is used during the map-side shuffle
>> > in ExternalSorter. I want to know if it's possible for the length to
>> > exceed the limit in the case of shuffle. The reason I ask is that in
>> > Hadoop, each map task is supposed to produce only data that can fit
>> > within the task's configured max memory; otherwise it will result in
>> > OOM. Is the behavior the same in Spark, or can the size of data
>> > generated by a map task exceed what fits in memory?
>> >
>> > if (length < minMemoryMapBytes) {
>> >   val buf = ByteBuffer.allocate(length.toInt)
>> >   ....
>> > } else {
>> >   Some(channel.map(MapMode.READ_ONLY, offset, length))
>> > }
>> >
>> > --
>> > Kannan
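To see why the mmap branch avoids heap OOM, here is a minimal standalone Java sketch of the same heap-buffer-vs-mmap choice. It is not Spark's actual code: the class name and the threshold value are made up for illustration; the key point is that `FileChannel.map` returns a direct (off-heap) buffer, so its pages are not counted against the JVM heap, while `ByteBuffer.allocate` is a heap copy.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MapVsHeap {
    public static void main(String[] args) throws IOException {
        // Stand-in for a shuffle file segment on disk.
        Path tmp = Files.createTempFile("segment", ".bin");
        Files.write(tmp, new byte[1024]);

        long length = Files.size(tmp);
        long minMemoryMapBytes = 512; // illustrative threshold, not Spark's default

        try (FileChannel channel = FileChannel.open(tmp, StandardOpenOption.READ)) {
            ByteBuffer buf;
            if (length < minMemoryMapBytes) {
                // Small segment: copy into a heap buffer (counts against -Xmx).
                buf = ByteBuffer.allocate((int) length);
                channel.read(buf);
                buf.flip();
            } else {
                // Large segment: memory-map it. The OS pages the file in on
                // demand, outside the JVM heap, so mapping a file much larger
                // than -Xmx does not risk a heap OOM.
                buf = channel.map(FileChannel.MapMode.READ_ONLY, 0, length);
            }
            System.out.println("direct (off-heap): " + buf.isDirect());
            System.out.println("bytes: " + buf.remaining());
        }
        Files.delete(tmp);
    }
}
```

With the 1024-byte file above the 512-byte threshold, the mmap branch is taken and `isDirect()` reports true. This is also why the HDFS case in the question is different: without mmap, the bytes must land in an explicitly allocated buffer, and a buffer sized to a large file can exhaust the heap.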