That's right, it's the reduce operation that makes the in-memory assumption,
not the map (although I am still suspicious about whether the map actually
streams from disk to disk record by record).

In reality, though, my experience is that if Spark cannot fit partitions in
memory it does not work well: I get OOMs. And my fix for OOMs is almost always
simply to increase the number of partitions. Maybe there is a better way that
I am not aware of.
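
To be concrete, this is roughly what I mean by that workaround (a minimal
sketch only; the input path, app name, and the partition count of 2000 are
made up for illustration):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("partition-workaround"))
val pairs = sc.textFile("input").map(line => (line, 1))

// Passing an explicit partition count to the shuffle keeps each task's
// in-memory working set smaller, which is usually what makes the OOMs go away.
val counts = pairs.reduceByKey(_ + _, 2000)
counts.saveAsTextFile("output")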

On Sat, Feb 13, 2016 at 6:32 PM, Daniel Darabos <
daniel.dara...@lynxanalytics.com> wrote:

>
> On Fri, Feb 12, 2016 at 11:10 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> In Spark, every partition needs to fit in the memory available to the
>> core processing it.
>>
>
> That does not agree with my understanding of how it works. I think you
> could do sc.textFile("input").coalesce(1).map(_.replace("A",
> "B")).saveAsTextFile("output") on a 1 TB local file and it would work fine.
> (I just tested this with a 3 GB file with a 1 GB executor.)
>
> RDDs are mostly implemented using iterators. For example map() operates
> line-by-line, never pulling in the whole partition into memory. coalesce()
> also just concatenates the iterators of the parent partitions into a new
> iterator.
>
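A rough sketch of the iterator chaining described above (not Spark's actual
source, just the idea; the method names are made up for illustration):

// Toy illustration, not Spark internals: a mapped RDD's compute() just wraps
// the parent partition's iterator, so only one record is materialized at a time.
def mappedCompute[A, B](parent: Iterator[A], f: A => B): Iterator[B] =
  parent.map(f) // lazy: records are pulled from the parent one by one

// Same idea for coalesce(): lazily concatenate the parents' iterators.
def coalescedCompute[A](parents: Seq[Iterator[A]]): Iterator[A] =
  parents.iterator.flatMap(identity)
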
> Some operations, like reduceByKey(), need to have the whole contents of
> the partition to work. But they typically use ExternalAppendOnlyMap, so
> they spill to disk instead of filling up the memory.
>
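For what it's worth, a toy sketch of that spilling idea (this is not
ExternalAppendOnlyMap itself, and the in-memory "spills" buffer below stands
in for real spill files on disk):

import scala.collection.mutable

// Aggregate into an in-memory map, flush it when it grows past a threshold,
// and merge the flushed chunks at the end, so memory use stays bounded.
def spillingAggregate[K, V](records: Iterator[(K, V)],
                            combine: (V, V) => V,
                            maxInMemory: Int): Iterator[(K, V)] = {
  val spills = mutable.Buffer[Map[K, V]]()   // stand-in for on-disk spill files
  var current = mutable.Map[K, V]()
  for ((k, v) <- records) {
    current(k) = current.get(k).map(combine(_, v)).getOrElse(v)
    if (current.size >= maxInMemory) {
      spills += current.toMap
      current = mutable.Map[K, V]()
    }
  }
  spills += current.toMap
  // merge pass: re-combine values for keys that ended up in more than one chunk
  spills.flatten.groupBy(_._1).map { case (k, kvs) =>
    k -> kvs.map(_._2).reduce(combine)
  }.iterator
}
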
> I know I'm not helping to answer Christopher's issue. Christopher, can you
> perhaps give us an example that we can easily try in spark-shell to see the
> same problem?
>
