You'll benefit from watching Matei's talk at Yahoo on Spark internals and how
it optimizes the execution of iterative jobs.
The simple answer is:
1. Spark doesn't materialize an RDD when you perform an iteration; it lazily
captures the transformation functions in the RDD (only the function and its
closure, no data operation).
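To make the lazy-capture point concrete, here is a toy sketch (MiniRDD is a made-up class, not Spark's actual implementation): map merely records the function, and nothing touches the data until an action like collect() runs the whole chain.

```scala
// Toy illustration of lazy transformation capture, NOT Spark's real RDD.
class MiniRDD[T](compute: () => Seq[T]) {
  // map only captures f and its closure; no data is processed here
  def map[U](f: T => U): MiniRDD[U] =
    new MiniRDD(() => compute().map(f))
  // an action finally evaluates the chain of captured functions
  def collect(): Seq[T] = compute()
}

object LazyDemo {
  def main(args: Array[String]): Unit = {
    var evaluated = false
    val base = new MiniRDD(() => { evaluated = true; Seq(1, 2, 3) })
    val doubled = base.map(_ * 2)   // still lazy: source not evaluated yet
    assert(!evaluated)
    val result = doubled.collect()  // action triggers the computation
    assert(evaluated)
    assert(result == Seq(2, 4, 6))
    println(result.mkString(","))
  }
}
```

Real Spark behaves the same way at a high level: transformations build a lineage graph, and only actions (collect, count, saveAsTextFile, etc.) cause execution.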
Ron,
“Appears to be working” might be true when there are no failures. On large
datasets processed across a large number of machines, failures of several
types (server, network, disk, etc.) can happen. At that point, Spark will not
“know” that you changed the RDD in place, and it will use whatever version it
recomputes from the original lineage.
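A small sketch of why in-place mutation is unsafe (again a toy, not Spark's API): when a partition is lost, Spark recomputes it from the lineage, i.e. the original source plus the recorded transformations, so any in-place edit silently disappears.

```scala
// Toy illustration, NOT Spark's API: recomputation from lineage
// discards in-place edits made to previously computed data.
object RecomputeDemo {
  def main(args: Array[String]): Unit = {
    val source  = () => Array(1, 2, 3)          // the stable input data
    val lineage = () => source().map(_ + 10)    // the recorded transformation

    val partition = lineage()                   // first computation
    partition(0) = 999                          // in-place edit "appears to work"
    assert(partition.sameElements(Array(999, 12, 13)))

    val recovered = lineage()                   // after a failure: recompute
    assert(recovered.sameElements(Array(11, 12, 13)))  // the edit is gone
    println(recovered.mkString(","))
  }
}
```

This is why RDDs are treated as immutable: fault tolerance depends on being able to rebuild any partition purely from its lineage.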
Cluster computing, yes; but iterative? I'm not sure. I have the book
Functional Programming in Scala, and I hope to read it someday and enrich my
understanding here.
Subject: Re: Modifying an RDD in forEach
From: mohitja...@gmail.com
Date: Sat, 6 Dec 2014 13:13:50 -0800
CC: ronalday...@live.com