This is from a separate thread with a differently named title.
Why can't you modify the actual contents of an RDD using foreach? It appears to
be working for me. What I'm doing is changing cluster assignments and distances
per data item for each iteration of the clustering algorithm.
You'll benefit from viewing Matei's talk at Yahoo on Spark internals and how
it optimizes the execution of iterative jobs.
The simple answer is:
1. Spark doesn't materialize an RDD when you apply a transformation; it lazily
captures the transformation functions in the RDD (only the function and its
closure, no data operation).
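To make point 1 concrete, here is a minimal sketch in plain Python (a hypothetical `TinyRDD` class, not Spark's actual implementation) of what "lazily captures the transformation functions" means: calling `map` records a closure in the lineage and touches no data; only an action replays the lineage.

```python
# Hypothetical, simplified stand-in for an RDD: transformations are
# recorded as a lineage of functions, not applied to the data immediately.
class TinyRDD:
    def __init__(self, source, lineage=None):
        self.source = source          # the original data source
        self.lineage = lineage or []  # captured transformation closures

    def map(self, f):
        # No data is touched here; we only capture the function.
        return TinyRDD(self.source, self.lineage + [f])

    def collect(self):
        # Only an action materializes the data, replaying the lineage
        # from the original source.
        data = list(self.source)
        for f in self.lineage:
            data = [f(x) for x in data]
        return data

rdd = TinyRDD([1, 2, 3]).map(lambda x: x * 10)
# Nothing has been computed yet; collect() replays the lineage:
print(rdd.collect())  # [10, 20, 30]
```

Because the lineage can be replayed from the source at any time, Spark never needs to trust an in-place edit you made to previously materialized data.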
Ron,
"appears to be working" might be true when there are no failures. On large
datasets being processed on a large number of machines, failures of several
types (server, network, disk, etc.) can happen. At that time, Spark will not
"know" that you changed the RDD in place, and will use any version of its
partitions when recomputing from the lineage.
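To illustrate the point above, a hedged sketch (again a hypothetical `TinyRDD`, not real Spark code) of why mutating records inside foreach is unsafe: after a failure, a partition is recomputed from its lineage, so any in-place change made during foreach simply disappears.

```python
# Hypothetical sketch: in-place mutation is lost on recomputation.
class TinyRDD:
    def __init__(self, source):
        self.source = source

    def compute(self):
        # Recomputes the partition from the original source every time,
        # as Spark would when a cached or lost copy must be rebuilt.
        return [{"value": x, "cluster": None} for x in self.source]

rdd = TinyRDD([1, 2, 3])
partition = rdd.compute()
for record in partition:      # foreach-style in-place mutation
    record["cluster"] = 0     # e.g. assigning a cluster per data item

# Simulated failure: the partition is rebuilt from the lineage.
recomputed = rdd.compute()
print(recomputed[0]["cluster"])  # None - the mutation is gone
```

This is why the idiomatic approach is to produce a new RDD per iteration (e.g. via `map`) rather than mutating records in foreach.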
Cluster computing, yes; but iterative...? I'm not sure. I have the book
Functional Programming in Scala, and I hope to read it someday and enrich my
understanding here.
Subject: Re: Modifying an RDD in forEach
From: mohitja...@gmail.com
Date: Sat, 6 Dec 2014 13:13:50 -0800
CC: ronalday...@live.com