Modifying an RDD in forEach

2014-12-06 Thread Ron Ayoub
This is from a separate thread with a different title. Why can't you modify the actual contents of an RDD using forEach? It appears to be working for me. What I'm doing is changing cluster assignments and distances per data item for each iteration of the clustering algorithm. The

Re: Modifying an RDD in forEach

2014-12-06 Thread Mayur Rustagi
You'll benefit from watching Matei's talk at Yahoo on Spark internals and how it optimizes execution of iterative jobs. The simple answer is: 1. Spark doesn't materialize an RDD when you do an iteration but lazily captures the transformation functions in the RDD (only the function and closure, no data operation
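[Editor's illustration] The lazy capture described above can be sketched with a toy model. This is not Spark code: LazyDataset, traced_double, and the other names are hypothetical, and the class only mimics the idea that a transformation records a function while an action is what actually runs it.

```python
class LazyDataset:
    """Toy stand-in for an RDD: stores a source and recorded functions, not results."""

    def __init__(self, source, transforms=()):
        self.source = source                 # callable that produces the base records
        self.transforms = tuple(transforms)  # recorded functions (the "lineage")

    def map(self, fn):
        # Transformation: only records fn; no data is touched here.
        return LazyDataset(self.source, self.transforms + (fn,))

    def collect(self):
        # Action: runs every recorded function over freshly produced records.
        out = []
        for record in self.source():
            for fn in self.transforms:
                record = fn(record)
            out.append(record)
        return out


calls = []  # records each time the mapped function really executes


def traced_double(x):
    calls.append(x)
    return x * 2


base = LazyDataset(lambda: [1, 2, 3])
doubled = base.map(traced_double)   # nothing has executed yet
calls_before_action = len(calls)
result = doubled.collect()          # the recorded function runs only now
```

In this sketch the mapped function is not called until collect() is invoked, which is the sense in which only the function and its closure are captured.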

Re: Modifying an RDD in forEach

2014-12-06 Thread Mohit Jaggi
Ron, “appears to be working” might be true when there are no failures. On large datasets being processed on a large number of machines, failures of several types (server, network, disk, etc.) can happen. At that time, Spark will not “know” that you changed the RDD in-place and will use any version
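[Editor's illustration] The failure scenario above can be sketched with a toy model of the clustering case from the original question. This is not Spark code and it assumes a simplified picture: a lost partition is rebuilt by re-running its recorded functions on the source data. Point, load_partition, nearest, and assign are hypothetical names.

```python
from dataclasses import dataclass, replace


@dataclass
class Point:
    value: float
    cluster: int = -1   # -1 means "not assigned yet"


def load_partition():
    # Hypothetical source; called every time the partition is (re)built.
    return [Point(0.1), Point(0.9)]


def nearest(p, centers):
    # Index of the center closest to the point.
    return min(range(len(centers)), key=lambda i: abs(p.value - centers[i]))


centers = [0.0, 1.0]

# In-place style: mutate the cached records directly (the forEach approach).
cached = load_partition()
for p in cached:
    p.cluster = nearest(p, centers)
in_place_before_failure = [p.cluster for p in cached]

# Simulated failure: the cached partition is lost and rebuilt from its source.
# Nothing recorded the mutation, so the assignments are gone.
cached = load_partition()
in_place_after_failure = [p.cluster for p in cached]


# Functional style: the assignment is itself a recorded transformation
# that returns new records instead of modifying existing ones.
def assign(p):
    return replace(p, cluster=nearest(p, centers))


lineage = [assign]


def compute():
    records = load_partition()
    for fn in lineage:
        records = [fn(r) for r in records]
    return records


functional_before_failure = [p.cluster for p in compute()]
functional_after_failure = [p.cluster for p in compute()]  # recomputation gives the same answer
```

Under these assumptions, the in-place version silently reverts to unassigned points after the rebuild, while the version that expresses the update as a transformation producing a new dataset gives the same result each time it is recomputed.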

RE: Modifying an RDD in forEach

2014-12-06 Thread Ron Ayoub
cluster computing yes - but iterative... ? not sure. I have that book Functional Programming in Scala and I hope to read it someday and enrich my understanding here. Subject: Re: Modifying an RDD in forEach From: mohitja...@gmail.com Date: Sat, 6 Dec 2014 13:13:50 -0800 CC: ronalday...@live.com