Spark RDDs are lazily computed, so unless an "action" is applied that mandates the computation, no computation takes place. Transformations such as map only record a recipe; in your example the foreach is the first action, which is why b.function(_) only runs once it is added. You can read more in the Spark docs.

On Mar 9, 2016 7:11 AM, "Hemminger Jeff" <j...@atware.co.jp> wrote:
> I'm currently developing a Spark Streaming application.
>
> I have a function that receives an RDD and an object instance as a
> parameter, and returns an RDD:
>
> def doTheThing(a: RDD[A], b: B): RDD[C]
>
> Within the function, I do some processing within a map of the RDD,
> like this:
>
> def doTheThing(a: RDD[A], b: B): RDD[C] = {
>   a.combineByKey(...).map(b.function(_))
> }
>
> I combine the RDD by key, then map the results, calling a function of
> instance b, and return the results.
>
> Here is where I ran into trouble.
>
> In a unit test running Spark in memory, I was able to convince myself
> that this worked well.
>
> But in our development environment, the returned RDD was empty and
> b.function(_) was never executed.
>
> However, when I added an otherwise useless foreach:
>
> def doTheThing(a: RDD[A], b: B): RDD[C] = {
>   val results = a.combineByKey(...).map(b.function(_))
>   results.foreach(p => p)
>   results
> }
>
> Then it works.
>
> So, basically, adding an extra foreach iteration appears to cause
> b.function(_) to execute and return results correctly.
>
> I find this confusing. Can anyone shed some light on why this would be?
>
> Thank you,
> Jeff
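The laziness at play here can be sketched without a Spark cluster, using Scala's own lazy collection views as an analogy (the object name `LazyDemo` is made up for illustration): `map` on a view, like `map` on an RDD, only builds a recipe, and nothing runs until a terminal operation (the analogue of a Spark "action" such as `foreach`, `collect`, or `count`) forces it.

```scala
// Analogy to Spark's lazy RDD evaluation, using a Scala collection view.
// (Hypothetical demo object; no Spark dependency is needed.)
object LazyDemo {
  // Returns (invocations before forcing, invocations after forcing, result).
  def run(): (Int, Int, List[Int]) = {
    var evaluated = 0
    // .view is lazy: like an RDD transformation, map only records what to do.
    val view = (1 to 5).view.map { x => evaluated += 1; x * 2 }
    val before = evaluated   // still 0 -- the mapped function has not run yet
    val result = view.toList // toList plays the role of an "action": it forces evaluation
    (before, evaluated, result)
  }

  def main(args: Array[String]): Unit = {
    val (before, after, result) = run()
    println(s"before action: $before, after action: $after, result: $result")
  }
}
```

This is also why the extra `results.foreach(p => p)` "fixes" the original code: it is the first action applied to the RDD, so it is what triggers `b.function(_)`. Note that each subsequent action recomputes the lineage unless the RDD is cached.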