Thank you, yes, that makes sense. I was aware of transformations and actions, but did not realize foreach was an action. I've found the full list of actions here: http://spark.apache.org/docs/latest/programming-guide.html#actions and it's clear to me now.
Thank you for your help!

On Wed, Mar 9, 2016 at 11:37 AM, Jakob Odersky <ja...@odersky.com> wrote:

> Hi Jeff,
>
> > But in our development environment, the returned RDD results were empty
> > and b.function(_) was never executed
>
> what do you mean by "the returned RDD results were empty", did you try
> running a foreach, collect or any other action on the returned RDD[C]?
>
> Spark provides two kinds of operations on RDDs:
> 1. transformations, which return a new RDD and are lazy and
> 2. actions that actually run an RDD and return some kind of result.
> In your example above, 'map' is a transformation and thus is not
> actually applied until some action (like 'foreach') is called on the
> resulting RDD.
> You can find more information in the Spark Programming Guide
> http://spark.apache.org/docs/latest/programming-guide.html#rdd-operations.
>
> best,
> --Jakob
>
> On Tue, Mar 8, 2016 at 5:41 PM, Hemminger Jeff <j...@atware.co.jp> wrote:
> >
> > I'm currently developing a Spark Streaming application.
> >
> > I have a function that receives an RDD and an object instance as a
> > parameter, and returns an RDD:
> >
> >     def doTheThing(a: RDD[A], b: B): RDD[C]
> >
> > Within the function, I do some processing within a map of the RDD.
> > Like this:
> >
> >     def doTheThing(a: RDD[A], b: B): RDD[C] = {
> >       a.combineByKey(...).map(b.function(_))
> >     }
> >
> > I combine the RDD by key, then map the results calling a function of
> > instance b, and return the results.
> >
> > Here is where I ran into trouble.
> >
> > In a unit test running Spark in memory, I was able to convince myself that
> > this worked well.
> >
> > But in our development environment, the returned RDD results were empty and
> > b.function(_) was never executed.
> >
> > However, when I added an otherwise useless foreach:
> >
> >     def doTheThing(a: RDD[A], b: B): RDD[C] = {
> >       val results = a.combineByKey(...).map(b.function(_))
> >       results.foreach( p => p )
> >       results
> >     }
> >
> > Then it works.
> >
> > So, basically, adding an extra foreach iteration appears to cause
> > b.function(_) to execute and returns results correctly.
> >
> > I find this confusing. Can anyone shed some light on why this would be?
> >
> > Thank you,
> > Jeff
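
For anyone reading this thread later, the lazy behaviour described above can be illustrated with a minimal sketch. It deliberately does not use Spark: as an analogy only, it uses a plain Scala collection view, which is also lazy, so that it runs without any dependencies. The names `LazyMapSketch`, `calls` and the body of `function` are hypothetical stand-ins (the original `b.function` is not shown in the thread), and `toList` plays the role that an action such as `foreach` or `collect` plays on an RDD.

```scala
object LazyMapSketch {
  // Counts how often the mapping function has actually been executed.
  var calls = 0

  // Hypothetical stand-in for b.function(_): doubles its input and records the call.
  def function(x: Int): Int = { calls += 1; x * 2 }

  // Analogue of doTheThing: builds a lazy mapped view and returns it unevaluated.
  // (Assumption: a Scala view stands in for the RDD; this is not Spark code.)
  def doTheThing(a: Seq[Int]): scala.collection.Iterable[Int] =
    a.view.map(function)

  def main(args: Array[String]): Unit = {
    val results = doTheThing(Seq(1, 2, 3))
    // Only the "transformation" has been declared, so function has not run yet.
    println(s"calls after map: $calls")

    // Forcing the view is the analogue of calling an action on the RDD.
    val forced = results.toList
    println(s"calls after toList: $calls, results: $forced")
  }
}
```

With an actual RDD the same pattern should apply: `map` only records the transformation, and `b.function(_)` runs once some action is called on the returned RDD.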