Confusing RDD function

Hemminger Jeff Tue, 08 Mar 2016 17:41:32 -0800

I'm currently developing a Spark Streaming application.

I have a function that receives an RDD and an object instance as  a
parameter, and returns an RDD:


def doTheThing(a: RDD[A], b: B): RDD[C]


Within the function, I do some processing within a map of the RDD.
Like this:


def doTheThing(a: RDD[A], b: B): RDD[C] {

  a.combineByKey(...).map(b.function(_))

}


I combine the RDD by key, then map the results calling a function of
instance b, and return the results.

Here is where I ran into trouble.

In a unit test running Spark in memory, I was able to convince myself that
this worked well.

But in our development environment, the returned RDD results were empty and
b.function(_) was never executed.

However, when I added an otherwise useless foreach:


doTheThing(a: RDD[A], b: B): RDD[C] {

  val results = a.combineByKey(...).map(b.function(_))

  results.foreach( p => p )

  results

}


Then it works.

So, basically, adding an extra foreach iteration appears to cause
b.function(_) to execute and returns results correctly.

I find this confusing. Can anyone shed some light on why this would be?

Thank you,
Jeff

Confusing RDD function

Reply via email to