continuing processing when errors occur

2014-07-24 Thread Art Peel
Our system works with RDDs generated from Hadoop files. It processes each record in a Hadoop file and, for a subset of those records, generates output that is written to an external system via RDD.foreach. There are no dependencies between the records that are processed. If writing to the external
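
For context, the pattern being described might look roughly like the sketch below. shouldEmit and writeToExternalSystem are hypothetical stand-ins for the actual filtering logic and external-system client, which the message does not show:

    import org.apache.spark.{SparkConf, SparkContext}

    object WriteExample {
      // Hypothetical stand-ins for the logic described in the message.
      def shouldEmit(record: String): Boolean = record.nonEmpty
      def writeToExternalSystem(record: String): Unit = println(record)

      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("write-example"))
        // Each line of the Hadoop file becomes one record.
        val records = sc.textFile("hdfs:///path/to/input")
        // Records are independent, so each one can be written on its own.
        records.foreach { record =>
          if (shouldEmit(record)) writeToExternalSystem(record)
        }
        sc.stop()
      }
    }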

Re: continuing processing when errors occur

2014-07-24 Thread Imran Rashid
Hi Art, I have some advice that isn't Spark-specific at all, so it doesn't *exactly* address your questions, but you might still find it helpful. I think using an implicit to add your retrying behavior might be useful. I can think of two options: 1. enriching RDD itself, e.g. to add a
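
To make option 1 concrete, an enrichment along these lines would add a retrying foreach to any RDD. The names RetryableRDD and foreachWithRetries are mine, not from the message, and whether the final failure should propagate or be swallowed is a design choice:

    import org.apache.spark.rdd.RDD

    // Sketch of option 1: an implicit "enrichment" adding retry behavior to RDD.
    implicit class RetryableRDD[A](rdd: RDD[A]) {
      def foreachWithRetries(nTries: Int)(f: A => Unit): Unit =
        rdd.foreach { a =>
          var tries = 0
          var success = false
          while (!success && tries < nTries) {
            tries += 1
            try { f(a); success = true }
            catch { case e: Exception if tries < nTries => () } // last failure propagates
          }
        }
    }

    // With the implicit in scope:
    //   records.foreachWithRetries(3)(writeToExternalSystem)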

Re: continuing processing when errors occur

2014-07-24 Thread Imran Rashid
whoops! just realized I was retrying the function even on success. didn't pay enough attention to the output from my calls. Slightly updated definitions (the "=>" in the parameter type is mangled to "=" in the archive, and the message is cut off mid-definition):

    class RetryFunction[-A](nTries: Int, f: A => Unit) extends Function[A, Unit] {
      def apply(a: A): Unit = {
        var tries = 0
        var success =
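
A plausible completion of the fixed version, based on the bug described (retry only until the first success), would be something like the following. Everything past "var success =" is my reconstruction, not from the original message, and Serializable is added on the assumption that instances need to ship inside Spark closures:

    class RetryFunction[-A](nTries: Int, f: A => Unit)
        extends Function[A, Unit] with Serializable {
      def apply(a: A): Unit = {
        var tries = 0
        var success = false
        // Stop as soon as f succeeds; the earlier version kept retrying regardless.
        while (!success && tries < nTries) {
          tries += 1
          try {
            f(a)
            success = true
          } catch {
            case e: Exception =>
              println("failed attempt " + tries + " of " + nTries + ": " + e)
          }
        }
      }
    }

    // Usage, e.g. inside an RDD action:
    //   records.foreach(new RetryFunction(3, writeToExternalSystem))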