Hello Spark fellows :)

I'm a new user of Spark and Scala and have been using both for 6 months without 
too many problems.
Here I'm looking for best practices for using non-serializable classes inside 
closures. I'm using Spark-0.9.0-incubating here with Hadoop 2.2.

Suppose I am using the OpenCSV parser to parse an input file. Inside my main:

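import org.apache.spark.SparkContext
import au.com.bytecode.opencsv.CSVParser

val sc = new SparkContext("local[2]", "App")
val heyRDD = sc.textFile("...")

// The parser is created on the driver and captured by the closure below
val csvparser = new CSVParser(';')
val heyMap = heyRDD.map { line =>
  val temp = csvparser.parseLine(line)
  (temp(1), temp(4))
}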

This gives me a java.io.NotSerializableException: 
au.com.bytecode.opencsv.CSVParser, which seems reasonable.

From here I could see 3 solutions:
1/ Extending CSVParser to make it Serializable, which adds a lot of 
boilerplate code if you ask me
2/ Using Kryo serialization (still need to define a serializer)
3/ Creating an object holding an instance of the class I want to use, typically:

object CSVParserPlus {

  val csvParser = new CSVParser(';')

  def parse(line: String) = {
    csvParser.parseLine(line)
  }
}

val heyMap = heyRDD.map { line =>
  val temp = CSVParserPlus.parse(line)
  (temp(1), temp(4))
}

The third solution works and I don't understand why, so I was wondering how 
the closure system inside Spark manages to serialize an object that holds a 
non-serializable instance. How does that work? Does it hinder performance? Is 
it a good solution? How do you handle this problem?
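
For comparison, here is a rough sketch of what I imagine solution 1 could look 
like if I wrap the parser instead of subclassing it. It is only a sketch under 
assumptions: FakeCsvParser is a hypothetical stand-in for OpenCSV's CSVParser 
(so the snippet runs without the library), SerializableCsvParser and WrapperDemo 
are names I made up, and the round trip uses plain java.io serialization, which 
I assume is close to what Spark does with closures by default.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical stand-in for au.com.bytecode.opencsv.CSVParser.
// Like the real parser, it does NOT implement Serializable.
class FakeCsvParser(separator: Char) {
  def parseLine(line: String): Array[String] = line.split(separator)
}

// Serializable wrapper: only the separator travels over the wire.
// The parser field is transient and lazy, so it is skipped during
// serialization and rebuilt on first use after deserialization.
class SerializableCsvParser(separator: Char) extends Serializable {
  @transient private lazy val parser = new FakeCsvParser(separator)

  def parse(line: String): Array[String] = parser.parseLine(line)
}

object WrapperDemo {
  // Serializes a value to bytes and reads it back, mimicking a trip
  // from the driver to a worker.
  def roundTrip[T <: AnyRef](value: T): T = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(value)
    out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
    try in.readObject().asInstanceOf[T] finally in.close()
  }
}
```

With something like this, I would expect to be able to create the wrapper in my 
main and call its parse method inside heyRDD.map, since the closure would then 
capture a Serializable object. I have not measured whether it behaves any 
differently from solution 3.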

Any input would be greatly appreciated.

Best regards,


