Thank you for the quick answer, it looks good to me. That brings me to another question, though. Suppose we want to open a connection to a database, Elasticsearch, etc.
I see two approaches:

1/ use .mapPartitions and set up the connection at the start of each partition, so I get one connection per partition
2/ use a singleton object, which loads one connection per executor, if my understanding is correct

I would have used the second approach, so that I don't create a new connection for a partition each time the partition fails to compute for whatever reason. I also wouldn't have many connections open in parallel, because there is only one connection per worker, whereas with the first approach 200 partitions running in parallel means 200 connections.

But in the second case, a partition could kill the connection on the worker during computation, and because that connection is shared by all tasks of the executor, all partitions would fail. Also, a single connection object would have to handle 200 partitions trying to write to Elasticsearch/database/etc., which may be bad performance-wise. So for now I can't see a case where the second approach is preferable. Sadly, it doesn't seem I could use that singleton object to share data within an executor...

Thanks for the input,
Fanilo

-----Original Message-----
From: Sean Owen [mailto:so...@cloudera.com]
Sent: Thursday, September 4, 2014 15:36
To: Andrianasolo Fanilo
Cc: user@spark.apache.org
Subject: Re: Object serialisation inside closures

In your original version, the object is referenced by the function but it's on the driver, and so has to be serialized. This leads to an error since it's not serializable. Instead, you want to recreate the object locally on each of the remote machines.

In your third version you are holding the parser in a static member of a class, in your Scala object. When you call the parse method, you're calling it on the instance of the CSVParserPlus class that was loaded on the remote worker. It loads and creates its own copy of the parser.

A possibly more compact solution is to use mapPartitions, and create the parser once at the start.
This avoids needing this static / singleton pattern, but also means the parser is created only once per partition.

On Thu, Sep 4, 2014 at 2:29 PM, Andrianasolo Fanilo <fanilo.andrianas...@worldline.com> wrote:
> Hello Spark fellows :)
>
> I’m a new user of Spark and Scala and have been using both for 6
> months without too many problems.
>
> Here I’m looking for best practices for using non-serializable classes
> inside closures. I’m using Spark-0.9.0-incubating here with Hadoop 2.2.
>
> Suppose I am using the OpenCSV parser to parse an input file. So inside my main:
>
> val sc = new SparkContext("local[2]", "App")
> val heyRDD = sc.textFile("…")
>
> val csvparser = new CSVParser(';')
> val heyMap = heyRDD.map { line =>
>   val temp = csvparser.parseLine(line)
>   (temp(1), temp(4))
> }
>
> This gives me a java.io.NotSerializableException:
> au.com.bytecode.opencsv.CSVParser, which seems reasonable.
>
> From here I could see 3 solutions:
>
> 1/ Extending CSVParser to make it Serializable, which adds a lot
> of boilerplate code if you ask me
>
> 2/ Using Kryo serialization (still need to define a serializer)
>
> 3/ Creating an object with an instance of the class I want to use,
> typically:
>
> object CSVParserPlus {
>
>   val csvParser = new CSVParser(';')
>
>   def parse(line: String) = {
>     csvParser.parseLine(line)
>   }
> }
>
> val heyMap = heyRDD.map { line =>
>   val temp = CSVParserPlus.parse(line)
>   (temp(1), temp(4))
> }
>
> The third solution works and I don’t understand how, so I was wondering
> how the closure system inside Spark manages to serialize an object
> holding a non-serializable instance. How does that work? Does it hinder
> performance? Is it a good solution? How do you manage this problem?
>
> Any input would be greatly appreciated
>
> Best regards,
>
> Fanilo
>
> ________________________________
>
> This e-mail and the documents attached are confidential and intended
> solely for the addressee; it may also be privileged. If you receive
> this e-mail in error, please notify the sender immediately and destroy
> it. As its integrity cannot be secured on the Internet, the Worldline
> liability cannot be triggered for the message content. Although the
> sender endeavours to maintain a computer virus-free network, the
> sender does not warrant that this transmission is virus-free and will
> not be liable for any damages resulting from any virus transmitted.
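
For illustration, the two connection strategies raised at the top of this thread can be sketched in plain Scala, without a Spark dependency. This is only a sketch under assumptions: FakeConnection, ConnectionStats, PerPartition and SharedConnection are hypothetical names that do not come from Spark or from any client library, and FakeConnection merely stands in for a real database or Elasticsearch client. Each writePartition has the shape of a function that could be applied to the iterator of one partition.

```scala
import java.util.concurrent.atomic.AtomicInteger
import scala.collection.mutable.ArrayBuffer

// Hypothetical helper: counts how many connections have been opened in this JVM.
object ConnectionStats {
  val opened = new AtomicInteger(0)
}

// Hypothetical stand-in for a non-serializable client (database, Elasticsearch, ...).
// It only records what was written so the behaviour can be observed.
final class FakeConnection {
  ConnectionStats.opened.incrementAndGet()
  private val buffer = ArrayBuffer.empty[String]

  def write(record: String): Unit = synchronized { buffer += record; () }
  def written: List[String] = synchronized { buffer.toList }
  def close(): Unit = ()
}

// Approach 1: one connection per partition, opened at the start of the
// partition and closed when the partition is done (or has failed).
object PerPartition {
  def writePartition(records: Iterator[String]): Int = {
    val conn = new FakeConnection
    try records.foldLeft(0) { (n, r) => conn.write(r); n + 1 }
    finally conn.close()
  }
}

// Approach 2: one connection per JVM, created lazily on first use.
// A Scala object is initialized once per JVM, so every task running in the
// same executor shares this instance; it is never closed by a single task.
object SharedConnection {
  lazy val conn: FakeConnection = new FakeConnection

  def writePartition(records: Iterator[String]): Int =
    records.foldLeft(0) { (n, r) => conn.write(r); n + 1 }
}
```

In a Spark job, assuming rdd is an RDD[String], either function could be passed as rdd.foreachPartition(it => PerPartition.writePartition(it)). With the singleton, the connection is created on the executor rather than serialized from the driver, and it is shared by all tasks in that JVM, which is the trade-off discussed above.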