Hi,

We are using the Apache Toree kernel in Jupyter Notebook to communicate with Spark and are having difficulty running some simple code. The code works from the spark-shell but fails in Jupyter Notebook. The idea is simply to take several elements from the DataFrame and put them into a LabeledPoint. Whenever the code is run from Jupyter Notebook, it throws an "object not serializable" error. The code is as follows:
    import scala.util.Random.{setSeed, nextDouble}
    setSeed(1)

    case class Record(
      foo: Double, target: Double, x1: Double, x2: Double, x3: Double
    ) extends Serializable

    val rows = sc.parallelize(
      (1 to 10).map(_ => Record(
        nextDouble, nextDouble, nextDouble, nextDouble, nextDouble
      ))
    )
    val df = spark.sqlContext.createDataFrame(rows)
    df.registerTempTable("df")

    spark.sqlContext.sql("""
      SELECT ROUND(foo, 2) foo,
             ROUND(target, 2) target,
             ROUND(x1, 2) x1,
             ROUND(x2, 2) x2,
             ROUND(x3, 2) x3
      FROM df""").show

    // Or if you want to exclude columns
    val ignored = List("foo", "target", "x2")

    // Map feature names to indices
    val featInd = df.columns.diff(ignored).map(df.columns.indexOf(_))

    // Get index of target
    val targetInd = df.columns.indexOf("target")

    val label_data = df.rdd.map(r => org.apache.spark.mllib.regression.LabeledPoint(
      r.getDouble(targetInd), // Get target value
      // Map feature indices to values
      org.apache.spark.mllib.linalg.Vectors.dense(featInd.map(r.getDouble(_)).toArray)
    )).take(2).foreach(println)

With a small tweak to the last statement, it works fine:

    val label_data = df.rdd.map(r => org.apache.spark.mllib.regression.LabeledPoint(
      r.getDouble(1), // Get target value
      // Map feature indices to values
      org.apache.spark.mllib.linalg.Vectors.dense(r.getDouble(2), r.getDouble(3))
    )).take(2).foreach(println)

It seems that it cannot cope when the index passed to getDouble is a variable rather than a hard-coded number. Yet, again, this runs perfectly in the spark-shell.

Is this a known issue, or should it be reported as a bug? We are running Spark 2.1.1.

Kind regards

Dr. Marius Vileiniškis
Senior Data Scientist
Danske Group IT Lithuania
A.Goštauto str. 12A (UNIQ), Vilnius
Mobile: +37065315325
m...@danskebank.lt
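P.S. One possible explanation, which is an assumption and has not been confirmed for Toree: a function that refers to a notebook-level val such as targetInd or featInd may capture the whole enclosing cell object, whereas a hard-coded number captures nothing. The sketch below illustrates that general effect in plain Scala with Java serialization, without Spark. CellWrapper, ClosureCaptureSketch and isSerializable are hypothetical names made up for the illustration; they are not part of Toree or Spark.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical stand-in for the object that wraps a notebook cell.
// It is deliberately NOT Serializable.
class CellWrapper {
  val targetInd: Int = 1

  // Refers to the member `targetInd`, so the function captures `this`,
  // i.e. the whole wrapper instance.
  def capturesWrapper: Seq[Double] => Double = row => row(targetInd)

  // Copies the member into a local val first, so only an Int is captured.
  def capturesLocalCopy: Seq[Double] => Double = {
    val idx = targetInd
    row => row(idx)
  }
}

object ClosureCaptureSketch {
  // Returns true if plain Java serialization accepts the object.
  def isSerializable(obj: AnyRef): Boolean =
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      out.writeObject(obj)
      out.close()
      true
    } catch {
      case _: NotSerializableException => false
    }

  def main(args: Array[String]): Unit = {
    val cell = new CellWrapper
    // Closure that drags in the non-serializable wrapper.
    println(isSerializable(cell.capturesWrapper))
    // Closure that only holds a copied Int.
    println(isSerializable(cell.capturesLocalCopy))
  }
}
```

If the same mechanism is at work in the notebook, copying targetInd and featInd into local vals inside a block, right before the call to df.rdd.map, might be worth trying as a workaround. This is only a guess.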