Hi,

We're using the Apache Toree kernel in Jupyter Notebook to communicate with Spark 
and are running into difficulties with some simple code. The code works from the 
spark-shell but fails in Jupyter Notebook. The idea is simply to take several 
elements from a DataFrame and put them into a LabeledPoint. Whenever the code is 
run from the Jupyter Notebook, it throws an "object not serializable" error. The 
code is as follows:

import scala.util.Random.{setSeed, nextDouble}
setSeed(1)

case class Record (
    foo: Double, target: Double, x1: Double, x2: Double, x3: Double) extends 
Serializable

val rows = sc.parallelize(
    (1 to 10).map(_ => Record(
        nextDouble, nextDouble, nextDouble, nextDouble, nextDouble
    ))
)
val df = spark.sqlContext.createDataFrame(rows)
df.registerTempTable("df")

spark.sqlContext.sql("""
  SELECT ROUND(foo, 2) foo,
         ROUND(target, 2) target,
         ROUND(x1, 2) x1,
         ROUND(x2, 2) x2,
         ROUND(x3, 2) x3
  FROM df""").show

// Or if you want to exclude columns
val ignored = List("foo", "target", "x2")
// Map feature names to indices
val featInd = df.columns.diff(ignored).map(df.columns.indexOf(_))

// Get index of target
val targetInd = df.columns.indexOf("target")
val label_data = df.rdd.map(r => org.apache.spark.mllib.regression.LabeledPoint(
    r.getDouble(targetInd), // Get target value
    // Map feature indices to values
    org.apache.spark.mllib.linalg.Vectors.dense(featInd.map(r.getDouble(_)).toArray)
)).take(2).foreach(println)


With a small tweak to the last part, however, it works fine:


val label_data = df.rdd.map(r => org.apache.spark.mllib.regression.LabeledPoint(
    r.getDouble(1), // Get target value
    // Map feature indices to values
    org.apache.spark.mllib.linalg.Vectors.dense(r.getDouble(2), r.getDouble(3))
)).take(2).foreach(println)

It seems that Toree cannot handle the case where the index passed to getDouble is 
given as a variable rather than a hard-coded number. Yet, again, this runs 
perfectly in the spark-shell. Is this a known issue, or should it be reported as 
a bug?

It's running with Spark 2.1.1.
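For context, here is what we suspect is going on, stripped of Spark entirely: a 
notebook cell compiles to a wrapper object, and a closure that reads one of that 
wrapper's fields captures the whole (non-serializable) wrapper, while a closure 
over a literal or a local copy does not. The names below (NotebookCell, 
captureField, captureLocal) are purely illustrative; this is a sketch of the 
suspected mechanism, not Toree's actual code generation:

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object ClosureCaptureDemo {
  // Stand-in for a REPL/notebook line wrapper; deliberately NOT Serializable.
  class NotebookCell {
    val featInd = Array(2, 3, 4)

    // Reads the field directly: the body compiles to `this.featInd(i)`,
    // so the lambda captures the whole non-serializable cell object.
    def captureField: Int => Int = i => featInd(i)

    // Copies the field into a local val first: the lambda captures
    // only the Array, which is serializable.
    def captureLocal: Int => Int = {
      val local = featInd
      i => local(i)
    }
  }

  // Returns true if `obj` survives Java serialization.
  def serializes(obj: AnyRef): Boolean =
    try {
      new ObjectOutputStream(new ByteArrayOutputStream).writeObject(obj)
      true
    } catch { case _: NotSerializableException => false }

  def main(args: Array[String]): Unit = {
    val cell = new NotebookCell
    println(serializes(cell.captureField)) // false: drags the cell along
    println(serializes(cell.captureLocal)) // true: only the Array is captured
  }
}
```

If this is indeed the cause, copying targetInd and featInd into local vals 
immediately before the map (so the closure captures the copies rather than the 
notebook cell) would be the usual workaround.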


Kind regards
Dr. Marius Vileiniškis
Senior Data Scientist

A.Goštauto str. 12A (UNIQ), Vilnius
Danske Group IT Lithuania
Mobile: +37065315325
m...@danskebank.lt



_______________
Please note that this message may contain confidential information. If you have 
received this message by mistake, please inform the sender of the mistake by 
sending a reply, then delete the message from your system without making, 
distributing or retaining any copies of it. Although we believe that the 
message and any attachments are free from viruses and other errors that might 
affect the computer or IT system where it is received and read, the recipient 
opens the message at his or her own risk. We assume no responsibility for any 
loss or damage arising from the receipt or use of this message.
