:) Just realized you didn't get your original question answered though:
scala> import sqlContext.implicits._
import sqlContext.implicits._

scala> case class Person(age: Long, name: String)
defined class Person

scala> val df = Seq(Person(24, "pedro"), Person(22, "fritz")).toDF()
df: org.apache.spark.sql.DataFrame = [age: bigint, name: string]

scala> df.select("age")
res2: org.apache.spark.sql.DataFrame = [age: bigint]

scala> df.select("age").collect.map(_.getLong(0))
res3: Array[Long] = Array(24, 22)

scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row

scala> df.collect.flatMap {
     |   case Row(age: Long, name: String) => Seq(Tuple1(age))
     |   case _ => Seq()
     | }
res7: Array[(Long,)] = Array((24,), (22,))

These docs are helpful (1.6 docs, but should be similar in 2.0):
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Row

On Tue, Jul 26, 2016 at 7:08 AM, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:

> And Pedro has made sense of a world running amok, scared, and drunken
> stupor.
>
> Regards,
> Gourav
>
> On Tue, Jul 26, 2016 at 2:01 PM, Pedro Rodriguez <ski.rodrig...@gmail.com> wrote:
>
>> I am not 100% sure, as I haven't tried this out, but there is a huge
>> difference between the two. Both foreach and collect are actions,
>> regardless of whether or not the DataFrame is empty.
>>
>> Doing a collect will bring all the results back to the driver, possibly
>> forcing it to run out of memory. foreach will apply your function to each
>> element of the DataFrame, but will do so across the cluster. This behavior
>> is useful when you need to do something custom for each element (perhaps
>> save to a db for which there is no driver, or something custom like making
>> an http request per element -- be careful here, though, due to the
>> per-element overhead).
>>
>> In your example, I am going to assume that hrecords is something like a
>> list buffer. The reason it ends up empty is that each worker is sent its
>> own empty copy of the list (it's captured in the closure for foreach) and
>> appends to that copy. The instance of the list at the driver doesn't know
>> about what happened at the workers, so it stays empty.
>>
>> I don't know why Chanh's comment applies here, since I am guessing the df
>> is not empty.
>>
>> On Tue, Jul 26, 2016 at 1:53 AM, kevin <kiss.kevin...@gmail.com> wrote:
>>
>>> thank you Chanh
>>>
>>> 2016-07-26 15:34 GMT+08:00 Chanh Le <giaosu...@gmail.com>:
>>>
>>>> Hi Ken,
>>>>
>>>> *blacklistDF -> just a DataFrame*
>>>> Spark is lazy: nothing runs until you call an action like *collect,
>>>> take, or write*, at which point it executes the whole process,
>>>> *including any map or filter you applied before the collect*.
>>>> That means until you call collect, Spark *does nothing*, so your df
>>>> would not have any data -> you can't foreach over it.
>>>> Calling collect executes the process -> gets the data -> foreach is ok.
>>>>
>>>> On Jul 26, 2016, at 2:30 PM, kevin <kiss.kevin...@gmail.com> wrote:
>>>>
>>>> blacklistDF.collect()
>>>>
>>
>> --
>> Pedro Rodriguez
>> PhD Student in Distributed Machine Learning | CU Boulder
>> UC Berkeley AMPLab Alumni
>>
>> ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
>> Github: github.com/EntilZha | LinkedIn:
>> https://www.linkedin.com/in/pedrorodriguezscience

--
Pedro Rodriguez
PhD Student in Distributed Machine Learning | CU Boulder
UC Berkeley AMPLab Alumni

ski.rodrig...@gmail.com | pedrorodriguez.io | 909-353-4423
Github: github.com/EntilZha | LinkedIn:
https://www.linkedin.com/in/pedrorodriguezscience
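[Editor's note] The closure-capture pitfall discussed in the thread (appending to a driver-side buffer from inside foreach) can be simulated in plain Scala, with no Spark required. This is only a sketch of the mechanism: the explicit clone() stands in for Spark serializing the closure and deserializing a fresh copy on each executor, and the names simulateForeachOnWorker and hrecords are made up for illustration.

```scala
import scala.collection.mutable.ListBuffer

// Each Spark task gets its *own deserialized copy* of any variable
// captured in the foreach closure. The clone() below simulates that
// serialize/deserialize round trip.
def simulateForeachOnWorker(partition: Seq[Long], captured: ListBuffer[Long]): Unit = {
  val workerCopy = captured.clone() // what an executor actually receives
  partition.foreach(workerCopy += _)
  // workerCopy dies with the task; nothing is sent back to the driver
}

val hrecords = ListBuffer.empty[Long] // lives on the "driver"
simulateForeachOnWorker(Seq(24L, 22L), hrecords)
println(hrecords) // ListBuffer() -- still empty, as in the original question

// The fix suggested in the thread: collect to the driver first, then
// iterate locally. Here a plain Seq stands in for df.collect(...).
val collected = Seq(24L, 22L)
collected.foreach(hrecords += _)
println(hrecords) // ListBuffer(24, 22)
```

The same reasoning is why Spark provides accumulators for the rare cases where per-task side effects really must be aggregated back at the driver.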