I'll throw a thought in here. DataFrames are nice if your data is uniform and clean, with a consistent schema. However, in many big data problems this is seldom the case.
Sent from my Verizon Wireless 4G LTE smartphone

-------- Original message --------
From: Chris Fregly <ch...@fregly.com>
Date: 12/28/2015 5:22 PM (GMT-05:00)
To: Richard Eggert <richard.egg...@gmail.com>
Cc: Daniel Siegmann <daniel.siegm...@teamaol.com>, Divya Gehlot <divya.htco...@gmail.com>, "user @spark" <user@spark.apache.org>
Subject: Re: DataFrame Vs RDDs ... Which one to use When ?

here's a good article that sums it up, in my opinion:

https://ogirardot.wordpress.com/2015/05/29/rdds-are-the-new-bytecode-of-apache-spark/

basically, building apps with RDDs is like building apps with primitive JVM bytecode. haha.

@richard: remember that even if you're currently writing RDDs in Java/Scala, you're not gaining the code gen/rewrite performance benefits of the Catalyst optimizer.

i agree with @daniel who suggested that you start with DataFrames and revert to RDDs only when DataFrames don't give you what you need. the only time i use RDDs directly these days is when i'm dealing with a Spark library that has not yet moved to DataFrames - i.e., GraphX - and it's kind of annoying switching back and forth. almost everything you need should be in the DataFrame API.

Datasets are similar to RDDs, but give you strong compile-time typing, tabular structure, and Catalyst optimizations.

hopefully Datasets is the last API we see from Spark SQL... i'm getting tired of re-writing slides and book chapters! :)

On Mon, Dec 28, 2015 at 4:55 PM, Richard Eggert <richard.egg...@gmail.com> wrote:

One advantage of RDDs over DataFrames is that RDDs allow you to use your own data types, whereas DataFrames are backed by RDDs of Row objects, which are pretty flexible but don't give you much in the way of compile-time type checking. If you have an RDD of case class elements or JSON, then Spark SQL can automatically figure out how to convert it into an RDD of Row objects (and therefore a DataFrame), but there's no way to automatically go the other way (from DataFrame/Row back to your custom types).

In general, you can ultimately do more with RDDs than with DataFrames, but DataFrames give you a lot of niceties (automatic query optimization, table joins, SQL-like syntax, etc.) for free, and they can avoid some of the runtime overhead associated with writing RDD code in a non-JVM language (such as Python or R), since the query optimizer effectively generates the required JVM code under the hood. There's little to no performance benefit if you're already writing Java or Scala code, however (and RDD-based code may actually perform better in some cases, if you're willing to carefully tune it).

On Mon, Dec 28, 2015 at 3:05 PM, Daniel Siegmann <daniel.siegm...@teamaol.com> wrote:

DataFrames are a higher-level API for working with tabular data - RDDs are used underneath. You can use either and easily convert between them in your code as necessary.

DataFrames provide a nice abstraction for many cases, so it may be easier to code against them. Though if you're used to thinking in terms of collections rather than tables, you may find RDDs more natural.

DataFrames can also be faster, since Spark will do some optimizations under the hood - if you are using PySpark, this avoids the overhead of shuttling data back and forth between Python and the JVM. DataFrames may also perform better if you're reading structured data, such as a Hive table or Parquet files.

I recommend you prefer DataFrames, switching over to RDDs as necessary (when you need to perform an operation not supported by DataFrames / Spark SQL).
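To make that conversion concrete, here is a minimal sketch of the round trip Richard and Daniel describe, written against the Spark 1.5/1.6-era SQLContext API; the Person case class, app name, and column names are just illustrative:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class Person(name: String, age: Int)

val sc = new SparkContext(new SparkConf().setAppName("rdd-vs-df").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._  // brings toDF() and the $"col" syntax into scope

// RDD of case classes -> DataFrame: Spark SQL infers the schema via reflection
val peopleRDD = sc.parallelize(Seq(Person("Alice", 29), Person("Bob", 31)))
val peopleDF  = peopleRDD.toDF()
peopleDF.filter($"age" > 30).show()

// DataFrame -> RDD hands back Rows, not Persons; mapping back to the custom type is manual
val peopleAgain = peopleDF.rdd.map(row => Person(row.getAs[String]("name"), row.getAs[Int]("age")))

The forward direction is automatic; the reverse direction is the hand-written map Richard mentions.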
HOWEVER (and this is a big one): Spark 1.6 will have yet another API - Datasets. The Spark 1.6 release is currently being finalized, and I would expect it in the next few days. You will probably want to use the new API once it's available.

On Sun, Dec 27, 2015 at 9:18 PM, Divya Gehlot <divya.htco...@gmail.com> wrote:

Hi,

I am a newbie to Spark and a bit confused about RDDs and DataFrames. Can somebody explain to me, with use cases, which one to use when? Would really appreciate the clarification.

Thanks,
Divya

--
Rich

--
Chris Fregly
Principal Data Solutions Engineer
IBM Spark Technology Center, San Francisco, CA
http://spark.tc | http://advancedspark.com
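For reference, a minimal sketch of the Dataset API the thread anticipates, written against the Spark 1.6 preview (the experimental API may shift; Person is the same illustrative case class and SQLContext as in the earlier sketch):

import sqlContext.implicits._  // assumes the SQLContext and Person case class from above

// Typed Dataset from a local collection; Spark derives an Encoder[Person] behind the scenes
val peopleDS = Seq(Person("Alice", 29), Person("Bob", 31)).toDS()

// Compile-time-typed transformation: _.age is a checked field access, not a string column name
val adults = peopleDS.filter(_.age > 30)
adults.collect().foreach(println)

// Datasets interoperate with DataFrames in both directions
val asDF = peopleDS.toDF()   // typed -> untyped (Rows)
val asDS = asDF.as[Person]   // untyped -> typed again, checked against the case class

Compared with the DataFrame version, the filter predicate is ordinary Scala against Person fields, so typos are caught at compile time, while Catalyst still plans and optimizes the query underneath.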