Here's a good article that sums it up, in my opinion: https://ogirardot.wordpress.com/2015/05/29/rdds-are-the-new-bytecode-of-apache-spark/
Basically, building apps with RDDs is like building apps in raw JVM bytecode. Haha.

@richard: remember that even if you're writing your RDD code in Java/Scala, you're still not getting the code-gen/rewrite performance benefits of the Catalyst optimizer.

I agree with @daniel, who suggested that you start with DataFrames and fall back to RDDs only when DataFrames don't give you what you need. The only time I use RDDs directly these days is when I'm dealing with a Spark library that has not yet moved to DataFrames - e.g. GraphX - and it's kind of annoying switching back and forth. Almost everything you need should be in the DataFrame API (the sketch below shows the round trip between the two).

Datasets are similar to RDDs but give you strong compile-time typing on top of tabular structure and Catalyst optimizations. Hopefully Datasets will be the last new API we see from Spark SQL... I'm getting tired of rewriting slides and book chapters! :)
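To make that round trip concrete, here's a minimal sketch - my own illustration, assuming Spark 1.5.x; the Person case class and the sample rows are made up. Going from an RDD of case classes to a DataFrame is automatic, but going back means mapping over generic Rows by hand, which is exactly the asymmetry Richard describes below.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// hypothetical record type, defined top-level so schema inference can reflect on it
case class Person(name: String, age: Int)

object RddVsDataFrame {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-vs-df").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._  // brings in toDF() and the $"col" syntax

    // RDD of case classes -> DataFrame: automatic (schema inferred via reflection)
    val peopleRdd = sc.parallelize(Seq(Person("alice", 30), Person("bob", 25)))
    val peopleDf = peopleRdd.toDF()

    // Catalyst plans and optimizes this; the equivalent RDD filter would not be optimized
    peopleDf.filter($"age" > 26).show()

    // DataFrame -> RDD of case classes: manual - you map over generic Rows yourself
    val backToRdd = peopleDf.rdd.map(row =>
      Person(row.getAs[String]("name"), row.getAs[Int]("age")))
    backToRdd.collect().foreach(println)

    // once 1.6 lands, peopleDf.as[Person] should give you a typed Dataset[Person]
    // over the same Catalyst plan

    sc.stop()
  }
}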
On Mon, Dec 28, 2015 at 4:55 PM, Richard Eggert <richard.egg...@gmail.com> wrote:

> One advantage of RDDs over DataFrames is that RDDs allow you to use your
> own data types, whereas DataFrames are backed by RDDs of Row objects,
> which are pretty flexible but don't give you much in the way of
> compile-time type checking. If you have an RDD of case class elements or
> JSON, then Spark SQL can automatically figure out how to convert it into
> an RDD of Row objects (and therefore a DataFrame), but there's no way to
> automatically go the other way (from DataFrame/Row back to your custom
> types).
>
> In general, you can ultimately do more with RDDs than with DataFrames,
> but DataFrames give you a lot of niceties (automatic query optimization,
> table joins, SQL-like syntax, etc.) for free, and they can avoid some of
> the runtime overhead associated with writing RDD code in a non-JVM
> language (such as Python or R), since the query optimizer effectively
> generates the required JVM code under the hood. There's little to no
> performance benefit if you're already writing Java or Scala code, however
> (and RDD-based code may actually perform better in some cases, if you're
> willing to tune it carefully).
>
> On Mon, Dec 28, 2015 at 3:05 PM, Daniel Siegmann <
> daniel.siegm...@teamaol.com> wrote:
>
>> DataFrames are a higher-level API for working with tabular data - RDDs
>> are used underneath. You can use either and easily convert between them
>> in your code as necessary.
>>
>> DataFrames provide a nice abstraction for many cases, so it may be
>> easier to code against them. Though if you're used to thinking in terms
>> of collections rather than tables, you may find RDDs more natural.
>> DataFrames can also be faster, since Spark will do some optimizations
>> under the hood; if you are using PySpark, this avoids much of the
>> Python-to-JVM overhead. DataFrames may also perform better if you're
>> reading structured data, such as a Hive table or Parquet files.
>>
>> I recommend you prefer DataFrames, switching over to RDDs as necessary
>> (when you need to perform an operation not supported by DataFrames /
>> Spark SQL).
>>
>> HOWEVER (and this is a big one), Spark 1.6 will have yet another API -
>> Datasets. The release of Spark 1.6 is currently being finalized, and I
>> would expect it in the next few days. You will probably want to use the
>> new API once it's available.
>>
>> On Sun, Dec 27, 2015 at 9:18 PM, Divya Gehlot <divya.htco...@gmail.com>
>> wrote:
>>
>>> Hi,
>>> I am a newbie to Spark and a bit confused about RDDs and DataFrames.
>>> Can somebody explain, with use cases, which one to use when?
>>>
>>> Would really appreciate the clarification.
>>>
>>> Thanks,
>>> Divya
>>>
>
> --
> Rich

--
*Chris Fregly*
Principal Data Solutions Engineer
IBM Spark Technology Center, San Francisco, CA
http://spark.tc | http://advancedspark.com