Here's a good article that sums it up, in my opinion: https://ogirardot.wordpress.com/2015/05/29/rdds-are-the-new-bytecode-of-apache-spark/
Basically, building apps with RDDs is like building apps in raw JVM bytecode. Haha.

@richard: remember that even if you're writing your RDD code in Java/Scala, you're still not getting the code-gen/rewrite performance benefits of the Catalyst optimizer.

I agree with @daniel, who suggested that you start with DataFrames and fall back to RDDs only when DataFrames don't give you what you need. The only time I use RDDs directly these days is when I'm dealing with a Spark library that has not yet moved to DataFrames - e.g. GraphX - and it's kind of annoying switching back and forth. Almost everything you need should be in the DataFrame API (the sketch below shows the round trip between the two).

Datasets are similar to RDDs but give you strong compile-time typing on top of tabular structure and Catalyst optimizations. Hopefully Datasets will be the last new API we see from Spark SQL... I'm getting tired of rewriting slides and book chapters! :)
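To make that round trip concrete, here's a minimal sketch - my own illustration, assuming Spark 1.5.x; the Person case class and the sample rows are made up. Going from an RDD of case classes to a DataFrame is automatic, but going back means mapping over generic Rows by hand, which is exactly the asymmetry Richard describes below.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// hypothetical record type, defined top-level so schema inference can reflect on it
case class Person(name: String, age: Int)

object RddVsDataFrame {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-vs-df").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._  // brings in toDF() and the $"col" syntax

    // RDD of case classes -> DataFrame: automatic (schema inferred via reflection)
    val peopleRdd = sc.parallelize(Seq(Person("alice", 30), Person("bob", 25)))
    val peopleDf = peopleRdd.toDF()

    // Catalyst plans and optimizes this; the equivalent RDD filter would not be optimized
    peopleDf.filter($"age" > 26).show()

    // DataFrame -> RDD of case classes: manual - you map over generic Rows yourself
    val backToRdd = peopleDf.rdd.map(row =>
      Person(row.getAs[String]("name"), row.getAs[Int]("age")))
    backToRdd.collect().foreach(println)

    // once 1.6 lands, peopleDf.as[Person] should give you a typed Dataset[Person]
    // over the same Catalyst plan

    sc.stop()
  }
}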
On Mon, Dec 28, 2015 at 4:55 PM, Richard Eggert <richard.egg...@gmail.com> wrote:

> One advantage of RDDs over DataFrames is that RDDs allow you to use your
> own data types, whereas DataFrames are backed by RDDs of Row objects,
> which are pretty flexible but don't give you much in the way of
> compile-time type checking. If you have an RDD of case class elements or
> JSON, then Spark SQL can automatically figure out how to convert it into
> an RDD of Row objects (and therefore a DataFrame), but there's no way to
> automatically go the other way (from DataFrame/Row back to your custom
> types).
>
> In general, you can ultimately do more with RDDs than with DataFrames,
> but DataFrames give you a lot of niceties (automatic query optimization,
> table joins, SQL-like syntax, etc.) for free, and they can avoid some of
> the runtime overhead associated with writing RDD code in a non-JVM
> language (such as Python or R), since the query optimizer effectively
> generates the required JVM code under the hood. There's little to no
> performance benefit if you're already writing Java or Scala code, however
> (and RDD-based code may actually perform better in some cases, if you're
> willing to tune it carefully).
>
> On Mon, Dec 28, 2015 at 3:05 PM, Daniel Siegmann <
> daniel.siegm...@teamaol.com> wrote:
>
>> DataFrames are a higher-level API for working with tabular data - RDDs
>> are used underneath. You can use either and easily convert between them
>> in your code as necessary.
>>
>> DataFrames provide a nice abstraction for many cases, so it may be
>> easier to code against them. Though if you're used to thinking in terms
>> of collections rather than tables, you may find RDDs more natural.
>> DataFrames can also be faster, since Spark will do some optimizations
>> under the hood; if you are using PySpark, this avoids much of the
>> Python-to-JVM overhead. DataFrames may also perform better if you're
>> reading structured data, such as a Hive table or Parquet files.
>>
>> I recommend you prefer DataFrames, switching over to RDDs as necessary
>> (when you need to perform an operation not supported by DataFrames /
>> Spark SQL).
>>
>> HOWEVER (and this is a big one), Spark 1.6 will have yet another API -
>> Datasets. The release of Spark 1.6 is currently being finalized, and I
>> would expect it in the next few days. You will probably want to use the
>> new API once it's available.
>>
>> On Sun, Dec 27, 2015 at 9:18 PM, Divya Gehlot <divya.htco...@gmail.com>
>> wrote:
>>
>>> Hi,
>>> I am a newbie to Spark and a bit confused about RDDs and DataFrames.
>>> Can somebody explain, with use cases, which one to use when?
>>>
>>> Would really appreciate the clarification.
>>>
>>> Thanks,
>>> Divya
>>>
>
> --
> Rich

--
*Chris Fregly*
Principal Data Solutions Engineer
IBM Spark Technology Center, San Francisco, CA
http://spark.tc | http://advancedspark.com