By using DataFrames you do not need to specify RDD operations explicitly;
instead, the operations are built and optimized using the information
available in the DataFrame's schema.
The only drawback I can think of is some loss of generality: given a
DataFrame containing elements of type A, you will not be able to include
elements of type B even if B is a subtype of A. However, in real use cases I
have never run into this problem.
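
To make that concrete, here is a minimal sketch (not from the original
discussion) contrasting the two APIs; the file names "people.txt" and
"people.json" and the `age` column are hypothetical inputs, and `spark` is
just a locally created SparkSession:

    import org.apache.spark.sql.SparkSession

    // Sketch only: "people.txt" (lines of "name,age") and "people.json"
    // (records with an `age` field) are hypothetical inputs.
    val spark = SparkSession.builder().appName("df-vs-rdd").getOrCreate()
    import spark.implicits._

    // RDD version: the lambda is a black box to Spark, so the filter can
    // only be applied record by record with no further optimization.
    val adultsRdd = spark.sparkContext
      .textFile("people.txt")
      .map(_.split(","))
      .filter(fields => fields(1).trim.toInt > 21)

    // DataFrame version: the predicate is a Catalyst expression over the
    // schema, so the optimizer can see it, push it down, and generate
    // efficient code for it.
    val adultsDf = spark.read.json("people.json").filter($"age" > 21)
    adultsDf.explain()   // prints the optimized physical plan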

I once had a related question about RDDs and DataFrames; here is the answer I
got from Michael Armbrust:

> Here is how I view the relationship between the various components of Spark:
>
>  - *RDDs* - a low level API for expressing DAGs that will be executed in
> parallel by Spark workers
>  - *Catalyst* - an internal library for expressing trees that we use to
> build relational algebra and expression evaluation.  There's also an
> optimizer and query planner that turns these logical concepts into RDD
> actions.
>  - *Tungsten* - an internal optimized execution engine that can compile
> catalyst expressions into efficient Java bytecode that operates directly on
> serialized binary data.  It also has nice low level data structures /
> algorithms like hash tables and sorting that operate directly on serialized
> data.  These are used by the physical nodes that are produced by the
> query planner (and run inside of RDD operations on workers).
>  - *DataFrames* - a user facing API that is similar to SQL/LINQ for
> constructing dataflows that are backed by catalyst logical plans
>  - *Datasets* - a user facing API that is similar to the RDD API for
> constructing dataflows that are backed by catalyst logical plans
>
> So everything is still operating on RDDs, but I anticipate most users will
> eventually migrate to the higher level APIs for convenience and automatic
> optimization.
>
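
As an illustration of those last three bullet points (my own sketch, not part
of the quoted email, reusing the `spark` session and hypothetical
"people.json" from the example above), the DataFrame and Dataset APIs express
the same dataflow and are both backed by Catalyst logical plans:

    import spark.implicits._

    // Hypothetical record type matching the JSON schema (Spark reads JSON
    // numbers as Long by default).
    case class Person(name: String, age: Long)

    // DataFrame: untyped, SQL/LINQ-like operations on rows.
    val dfAdults = spark.read.json("people.json")
      .filter($"age" > 21)
      .select($"name")

    // Dataset: the same dataflow through a typed, RDD-like API; both are
    // planned by Catalyst and executed by Tungsten.
    val dsAdults = spark.read.json("people.json")
      .as[Person]
      .filter(_.age > 21)
      .map(_.name)

    // You can still drop down to the underlying RDD when needed.
    val namesRdd = dsAdults.rdd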

Hope that also helps you get an idea of the different concepts and their
potential advantages/drawbacks.
