Let me share my understanding.

If we view Spark as an analytics OS, the RDD APIs are like OS system calls:
low-level primitives that you invoke from a language like C. The DataFrame and
Dataset APIs are like higher-level programming languages: they hide the
low-level complexity, and the compiler (i.e., Catalyst) optimizes your
programs for you. For most users, the SQL, DataFrame and Dataset APIs are good
enough to satisfy their requirements.
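
To make the analogy concrete, here is a minimal sketch (assuming a
spark-shell session, so sc and sqlContext already exist; the input file and
column names are made up for illustration). The RDD version spells out every
step by hand, while the DataFrame and Dataset versions just declare the
result and let Catalyst plan it:

  import sqlContext.implicits._

  // "System call" level: the RDD API, where every step is hand-coded.
  val rddCounts = sc.textFile("events.txt")      // hypothetical input file
    .map(line => (line.split(",")(0), 1L))       // key by the first field
    .reduceByKey(_ + _)                          // manual aggregation

  // "High-level language" level: the DataFrame API declares *what* we want;
  // Catalyst decides *how* to execute it (and can optimize the plan).
  val df = sc.textFile("events.txt")
    .map(_.split(","))
    .map(a => (a(0), a(1)))
    .toDF("user", "action")
  val dfCounts = df.groupBy("user").count()

  // The Dataset API (still experimental) is typed like the RDD API,
  // but the query is still planned by Catalyst.
  case class Event(user: String, action: String)
  val ds = df.as[Event]

  dfCounts.explain()   // prints the physical plan Catalyst produced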

I would guess that GUI interfaces for Spark will be available in the near
future. Some Spark users (e.g., CEOs) might not need to know what RDDs are at
all; they will be able to analyze their data by clicking a few buttons instead
of writing programs. : )

I hope Spark becomes the most popular analytics OS in the world! : )

Have a good holiday everyone!

Xiao Li



2015-11-23 17:56 GMT-08:00 Jakob Odersky <joder...@gmail.com>:

> Thanks Michael, that helped me a lot!
>
> On 23 November 2015 at 17:47, Michael Armbrust <mich...@databricks.com>
> wrote:
>
>> Here is how I view the relationship between the various components of
>> Spark:
>>
>>  - *RDDs - *a low level API for expressing DAGs that will be executed in
>> parallel by Spark workers
>>  - *Catalyst -* an internal library for expressing trees that we use to
>> build relational algebra and expression evaluation.  There's also an
>> optimizer and query planner that turns these logical concepts into RDD
>> actions.
>>  - *Tungsten -* an internal optimized execution engine that can compile
>> Catalyst expressions into efficient Java bytecode that operates directly on
>> serialized binary data.  It also has nice low-level data structures /
>> algorithms like hash tables and sorting that operate directly on serialized
>> data.  These are used by the physical nodes that are produced by the query
>> planner (and run inside of RDD operations on workers).
>>  - *DataFrames - *a user-facing API that is similar to SQL/LINQ for
>> constructing dataflows that are backed by Catalyst logical plans
>>  - *Datasets* - a user-facing API that is similar to the RDD API for
>> constructing dataflows that are backed by Catalyst logical plans
>>
>> So everything is still operating on RDDs, but I anticipate most users will
>> eventually migrate to the higher-level APIs for convenience and automatic
>> optimization.
>>
>> On Mon, Nov 23, 2015 at 4:18 PM, Jakob Odersky <joder...@gmail.com>
>> wrote:
>>
>>> Hi everyone,
>>>
>>> I'm doing some reading up on the newer features of Spark such as
>>> DataFrames, Datasets and Project Tungsten. This got me a bit confused about
>>> the relation between all these concepts.
>>>
>>> When starting to learn Spark, I read a book and the original paper on
>>> RDDs, which led me to basically think "Spark == RDDs".
>>> Now, looking into DataFrames, I read that they are basically
>>> (distributed) collections with an associated schema, thus enabling
>>> declarative queries and optimization (through Catalyst). I am uncertain how
>>> DataFrames relate to RDDs: are DataFrames transformed to operations on RDDs
>>> once they have been optimized? Or are they completely different concepts?
>>> In case of the latter, do DataFrames still use the Spark scheduler and get
>>> broken down into a DAG of stages and tasks?
>>>
>>> Regarding Project Tungsten, where does it fit in? To my understanding it
>>> is used to efficiently cache data in memory and may also be used to
>>> generate query code for specialized hardware. This sounds as though it
>>> would work on Spark's worker nodes; however, it would also only work with
>>> schema-associated data (i.e., DataFrames), leading me to the conclusion
>>> that RDDs and DataFrames do not share a common backend, which in turn
>>> contradicts my conception of "Spark == RDDs".
>>>
>>> Maybe I missed the obvious, as these questions seem pretty basic; however,
>>> I was unable to find clear answers in the Spark documentation or related
>>> papers and talks. I would greatly appreciate any clarifications.
>>>
>>> thanks,
>>> --Jakob
>>>
>>
>>
>
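
Michael's point above that everything still executes as RDD operations is
easy to check for yourself in spark-shell. A minimal sketch (again assuming
sc and sqlContext are available, with made-up data):

  import sqlContext.implicits._

  val df  = sc.parallelize(Seq(("alice", 3), ("bob", 5))).toDF("user", "clicks")
  val agg = df.groupBy("user").sum("clicks")

  agg.explain(true)            // logical plan -> optimized plan -> physical plan (Catalyst)
  val rows = agg.rdd           // the planned query ultimately runs as an RDD of Rows
  println(rows.toDebugString)  // the familiar RDD lineage, scheduled as stages and tasks on workers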
