Re: RDD and Dataframes

2016-07-15 Thread Taotao.Li
Hi brccosta, Databricks has just posted a blog post about *RDDs, DataFrames
and Datasets*. You can check it here:
https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
I think it will be very helpful for you.

*___*
Quant | Engineer | Boy
*___*
*blog*:http://litaotao.github.io
<http://litaotao.github.io/?utm_source=spark_mail>
*github*: www.github.com/litaotao


On Sat, Jul 16, 2016 at 7:53 AM, RK Aduri  wrote:

> DataFrames use RDDs as the internal implementation of their structure. They
> are not converted to RDDs, but use RDD partitions to produce the logical plan.




Re: RDD and Dataframes

2016-07-15 Thread RK Aduri
DataFrames use RDDs as the internal implementation of their structure. They
are not converted to RDDs, but use RDD partitions to produce the logical plan.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/RDD-and-Dataframes-tp27306p27346.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: RDD and Dataframes

2016-07-07 Thread Bruno Costa
Thank you for the answer.

One of the optimizations of DataFrames/Datasets (beyond Catalyst) is the
Encoders (Project Tungsten), which translate domain objects into Spark's
internal binary format. With encoders, the data is no longer managed by the
Java Virtual Machine, which otherwise increases memory usage through object
metadata and processing time through garbage collection. However, if the
data is converted to an RDD internally, that RDD will also not be managed by
the JVM, is that right? Otherwise, the encoders would not really provide any
optimization...

2016-07-07 9:10 GMT-03:00 Rishi Mishra :

> Yes, in the end it is converted to an RDD internally. However, DataFrame
> queries are passed through Catalyst, which provides several optimizations,
> such as code generation and intelligent shuffling, that pure RDDs do not
> get.


-- 
 Bruno.


Re: RDD and Dataframes

2016-07-07 Thread Rishi Mishra
Yes, in the end it is converted to an RDD internally. However, DataFrame
queries are passed through Catalyst, which provides several optimizations,
such as code generation and intelligent shuffling, that pure RDDs do not
get.

Regards,
Rishitesh Mishra,
SnappyData . (http://www.snappydata.io/)

https://in.linkedin.com/in/rishiteshmishra

On Thu, Jul 7, 2016 at 4:50 PM, brccosta  wrote:

> Dear guys,
>
> I'm investigating the differences between RDDs and DataFrames/Datasets. I
> couldn't find the answer to this question: do DataFrames act as a new layer
> in the Spark stack? I mean, is there a conversion to RDDs during execution?
>
> For example, if I create a DataFrame and run a query, is it transformed
> into an RDD in the final step to be executed in Spark?


RDD and Dataframes

2016-07-07 Thread brccosta
Dear guys,

I'm investigating the differences between RDDs and DataFrames/Datasets. I
couldn't find the answer to this question: do DataFrames act as a new layer
in the Spark stack? I mean, is there a conversion to RDDs during execution?

For example, if I create a DataFrame and run a query, is it transformed
into an RDD in the final step to be executed in Spark?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/RDD-and-Dataframes-tp27306.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: is there any significant performance issue converting between rdd and dataframes in pyspark?

2015-07-02 Thread Davies Liu
On Mon, Jun 29, 2015 at 1:27 PM, Axel Dahl  wrote:
> In pyspark, when I convert from rdds to dataframes it looks like the rdd is
> being materialized/collected/repartitioned before it's converted to a
> dataframe.

That's not true. When converting an RDD to a DataFrame, it only takes a few
rows to infer the types; no other collect or repartition happens.
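
As a rough illustration of that idea, here is a Spark-free sketch in plain
Python. It is not PySpark's actual implementation: `infer_schema`,
`lazy_rows` and `sample_size` are hypothetical names, and a generator stands
in for the RDD. It only shows that types can be inferred from the first few
rows of a lazy source without reading the rest.

```python
from itertools import islice


def infer_schema(rows, sample_size=3):
    """Infer a {column: type name} mapping from the first few rows only."""
    schema = {}
    for row in islice(rows, sample_size):
        for column, value in row.items():
            if value is not None:
                # Keep the first non-null type seen for each column.
                schema.setdefault(column, type(value).__name__)
    return schema


consumed = []  # records which rows were actually read from the source


def lazy_rows(n):
    """A lazy row source standing in for an RDD (assumption for this sketch)."""
    for i in range(n):
        consumed.append(i)
        yield {"id": i, "name": "user%d" % i, "score": i * 0.5}


schema = infer_schema(lazy_rows(1_000_000), sample_size=3)
print(schema)         # {'id': 'int', 'name': 'str', 'score': 'float'}
print(len(consumed))  # 3
```

Only three of the million rows are ever produced, which mirrors the point
above: inferring the types does not require materializing the whole dataset.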

> Just wondering if there are any guidelines for doing this conversion and
> whether it's best to do it early to get the performance benefits of
> dataframes or weigh that against the size/number of items in the rdd.

It's better to do it as early as possible, I think.

> Thanks,
>
> -Axel
>

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



is there any significant performance issue converting between rdd and dataframes in pyspark?

2015-06-29 Thread Axel Dahl
In pyspark, when I convert from rdds to dataframes it looks like the rdd is
being materialized/collected/repartitioned before it's converted to a
dataframe.

Just wondering if there are any guidelines for doing this conversion and
whether it's best to do it early to get the performance benefits of
dataframes or weigh that against the size/number of items in the rdd.

Thanks,

-Axel