Ah. My bad! :)

> On Feb 16, 2016, at 6:24 PM, Mich Talebzadeh <m...@peridale.co.uk> wrote:
> 
> Thanks Chandeep.
>  
> Andy Grove, the author, pointed to that article in an earlier thread :)
>  
>  
>  
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
>  
> From: Chandeep Singh [mailto:c...@chandeep.com] 
> Sent: 16 February 2016 18:17
> To: Mich Talebzadeh <m...@peridale.co.uk>
> Cc: Ashok Kumar <ashok34...@yahoo.com>; User <user@spark.apache.org>
> Subject: Re: Use case for RDD and Data Frame
>  
> Here is another interesting post.
>  
> http://www.kdnuggets.com/2016/02/apache-spark-rdd-dataframe-dataset.html?utm_content=buffer31ce5&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer
>  
>> On Feb 16, 2016, at 6:01 PM, Mich Talebzadeh <m...@peridale.co.uk> wrote:
>>  
>> Hi,
>>  
>> A Resilient Distributed Dataset (RDD) is a collection of data partitioned 
>> across the nodes of the cluster. It is essentially raw data, with little 
>> optimization applied to it. Remember, data is of little value until it is 
>> turned into information.
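>>  
>> For instance (a minimal sketch, assuming a spark-shell session where sc is 
>> the SparkContext), an RDD is just a distributed collection of opaque 
>> elements with no schema attached:
>>  
>> // Spark sees only raw strings here; it knows nothing about their structure
>> val raw = sc.parallelize(Seq("a,1", "b,2", "c,3"))
>> // transformations operate element by element, with no query optimization
>> raw.map(_.split(",")(0)).collect().foreach(println)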
>>  
>> A DataFrame, on the other hand, is analogous to a table in an RDBMS such as 
>> Oracle or Sybase: a two-dimensional, array-like structure in which each 
>> column contains measurements on one variable and each row contains one case.
>>  
>> So a DataFrame by definition carries additional metadata because of its 
>> tabular format, which allows Spark's optimizer, known as Catalyst, to take 
>> advantage of that structure for certain optimizations. Even after so many 
>> years, the relational model arguably remains the most elegant data model in 
>> widespread use, emulated everywhere.
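>>  
>> As a quick sketch of this in action (assuming a spark-shell session, so sc 
>> and sqlContext are available; the column names are purely illustrative), you 
>> can ask Spark to print the plans Catalyst derives from the schema:
>>  
>> import sqlContext.implicits._
>> // a tiny DataFrame with a known schema
>> val people = sc.parallelize(Seq(("alice", 30), ("bob", 25))).toDF("name", "age")
>> // extended = true prints the parsed, analyzed and optimized logical plans
>> // plus the physical plan that Catalyst produces
>> people.filter(people("age") > 28).explain(true)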
>>  
>> Much like a table in an RDBMS, a DataFrame keeps track of the schema and 
>> supports various relational operations that lead to more optimized 
>> execution. Essentially, each DataFrame object represents a logical plan, but 
>> because of its "lazy" nature no execution occurs until the user calls a 
>> specific "output operation". This is very important to remember. You can go 
>> from a DataFrame to an RDD via its rdd method, and from an RDD to a 
>> DataFrame (if the RDD is in a tabular format) via the toDF method.
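>>  
>> A minimal round-trip sketch (again assuming spark-shell; the case class 
>> name is illustrative):
>>  
>> import sqlContext.implicits._
>> case class Entry(line: String)
>> // RDD -> DataFrame: toDF works because Entry gives the data a tabular shape
>> val entries = sc.parallelize(Seq(Entry("ok"), Entry("ERROR: connection lost")))
>> val entriesDF = entries.toDF()
>> // DataFrame -> RDD: rdd returns an RDD[Row]; nothing executes yet (lazy)
>> val rows = entriesDF.rdd
>> // only an output operation such as collect triggers execution
>> rows.collect().foreach(println)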
>>  
>> In general it is recommended to use a DataFrame where possible due to the 
>> built-in query optimization.
>>  
>> For those familiar with SQL, a DataFrame can be conveniently registered as a 
>> temporary table and SQL operations can be performed on it.
>>  
>> Case in point: I am searching my replication server log files, compressed 
>> and stored in an HDFS directory, for errors on a specific connection.
>>  
>> // create an RDD from the compressed log file in HDFS
>> val rdd = sc.textFile("/test/REP_LOG.gz")
>> // convert it to a DataFrame with a single column named "line"
>> // (this import is done automatically in spark-shell)
>> import sqlContext.implicits._
>> val df = rdd.toDF("line")
>> // register the DataFrame as a temporary table
>> df.registerTempTable("t")
>> println("\n Search for ERROR plus another word in table t\n")
>> sqlContext.sql("select * from t WHERE line like '%ERROR%' and line like '%hiveserver2.asehadoop%'").collect().foreach(println)
>>  
>> Alternatively you can use method calls on the DataFrame itself to filter 
>> for the word:
>>  
>> import org.apache.spark.sql.functions.col
>> df.filter(col("line").like("%ERROR%")).collect.foreach(println)
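>>  
>> Both forms go through the same Catalyst optimizer, so the SQL version and 
>> the method-call version should produce equivalent plans; the choice is 
>> largely a matter of style.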
>>  
>> HTH,
>>  
>> Dr Mich Talebzadeh
>>  
>> LinkedIn  
>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>  
>> http://talebzadehmich.wordpress.com
>>  
>> From: Ashok Kumar [mailto:ashok34...@yahoo.com.INVALID] 
>> Sent: 16 February 2016 16:06
>> To: User <user@spark.apache.org>
>> Subject: Use case for RDD and Data Frame
>>  
>> Gurus,
>>  
>> What are the main differences between a Resilient Distributed Dataset (RDD) 
>> and a DataFrame (DF)?
>>  
>> Where can one use an RDD without transforming it into a DF?
>>  
>> Regards and obliged
