Ah. My bad! :)
> On Feb 16, 2016, at 6:24 PM, Mich Talebzadeh <m...@peridale.co.uk> wrote:
>
> Thanks Chandeep.
>
> Andy Grove, the author, pointed to that article in an earlier thread :)
>
>
>
> Dr Mich Talebzadeh
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
> http://talebzadehmich.wordpress.com
>
> NOTE: The information in this email is proprietary and confidential. This
> message is for the designated recipient only, if you are not the intended
> recipient, you should destroy it immediately. Any information in this message
> shall not be understood as given or endorsed by Peridale Technology Ltd, its
> subsidiaries or their employees, unless expressly so stated. It is the
> responsibility of the recipient to ensure that this email is virus free,
> therefore neither Peridale Technology Ltd, its subsidiaries nor their
> employees accept any responsibility.
>
>
> From: Chandeep Singh [mailto:c...@chandeep.com]
> Sent: 16 February 2016 18:17
> To: Mich Talebzadeh <m...@peridale.co.uk>
> Cc: Ashok Kumar <ashok34...@yahoo.com>; User <user@spark.apache.org>
> Subject: Re: Use case for RDD and Data Frame
>
> Here is another interesting post.
>
> http://www.kdnuggets.com/2016/02/apache-spark-rdd-dataframe-dataset.html?utm_content=buffer31ce5&utm_medium=social&utm_source=twitter.com&utm_campaign=buffer
>
>
>> On Feb 16, 2016, at 6:01 PM, Mich Talebzadeh <m...@peridale.co.uk
>> <mailto:m...@peridale.co.uk>> wrote:
>>
>> Hi,
>>
>> A Resilient Distributed Dataset (RDD) is an immutable collection of data
>> partitioned across the nodes of the cluster. It is essentially raw data,
>> with little optimization applied to it. Remember that data is of little
>> value until it is turned into information.
>>
>> On the other hand, a DataFrame is equivalent to a table in an RDBMS, akin
>> to a table in Oracle or Sybase. In other words, it is a two-dimensional,
>> array-like structure in which each column contains measurements on one
>> variable and each row contains one case.
>>
>> So a DataFrame, by virtue of its tabular format, carries additional
>> metadata that the Spark optimizer, AKA Catalyst, can exploit for certain
>> optimizations. Even after so many years, the relational model is arguably
>> the most elegant model known, and it is used and emulated everywhere.
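>>
>> A quick way to see Catalyst at work is to ask a DataFrame for its plans
>> (a minimal sketch for the spark-shell, where sc and sqlContext are
>> predefined; the data and column names are illustrative):
>>
>> val df = sc.parallelize(Seq((1, "a"), (2, "b"))).toDF("id", "name")
>> //explain(true) prints the parsed, analyzed, optimized and physical plans
>> df.filter(df("id") > 1).explain(true)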
>>
>> Much like a table in RDBMS, a DataFrame keeps track of the schema and
>> supports various relational operations that lead to more optimized
>> execution. Essentially each DataFrame object represents a logical plan but
>> because of their "lazy" nature no execution occurs until the user calls a
>> specific "output operation". This is very important to remember. You can go
>> from a DataFrame to an RDD via its rdd method. You can go from an RDD to a
>> DataFrame (if the RDD is in a tabular format) via the toDF method.
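>>
>> As a sketch of going both ways (hypothetical names, again assuming the
>> spark-shell, where the implicits needed by toDF are already imported):
>>
>> case class Person(name: String, age: Int)
>> val peopleRDD = sc.parallelize(Seq(Person("Ann", 30), Person("Bob", 25)))
>> //RDD -> DataFrame: toDF works here because the RDD holds case classes
>> val peopleDF = peopleRDD.toDF()
>> //DataFrame -> RDD: the rdd method returns an RDD[Row]
>> val rowRDD = peopleDF.rdd
>> //nothing executes until an output operation such as count is called
>> println(peopleDF.count())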
>>
>> In general it is recommended to use a DataFrame where possible due to the
>> built in query optimization.
>>
>> For those familiar with SQL, a DataFrame can be conveniently registered as
>> a temporary table, and SQL operations can then be performed on it.
>>
>> Case in point: I am searching all my replication server log files,
>> compressed and stored in an HDFS directory, for errors on a specific
>> connection.
>>
>> //create an RDD from the compressed log file
>> val rdd = sc.textFile("/test/REP_LOG.gz")
>> //convert it to a DataFrame with a single column named "line"
>> val df = rdd.toDF("line")
>> //register the DataFrame as a temporary table
>> df.registerTempTable("t")
>> println("\n Search for ERROR plus another word in table t\n")
>> sql("select * from t WHERE line like '%ERROR%' and line like '%hiveserver2.asehadoop%'").collect().foreach(println)
>>
>> Alternatively, you can use method calls on the DataFrame itself to filter
>> for the word:
>>
>> //col comes from org.apache.spark.sql.functions
>> df.filter(col("line").like("%ERROR%")).collect.foreach(println)
>>
>> HTH,
>>
>> Dr Mich Talebzadeh
>>
>> LinkedIn
>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> From: Ashok Kumar [mailto:ashok34...@yahoo.com.INVALID
>> <mailto:ashok34...@yahoo.com.INVALID>]
>> Sent: 16 February 2016 16:06
>> To: User <user@spark.apache.org <mailto:user@spark.apache.org>>
>> Subject: Use case for RDD and Data Frame
>>
>> Gurus,
>>
>> What are the main differences between a Resilient Distributed Dataset (RDD)
>> and a Data Frame (DF)?
>>
>> Where can one use an RDD without transforming it to a DF?
>>
>> Regards and obliged