From:  Mich Talebzadeh <m...@peridale.co.uk>
Date:  Thursday, February 11, 2016 at 2:30 PM
To:  "user @spark" <user@spark.apache.org>
Subject:  Question on Spark architecture and DAG

> Hi,
> 
> I have used Hive on the Spark engine (with Hive tables, of course) and it is
> pretty impressive compared with Hive on the MR engine.
> 
>  
> 
> Let us assume that I use spark-shell. Spark-shell is a client that connects
> to the Spark master running on a host and port, like below:
> 
> spark-shell --master spark://50.140.197.217:7077
> 
> Once I connect, I create an RDD to read a text file:
> 
> val oralog = sc.textFile("/test/alert_mydb.log")
> 
> I then search for the word "Errors" in that file:
> 
> oralog.filter(line => line.contains("Errors")).collect().foreach(line =>
> println(line))
> 
>  
> 
> Questions:
> 
>  
> 1. In order to display the lines (the result set) containing the word
> "Errors", the contents of the file (i.e. the blocks on HDFS) need to be read
> into memory. Is my understanding correct that, as per the RDD notes, those
> blocks from the file will be partitioned across the cluster and each node
> will hold its share of blocks in memory?


Typically, results are written to disk. For example, look at
rdd.saveAsTextFile(). You can also use collect() to copy the RDD data into
the driver's local memory, but you need to be careful that all of the data
will fit in that memory.
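
As a rough sketch in spark-shell terms, reusing the oralog RDD from the
example above (the output path below is just a placeholder):

// Keep the filter on the executors; count() returns a single Long
// to the driver, so it is always safe.
val errors = oralog.filter(line => line.contains("Errors"))
errors.count()

// Write the full result set from the executors to HDFS; nothing
// is shipped through the driver.
errors.saveAsTextFile("/test/alert_mydb_errors")

// collect() pulls every matching line into the driver's local memory:
// fine for a handful of lines, risky for large result sets.
errors.collect().foreach(println)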

> 2. Once the result set is returned, it needs to be sent to the client that
> made the connection to the master. I guess this is a simple TCP operation,
> much like any relational database sending a result set back?


I run several Spark Streaming apps. One collects data, does some cleanup,
and publishes the results to downstream systems using ActiveMQ. Some of our
other apps just write to a socket.
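
The downstream step can be as simple as a socket write. As a rough sketch
(the host names and ports are placeholders, and I am showing a plain socket
rather than ActiveMQ):

import java.io.PrintWriter
import java.net.Socket
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))
val lines = ssc.socketTextStream("source-host", 9999)

lines.filter(_.contains("Errors")).foreachRDD { rdd =>
  // Open one connection per partition on the executors,
  // rather than one per record.
  rdd.foreachPartition { part =>
    val sock = new Socket("downstream-host", 4444)
    val out = new PrintWriter(sock.getOutputStream, true)
    part.foreach(out.println)
    out.close()
    sock.close()
  }
}

ssc.start()
ssc.awaitTermination()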

> 3. Once the results are returned, if no request has been made to keep the
> data in memory, will those blocks in memory be discarded?

There are a couple of things to consider. For example, when your batch job
completes, all of its memory is returned. Programmatically, you can make an
RDD persistent or cause it to be cached in memory.
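
To illustrate the latter, again with the example RDD (the storage level
shown is just one option):

import org.apache.spark.storage.StorageLevel

// cache() is shorthand for persist(StorageLevel.MEMORY_ONLY): partitions
// are kept in executor memory once an action has computed them.
val errors = oralog.filter(_.contains("Errors")).cache()
errors.count()   // first action reads from HDFS and fills the cache
errors.count()   // second action is served from memory

// Alternatively, pick a level explicitly, e.g. to spill to disk when
// executor memory is tight:
//   rdd.persist(StorageLevel.MEMORY_AND_DISK)

// Release the memory once the RDD is no longer needed.
errors.unpersist()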

> 4. Regardless of the storage block size on disk (128MB, 256MB, etc.), memory
> pages are typically 2K in relational databases. Is this the case in Spark as
> well?
> Thanks,
> 
>  Mich Talebzadeh
> 
>  
> LinkedIn:
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> 
> http://talebzadehmich.wordpress.com
>  

