Hi,
1. You are doing some analytics I guess? *Yes.*

2. It is almost impossible to guess what is happening, except that you are looping 50 times over the same set of SQL? *I am not looping any SQL. Each SQL is called exactly once, and each one requires the output of the previous SQL.*

3. Your SQL step n depends on step n-1, so Spark cannot get rid of steps 1 through n. *True. But if I have N SQLs and every i-th SQL depends on the (i-1)-th, how does Spark optimize the memory? Is it that for every i-th SQL it starts execution again from stage 0?*

4. You are not storing anything in memory (no cache, no persist), so all memory is used for the execution. *If Spark is not storing anything in memory, then when it executes the i-th SQL does it start execution from stage 0, i.e. from the file read?*

5. What happens when you run it only once? How much memory is used (look at the UI page, 4040 by default)? *I checked the Spark UI DAG: so many file reads. Why?*

6. What Spark mode is being used (Local, Standalone, Yarn)? *Yarn.*

7. OOM could be anything depending on how much you are allocating to your driver memory in spark-submit. *Driver and executor memory are both set to 4 GB, the input data size is less than 1 GB, and the number of executors is 5.*

*I am still a bit confused about Spark's execution plan for multiple SQLs with only one action. Is it executing each SQL separately and trying to store the intermediate results in memory, and is that what is causing the OOM/GC overhead?*

*And my questions still are:*

*1. Will Spark optimize multiple SQL queries into one single physical plan, which would at least not execute the same stage twice and would read each file only once?*
*2. In the DAG I can see a lot of file reads and a lot of stages. Why? I only called an action once. Why, in multiple stages, is Spark starting again from the file read?*
*3. Will every SQL execute, with its intermediate result stored in memory?*
*4. What is causing the OOM and GC overhead here?*
*5. What optimization could be taken care of here?*
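*To make question 2 concrete, here is a minimal sketch (Spark 1.5-era API; the input path, table names, and queries are made up for illustration) of why the same file scan can show up more than once in the DAG, and how persist() changes that. Within a single action, Spark plans each reference to an uncached temp table independently, so a table referenced twice is scanned twice:*

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.storage.StorageLevel

object ChainedSqlSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("chained-sql-sketch"))
    val sqlContext = new SQLContext(sc)

    // Hypothetical input; any DataFrame source behaves the same way.
    val base = sqlContext.read.parquet("/data/input")
    base.registerTempTable("base")

    val step1 = sqlContext.sql(
      "SELECT key, SUM(value) AS total FROM base GROUP BY key")
    step1.registerTempTable("step1")

    // step1 is referenced twice in the join below. Without this persist,
    // both references are planned from scratch, so the DAG shows base's
    // file scan (and the aggregation) twice. With it, Spark substitutes
    // the cached relation into both references.
    step1.persist(StorageLevel.MEMORY_AND_DISK)

    val step2 = sqlContext.sql(
      """SELECT a.key, a.total, b.total AS other_total
        |FROM step1 a JOIN step1 b ON a.key = b.key""".stripMargin)

    // explain(true) prints the logical and physical plans: the quickest
    // way to count how many scans a query actually contains.
    step2.explain(true)

    step2.write.parquet("/data/output") // the single action
  }
}
```

*Whether this matches the 50-step job depends on the actual queries; calling explain() on the final DF will show how deep the single physical plan has become.*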
On Sat, Sep 10, 2016 at 11:35 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> On 10 September 2016 at 06:21, Rabin Banerjee <dev.rabin.baner...@gmail.com> wrote:
>
>> Hi All,
>>
>> I am writing and executing a Spark batch program which only uses Spark SQL, but it is taking a lot of time and finally failing with GC overhead errors.
>>
>> Here is the program:
>>
>> 1. Read 3 files (one medium-sized and 2 small) and register them as DFs.
>> 2. Fire a SQL with complex aggregation and windowing, and register the result as a DF.
>> 3. ......... Repeat step 2 almost 50 times, so 50 SQLs.
>> 4. All SQLs are sequential, i.e. the next step requires the previous step's result.
>> 5. Finally, save the final DF (this is the only action called).
>>
>> Note:
>>
>> 1. I haven't persisted the intermediate DFs, as I think Spark will optimize the multiple SQLs into one physical execution plan.
>> 2. Executor memory and driver memory are set to 4 GB, which is quite high as the data size is in MB.
>>
>> Questions:
>>
>> 1. Will Spark optimize multiple SQL queries into one single physical plan?
>> 2. In the DAG I can see a lot of file reads and a lot of stages. Why? I only called an action once.
>> 3. Will every SQL execute, with its intermediate result stored in memory?
>> 4. What is causing the OOM and GC overhead here?
>> 5. What optimization could be taken care of here?
>>
>> Spark version 1.5.x
>>
>> Thanks in advance,
>> Rabin
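For reference, a hedged sketch of the pipeline shape described in the quoted message (placeholder paths and a trivial placeholder query, not the actual aggregations), showing one common mitigation: every few steps, persist and materialize the intermediate DF so later steps start from the cached relation instead of from a plan that reaches all the way back to the file scans. Spark 1.5 has no DataFrame checkpoint, so persist() plus a count() is the usual stand-in; otherwise the chained-but-lazy design produces one very deep plan whose planning and recomputation cost may be part of the GC overhead described.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.storage.StorageLevel

object FiftyStepPipeline {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("fifty-step-pipeline"))
    val sqlContext = new SQLContext(sc)

    // Placeholder inputs standing in for the one medium and two small files.
    sqlContext.read.parquet("/data/medium").registerTempTable("step0")
    sqlContext.read.parquet("/data/small1").registerTempTable("small1")
    sqlContext.read.parquet("/data/small2").registerTempTable("small2")

    var lastCached: DataFrame = null
    for (i <- 1 to 50) {
      // Placeholder query; in the real job each step is a different
      // aggregation/window over the previous step's temp table.
      val df = sqlContext.sql(s"SELECT * FROM step${i - 1}")
      df.registerTempTable(s"step$i")

      if (i % 10 == 0) {
        // Cache and force materialization so later steps read cached
        // blocks; this caps how deep any single physical plan can get.
        df.persist(StorageLevel.MEMORY_AND_DISK)
        df.count()
        // Drop the previous cut point; df itself is already materialized.
        if (lastCached != null) lastCached.unpersist()
        lastCached = df
      }
    }

    sqlContext.table("step50").write.parquet("/data/final") // the final save
  }
}
```

The interval of 10 is arbitrary; the trade-off is cache memory versus plan depth, and the Spark UI's storage and stage pages are the place to check which side is hurting.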