Right, let us simplify this.

Can you run the whole thing *once* only and send the DAG execution output
from the UI?

You can use the Snipping Tool to capture the image.
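
Alternatively, if a screenshot is awkward, you can print the plans as text.
A minimal sketch, assuming your final DataFrame is called finalDf:

    // prints the parsed, analyzed and optimized logical plans plus the
    // physical plan; repeated file scans will show up here
    finalDf.explain(true)

That will show whether the same file reads appear more than once in the
physical plan.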

HTH


Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 10 September 2016 at 09:59, Rabin Banerjee <dev.rabin.baner...@gmail.com>
wrote:

> Hi,
>
>
>    1. You are doing some analytics I guess?  *YES*
>    2. It is almost impossible to guess what is happening, except that you
>    are looping 50 times over the same set of SQL?  *I am not looping any
>    SQL. Each SQL is called exactly once, and each requires the output of
>    the previous SQL.*
>    3. Your SQL step n depends on step n-1, so Spark cannot get rid of
>    steps 1 through n-1. *TRUE, but if I have N SQLs and every i-th SQL
>    depends on the (i-1)-th, how does Spark optimize the memory? Will each
>    i-th SQL start execution from stage 0?*
>    4. You are not storing anything in memory (no cache, no persist), so
>    all memory is used for the execution. *If Spark is not storing
>    anything in memory, then when it executes the i-th SQL will it start
>    execution from stage 0, i.e. from the file read? (See the sketch
>    below.)*
>    5. What happens when you run it only once? How much memory is used
>    (look at the UI page, 4040 by default)?  *I checked the DAG in the
>    Spark UI; there are so many file reads. Why?*
>    6. What Spark mode is being used (Local, Standalone, Yarn)?  *Yarn*
>    7. OOM could be anything depending on how much you are allocating to
>    your driver memory in spark-submit. *Driver and executor memory are
>    set to 4 GB, the input data size is less than 1 GB, and the number of
>    executors is 5.*
>
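> *For point 4: if nothing is stored, would persisting each intermediate DF
> stop the recomputation from stage 0? A minimal sketch of what I mean
> (table and column names are made up):*
>
>    import org.apache.spark.storage.StorageLevel
>
>    // one of the ~50 intermediate SQLs
>    val step1 = sqlContext.sql("SELECT key, SUM(value) AS value FROM t0 GROUP BY key")
>    // keep the result so later SQLs do not recompute it from the file read
>    step1.persist(StorageLevel.MEMORY_AND_DISK)
>    step1.registerTempTable("t1")
>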
> *I am still a bit confused about Spark's execution plan for multiple SQLs
> with only one action.*
>
> *Is it executing each SQL separately and trying to store the intermediate
> results in memory, which is causing the OOM/GC overhead?*
> *And still my questions are ...*
>
> *1. Will Spark optimize multiple SQL queries into one single physical
> plan, which will at least not execute the same stage twice and will read
> each file only once?*
> *2. In the DAG I can see a lot of file reads and a lot of stages. Why? I
> only called an action once. Why is Spark starting from the file read
> again in multiple stages?*
> *3. Will every SQL execute, with its intermediate result stored in memory?
> (See the check sketched below.)*
> *4. What is causing the OOM and GC overhead here?*
> *5. What optimizations could be applied here?*
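>
> *Regarding question 3: as far as I can tell nothing should be cached,
> since I never call cacheTable or persist. E.g. this check (table name is
> made up) should return false for my intermediate tables:*
>
>    // true only if cacheTable() or persist() was called for the table
>    println(sqlContext.isCached("t1"))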
>
>
> On Sat, Sep 10, 2016 at 11:35 AM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Hi
>>
>>    1. You are doing some analytics I guess?
>>    2. It is almost impossible to guess what is happening, except that you
>>    are looping 50 times over the same set of SQL?
>>    3. Your SQL step n depends on step n-1, so Spark cannot get rid of
>>    steps 1 through n-1.
>>    4. You are not storing anything in memory (no cache, no persist), so
>>    all memory is used for the execution.
>>    5. What happens when you run it only once? How much memory is used
>>    (look at the UI page, 4040 by default)?
>>    6. What Spark mode is being used (Local, Standalone, Yarn)?
>>    7. OOM could be anything depending on how much you are allocating to
>>    your driver memory in spark-submit.
>>
>> HTH
>>
>>
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 10 September 2016 at 06:21, Rabin Banerjee <
>> dev.rabin.baner...@gmail.com> wrote:
>>
>>> Hi All,
>>>
>>>  I am writing and executing a Spark batch program which only uses
>>> Spark SQL, but it is taking a lot of time and finally failing with GC
>>> overhead.
>>>
>>> Here is the program:
>>>
>>> 1. Read 3 files, one medium-sized and 2 small ones, and register them as
>>> DFs.
>>> 2. Fire a SQL with complex aggregation and windowing, and register the
>>> result as a DF.
>>>
>>> 3. Repeat step 2 almost 50 times, so 50 SQLs.
>>>
>>> 4. All SQLs are sequential, i.e. the next step requires the previous
>>> step's result.
>>>
>>> 5. Finally save the final DF. (This is the only action called; a
>>> stripped-down sketch of the pattern follows.)
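>>>
>>> A stripped-down sketch of the pattern (paths, formats, table and column
>>> names are made up; the real SQLs have complex aggregation and windowing):
>>>
>>>     val big    = sqlContext.read.parquet("/data/medium")
>>>     val small1 = sqlContext.read.parquet("/data/small1")
>>>     val small2 = sqlContext.read.parquet("/data/small2")
>>>     big.registerTempTable("t0")
>>>     small1.registerTempTable("s1")
>>>     small2.registerTempTable("s2")
>>>
>>>     // each step's SQL reads the temp table registered by the prior step
>>>     val step1 = sqlContext.sql("SELECT key, SUM(value) AS value FROM t0 GROUP BY key")
>>>     step1.registerTempTable("t1")
>>>     // ... repeated almost 50 times: t2, t3, ... ...
>>>
>>>     // the only action in the whole job
>>>     sqlContext.sql("SELECT * FROM t49").write.parquet("/data/out")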
>>>
>>> Note ::
>>>
>>> 1. I haven't persisted the intermediate DFs, as I think Spark will
>>> optimize the multiple SQLs into one physical execution plan.
>>> 2. Executor memory and driver memory are set to 4 GB, which is quite
>>> high as the data size is only in MB. (The submit command is sketched
>>> below.)
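>>>
>>> For reference, the job is submitted roughly like this (class and jar
>>> names are made up):
>>>
>>>     spark-submit --master yarn \
>>>       --driver-memory 4g --executor-memory 4g --num-executors 5 \
>>>       --class com.example.BatchJob batch-job.jar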
>>>
>>> Questions ::
>>>
>>> 1. Will Spark optimize multiple SQL queries into one single physical
>>> plan?
>>> 2. In the DAG I can see a lot of file reads and a lot of stages. Why? I
>>> only called an action once.
>>> 3. Will every SQL execute, with its intermediate result stored in
>>> memory?
>>> 4. What is causing the OOM and GC overhead here?
>>> 5. What optimizations could be applied?
>>>
>>> Spark Version 1.5.x
>>>
>>>
>>> Thanks in advance.
>>> Rabin