Hi,
1. You are doing some analytics I guess? *Yes.*

2. It is almost impossible to guess what is happening, except that you are looping 50 times over the same set of SQL? *I am not looping any SQL. Each SQL is called exactly once, and each one requires the output of the previous SQL.*

3. Your SQL step n depends on step n-1, so Spark cannot get rid of steps 1 through n. *True. But if I have N SQLs and every i-th SQL depends on the (i-1)-th, how does Spark optimize the memory? Is it that for every i-th SQL it starts execution again from stage 0?*

4. You are not storing anything in memory (no cache, no persist), so all memory is used for the execution. *If Spark is not storing anything in memory, then when it executes the i-th SQL does it start execution from stage 0, i.e. from the file read?*

5. What happens when you run it only once? How much memory is used (look at the UI page, 4040 by default)? *I checked the Spark UI DAG: so many file reads. Why?*

6. What Spark mode is being used (Local, Standalone, Yarn)? *Yarn.*

7. OOM could be anything depending on how much you are allocating to your driver memory in spark-submit. *Driver and executor memory are both set to 4 GB, the input data size is less than 1 GB, and the number of executors is 5.*

*I am still a bit confused about Spark's execution plan for multiple SQLs with only one action. Is it executing each SQL separately and trying to store the intermediate results in memory, and is that what is causing the OOM/GC overhead?*

*And my questions still are:*

*1. Will Spark optimize multiple SQL queries into one single physical plan, which would at least not execute the same stage twice and would read each file only once?*
*2. In the DAG I can see a lot of file reads and a lot of stages. Why? I only called an action once. Why, in multiple stages, is Spark starting again from the file read?*
*3. Will every SQL execute, with its intermediate result stored in memory?*
*4. What is causing the OOM and GC overhead here?*
*5. What optimization could be taken care of here?*
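*To make question 2 concrete, here is a minimal sketch (Spark 1.5-era API; the input path, table names, and queries are made up for illustration) of why the same file scan can show up more than once in the DAG, and how persist() changes that. Within a single action, Spark plans each reference to an uncached temp table independently, so a table referenced twice is scanned twice:*

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.storage.StorageLevel

object ChainedSqlSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("chained-sql-sketch"))
    val sqlContext = new SQLContext(sc)

    // Hypothetical input; any DataFrame source behaves the same way.
    val base = sqlContext.read.parquet("/data/input")
    base.registerTempTable("base")

    val step1 = sqlContext.sql(
      "SELECT key, SUM(value) AS total FROM base GROUP BY key")
    step1.registerTempTable("step1")

    // step1 is referenced twice in the join below. Without this persist,
    // both references are planned from scratch, so the DAG shows base's
    // file scan (and the aggregation) twice. With it, Spark substitutes
    // the cached relation into both references.
    step1.persist(StorageLevel.MEMORY_AND_DISK)

    val step2 = sqlContext.sql(
      """SELECT a.key, a.total, b.total AS other_total
        |FROM step1 a JOIN step1 b ON a.key = b.key""".stripMargin)

    // explain(true) prints the logical and physical plans: the quickest
    // way to count how many scans a query actually contains.
    step2.explain(true)

    step2.write.parquet("/data/output") // the single action
  }
}
```

*Whether this matches the 50-step job depends on the actual queries; calling explain() on the final DF will show how deep the single physical plan has become.*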
On Sat, Sep 10, 2016 at 11:35 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> On 10 September 2016 at 06:21, Rabin Banerjee <dev.rabin.baner...@gmail.com> wrote:
>
>> Hi All,
>>
>> I am writing and executing a Spark batch program which only uses Spark SQL, but it is taking a lot of time and finally failing with GC overhead errors.
>>
>> Here is the program:
>>
>> 1. Read 3 files (one medium-sized and 2 small) and register them as DFs.
>> 2. Fire a SQL with complex aggregation and windowing, and register the result as a DF.
>> 3. ......... Repeat step 2 almost 50 times, so 50 SQLs.
>> 4. All SQLs are sequential, i.e. the next step requires the previous step's result.
>> 5. Finally, save the final DF (this is the only action called).
>>
>> Note:
>>
>> 1. I haven't persisted the intermediate DFs, as I think Spark will optimize the multiple SQLs into one physical execution plan.
>> 2. Executor memory and driver memory are set to 4 GB, which is quite high as the data size is in MB.
>>
>> Questions:
>>
>> 1. Will Spark optimize multiple SQL queries into one single physical plan?
>> 2. In the DAG I can see a lot of file reads and a lot of stages. Why? I only called an action once.
>> 3. Will every SQL execute, with its intermediate result stored in memory?
>> 4. What is causing the OOM and GC overhead here?
>> 5. What optimization could be taken care of here?
>>
>> Spark version 1.5.x
>>
>> Thanks in advance,
>> Rabin
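For reference, a hedged sketch of the pipeline shape described in the quoted message (placeholder paths and a trivial placeholder query, not the actual aggregations), showing one common mitigation: every few steps, persist and materialize the intermediate DF so later steps start from the cached relation instead of from a plan that reaches all the way back to the file scans. Spark 1.5 has no DataFrame checkpoint, so persist() plus a count() is the usual stand-in; otherwise the chained-but-lazy design produces one very deep plan whose planning and recomputation cost may be part of the GC overhead described.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.storage.StorageLevel

object FiftyStepPipeline {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("fifty-step-pipeline"))
    val sqlContext = new SQLContext(sc)

    // Placeholder inputs standing in for the one medium and two small files.
    sqlContext.read.parquet("/data/medium").registerTempTable("step0")
    sqlContext.read.parquet("/data/small1").registerTempTable("small1")
    sqlContext.read.parquet("/data/small2").registerTempTable("small2")

    var lastCached: DataFrame = null
    for (i <- 1 to 50) {
      // Placeholder query; in the real job each step is a different
      // aggregation/window over the previous step's temp table.
      val df = sqlContext.sql(s"SELECT * FROM step${i - 1}")
      df.registerTempTable(s"step$i")

      if (i % 10 == 0) {
        // Cache and force materialization so later steps read cached
        // blocks; this caps how deep any single physical plan can get.
        df.persist(StorageLevel.MEMORY_AND_DISK)
        df.count()
        // Drop the previous cut point; df itself is already materialized.
        if (lastCached != null) lastCached.unpersist()
        lastCached = df
      }
    }

    sqlContext.table("step50").write.parquet("/data/final") // the final save
  }
}
```

The interval of 10 is arbitrary; the trade-off is cache memory versus plan depth, and the Spark UI's storage and stage pages are the place to check which side is hurting.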