Re: How to optimize group by query fired using hiveContext.sql?

2015-10-05 Thread Umesh Kacha
Hi, thanks. I usually see the following errors in the Spark logs, and because
of them I think the executor gets lost. All of this happens because of a huge
data shuffle, which I can't avoid. I don't know what to do, please guide.

15/08/16 12:26:46 WARN spark.HeartbeatReceiver: Removing executor 10 with no recent heartbeats: 1051638 ms exceeds timeout 100 ms

Or

org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0
at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:384)
at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:381)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:380)
at org.apache.spark.MapOutputTracker.getServerStatuses(MapOutputTracker.scala:176)
at org.apache.spark.shuffle.hash.BlockStoreShuffleFetcher$.fetch(BlockStoreShuffleFetcher.scala:42)
at org.apache.spark.shuffle.hash.HashShuffleReader.read(HashShuffleReader.scala:40)
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:92)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:230)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:722)



Or YARN kills the container because of:

Container [pid=26783,containerID=container_1389136889967_0009_01_02] is
running beyond physical memory limits. Current usage: 30.2 GB of 30 GB
physical memory used; Killing container.
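
A minimal sketch of conf settings that are often tried for these three
symptoms (lost heartbeats, missing shuffle output locations, YARN
physical-memory kills), assuming Spark 1.5 in yarn-client mode; the values
are illustrative placeholders, not recommendations for this cluster:

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext

val conf = new SparkConf()
  .setAppName("group-by-tuning-sketch")
  // Give executors longer before the driver drops them for missed heartbeats
  // (long GC pauses during big shuffles often trigger that warning).
  .set("spark.network.timeout", "600s")
  // Off-heap headroom inside the 30 GB YARN container, in MB.
  .set("spark.yarn.executor.memoryOverhead", "3072")
  // Spread the group-by shuffle over more, smaller partitions.
  .set("spark.sql.shuffle.partitions", "800")
  // Retry shuffle fetches more patiently before failing the stage.
  .set("spark.shuffle.io.maxRetries", "10")
  .set("spark.shuffle.io.retryWait", "10s")

val sc = new SparkContext(conf)
val hiveContext = new HiveContext(sc)

None of this removes the shuffle itself; it only gives the executors more
room and time to survive it.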


On Mon, Oct 5, 2015 at 8:00 AM, Alex Rovner 
wrote:

> Can you at least copy paste the error(s) you are seeing when the job
> fails? Without the error message(s), it's hard to even suggest anything.
>
> *Alex Rovner*
> *Director, Data Engineering *
> *o:* 646.759.0052
>
>
> On Sat, Oct 3, 2015 at 9:50 AM, Umesh Kacha  wrote:
>
>> Hi, thanks. I can't share the YARN logs because of privacy at my company,
>> but I can tell you I have gone through them and found nothing except YARN
>> killing the container because it exceeds the physical memory limit.
>>
>> I am using the following command line script. The job launches around 1500
>> ExecutorService threads from the driver with a thread pool of 15, so at any
>> time 15 jobs will be running, as shown in the UI.
>>
>> ./spark-submit --class com.xyz.abc.MySparkJob \
>>   --conf "spark.executor.extraJavaOptions=-XX:MaxPermSize=512M" \
>>   --driver-java-options -XX:MaxPermSize=512m \
>>   --driver-memory 4g --master yarn-client \
>>   --executor-memory 27G --executor-cores 2 \
>>   --num-executors 40 \
>>   --jars /path/to/others-jars \
>>   /path/to/spark-job.jar
>>
>>
>> On Sat, Oct 3, 2015 at 7:11 PM, Alex Rovner 
>> wrote:
>>
>>> Can you send over your yarn logs along with the command you are using to
>>> submit your job?
>>>
>>> *Alex Rovner*
>>> *Director, Data Engineering *
>>> *o:* 646.759.0052
>>>
>>>
>>> On Sat, Oct 3, 2015 at 9:07 AM, Umesh Kacha 
>>> wrote:
>>>
 Hi Alex, thanks very much for the reply. Please read the following for more
 details about my problem.

 http://stackoverflow.com/questions/32317285/spark-executor-oom-issue-on-yarn

 Each of my containers has 8 cores and 30 GB max memory. So I 

Re: How to optimize group by query fired using hiveContext.sql?

2015-10-04 Thread Alex Rovner
Can you at least copy paste the error(s) you are seeing when the job fails?
Without the error message(s), it's hard to even suggest anything.

*Alex Rovner*
*Director, Data Engineering *
*o:* 646.759.0052


On Sat, Oct 3, 2015 at 9:50 AM, Umesh Kacha  wrote:

> Hi, thanks. I can't share the YARN logs because of privacy at my company,
> but I can tell you I have gone through them and found nothing except YARN
> killing the container because it exceeds the physical memory limit.
>
> I am using the following command line script. The job launches around 1500
> ExecutorService threads from the driver with a thread pool of 15, so at any
> time 15 jobs will be running, as shown in the UI.
>
> ./spark-submit --class com.xyz.abc.MySparkJob \
>   --conf "spark.executor.extraJavaOptions=-XX:MaxPermSize=512M" \
>   --driver-java-options -XX:MaxPermSize=512m \
>   --driver-memory 4g --master yarn-client \
>   --executor-memory 27G --executor-cores 2 \
>   --num-executors 40 \
>   --jars /path/to/others-jars \
>   /path/to/spark-job.jar
>
>
> On Sat, Oct 3, 2015 at 7:11 PM, Alex Rovner 
> wrote:
>
>> Can you send over your yarn logs along with the command you are using to
>> submit your job?
>>
>> *Alex Rovner*
>> *Director, Data Engineering *
>> *o:* 646.759.0052
>>
>>
>> On Sat, Oct 3, 2015 at 9:07 AM, Umesh Kacha 
>> wrote:
>>
>>> Hi Alex, thanks very much for the reply. Please read the following for more
>>> details about my problem.
>>>
>>> http://stackoverflow.com/questions/32317285/spark-executor-oom-issue-on-yarn
>>>
>>> Each of my containers has 8 cores and 30 GB max memory, so I am running in
>>> yarn-client mode with 40 executors of 27 GB / 2 cores each. If I use more
>>> cores, my job starts losing more executors. I tried setting
>>> spark.yarn.executor.memoryOverhead to around 2 GB and even 8 GB, but it
>>> does not help; I lose executors no matter what. The reason is that my jobs
>>> shuffle lots of data, up to 20 GB per job, as I have seen in the UI. The
>>> shuffle happens because of the group by, and I can't avoid it in my case.
>>>
>>>
>>>
>>> On Sat, Oct 3, 2015 at 6:27 PM, Alex Rovner 
>>> wrote:
>>>
 This sounds like you need to increase YARN overhead settings with the 
 "spark.yarn.executor.memoryOverhead"
 parameter. See http://spark.apache.org/docs/latest/running-on-yarn.html
 for more information on the setting.

 If that does not work for you, please provide the error messages and
 the command line you are using to submit your jobs for further
 troubleshooting.


 *Alex Rovner*
 *Director, Data Engineering *
 *o:* 646.759.0052


 On Sat, Oct 3, 2015 at 6:19 AM, unk1102  wrote:

> Hi, I have a couple of Spark jobs that use a group-by query fired from
> hiveContext.sql(). I know group by is evil, but in my use case I can't avoid
> it; I have around 7-8 fields on which I need to group by. I am also using
> df1.except(df2), which also seems to be a heavy operation and does lots of
> shuffling; please see my UI snapshot:
> <
> http://apache-spark-user-list.1001560.n3.nabble.com/file/n24914/IMG_20151003_151830218.jpg
> >
>
> I have tried almost every optimisation, including Spark 1.5, but nothing
> seems to work: my job fails or hangs because the executors reach the
> physical memory limit and YARN kills them. I have around 1TB of data to
> process and it is skewed. Please guide.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-optimize-group-by-query-fired-using-hiveContext-sql-tp24914.html
> Sent from the Apache Spark User List mailing list archive at
> Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>

>>>
>>
>


How to optimize group by query fired using hiveContext.sql?

2015-10-03 Thread unk1102
Hi, I have a couple of Spark jobs that use a group-by query fired from
hiveContext.sql(). I know group by is evil, but in my use case I can't avoid
it; I have around 7-8 fields on which I need to group by. I am also using
df1.except(df2), which also seems to be a heavy operation and does lots of
shuffling; please see my UI snapshot:

<http://apache-spark-user-list.1001560.n3.nabble.com/file/n24914/IMG_20151003_151830218.jpg>

I have tried almost every optimisation, including Spark 1.5, but nothing seems
to work: my job fails or hangs because the executors reach the physical memory
limit and YARN kills them. I have around 1TB of data to process and it is
skewed. Please guide.
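
One pattern that sometimes helps with a skewed group by like this is a
two-stage ("salted") aggregation: pre-aggregate on the grouping columns plus
a random salt, then aggregate again without the salt, so a single hot key is
spread across many reducers. A rough sketch against the Spark 1.5 DataFrame
API, with hypothetical column names (key1-key3, value) standing in for the
real 7-8 fields:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

def saltedSum(df: DataFrame, buckets: Int = 32): DataFrame = {
  // Stage 1: add a random salt column so one hot key lands in many partitions.
  val salted = df.withColumn("salt", (rand() * buckets).cast("int"))

  val partial = salted
    .groupBy(col("key1"), col("key2"), col("key3"), col("salt"))
    .agg(sum(col("value")).as("partial_sum"))

  // Stage 2: drop the salt and combine the partial sums into the final result.
  partial
    .groupBy(col("key1"), col("key2"), col("key3"))
    .agg(sum(col("partial_sum")).as("total"))
}

This only suits aggregations that can be computed in two steps (sum, count,
min, max); raising spark.sql.shuffle.partitions from its default of 200 is the
simpler first lever, and df1.except(df2) will still shuffle regardless.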



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-optimize-group-by-query-fired-using-hiveContext-sql-tp24914.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to optimize group by query fired using hiveContext.sql?

2015-10-03 Thread Alex Rovner
This sounds like you need to increase YARN overhead settings with the
"spark.yarn.executor.memoryOverhead"
parameter. See http://spark.apache.org/docs/latest/running-on-yarn.html for
more information on the setting.

If that does not work for you, please provide the error messages and the
command line you are using to submit your jobs for further troubleshooting.


*Alex Rovner*
*Director, Data Engineering *
*o:* 646.759.0052


On Sat, Oct 3, 2015 at 6:19 AM, unk1102  wrote:

> Hi, I have a couple of Spark jobs that use a group-by query fired from
> hiveContext.sql(). I know group by is evil, but in my use case I can't avoid
> it; I have around 7-8 fields on which I need to group by. I am also using
> df1.except(df2), which also seems to be a heavy operation and does lots of
> shuffling; please see my UI snapshot:
> <
> http://apache-spark-user-list.1001560.n3.nabble.com/file/n24914/IMG_20151003_151830218.jpg
> >
>
> I have tried almost every optimisation, including Spark 1.5, but nothing
> seems to work: my job fails or hangs because the executors reach the
> physical memory limit and YARN kills them. I have around 1TB of data to
> process and it is skewed. Please guide.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-optimize-group-by-query-fired-using-hiveContext-sql-tp24914.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: How to optimize group by query fired using hiveContext.sql?

2015-10-03 Thread Umesh Kacha
Hi Alex, thanks very much for the reply. Please read the following for more
details about my problem.

http://stackoverflow.com/questions/32317285/spark-executor-oom-issue-on-yarn

Each of my containers has 8 cores and 30 GB max memory, so I am running in
yarn-client mode with 40 executors of 27 GB / 2 cores each. If I use more
cores, my job starts losing more executors. I tried setting
spark.yarn.executor.memoryOverhead to around 2 GB and even 8 GB, but it does
not help; I lose executors no matter what. The reason is that my jobs shuffle
lots of data, up to 20 GB per job, as I have seen in the UI. The shuffle
happens because of the group by, and I can't avoid it in my case.
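
One detail worth calling out from the numbers above: the executor heap plus
spark.yarn.executor.memoryOverhead must fit inside the 30 GB container, so
raising the overhead without shrinking --executor-memory cannot help. A
back-of-the-envelope sketch, assuming the Spark 1.5 default overhead of
max(384 MB, 10% of executor memory); the 21 GB / 6 GB split is only an
example:

import org.apache.spark.SparkConf

// Rough container math for a 30 GB YARN limit (illustrative numbers only):
//   27 GB heap + ~2.7 GB default overhead  ~= 29.7 GB -> right at the limit
//   27 GB heap +  8 GB requested overhead  =  35 GB   -> YARN cannot grant it
// So the heap has to shrink to make room for the overhead:
val conf = new SparkConf()
  .set("spark.executor.memory", "21g")                // executor heap
  .set("spark.yarn.executor.memoryOverhead", "6144")  // off-heap headroom, MB
// 21 GB + 6 GB = 27 GB, which leaves slack under the 30 GB container cap.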



On Sat, Oct 3, 2015 at 6:27 PM, Alex Rovner 
wrote:

> This sounds like you need to increase YARN overhead settings with the 
> "spark.yarn.executor.memoryOverhead"
> parameter. See http://spark.apache.org/docs/latest/running-on-yarn.html
> for more information on the setting.
>
> If that does not work for you, please provide the error messages and the
> command line you are using to submit your jobs for further troubleshooting.
>
>
> *Alex Rovner*
> *Director, Data Engineering *
> *o:* 646.759.0052
>
>
> On Sat, Oct 3, 2015 at 6:19 AM, unk1102  wrote:
>
>> Hi, I have a couple of Spark jobs that use a group-by query fired from
>> hiveContext.sql(). I know group by is evil, but in my use case I can't avoid
>> it; I have around 7-8 fields on which I need to group by. I am also using
>> df1.except(df2), which also seems to be a heavy operation and does lots of
>> shuffling; please see my UI snapshot:
>> <
>> http://apache-spark-user-list.1001560.n3.nabble.com/file/n24914/IMG_20151003_151830218.jpg
>> >
>>
>> I have tried almost every optimisation, including Spark 1.5, but nothing
>> seems to work: my job fails or hangs because the executors reach the
>> physical memory limit and YARN kills them. I have around 1TB of data to
>> process and it is skewed. Please guide.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-optimize-group-by-query-fired-using-hiveContext-sql-tp24914.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: How to optimize group by query fired using hiveContext.sql?

2015-10-03 Thread Alex Rovner
Can you send over your yarn logs along with the command you are using to
submit your job?

*Alex Rovner*
*Director, Data Engineering *
*o:* 646.759.0052


On Sat, Oct 3, 2015 at 9:07 AM, Umesh Kacha  wrote:

> Hi Alex, thanks very much for the reply. Please read the following for more
> details about my problem.
>
> http://stackoverflow.com/questions/32317285/spark-executor-oom-issue-on-yarn
>
> Each of my containers has 8 cores and 30 GB max memory, so I am running in
> yarn-client mode with 40 executors of 27 GB / 2 cores each. If I use more
> cores, my job starts losing more executors. I tried setting
> spark.yarn.executor.memoryOverhead to around 2 GB and even 8 GB, but it does
> not help; I lose executors no matter what. The reason is that my jobs shuffle
> lots of data, up to 20 GB per job, as I have seen in the UI. The shuffle
> happens because of the group by, and I can't avoid it in my case.
>
>
>
> On Sat, Oct 3, 2015 at 6:27 PM, Alex Rovner 
> wrote:
>
>> This sounds like you need to increase YARN overhead settings with the 
>> "spark.yarn.executor.memoryOverhead"
>> parameter. See http://spark.apache.org/docs/latest/running-on-yarn.html
>> for more information on the setting.
>>
>> If that does not work for you, please provide the error messages and the
>> command line you are using to submit your jobs for further troubleshooting.
>>
>>
>> *Alex Rovner*
>> *Director, Data Engineering *
>> *o:* 646.759.0052
>>
>>
>> On Sat, Oct 3, 2015 at 6:19 AM, unk1102  wrote:
>>
>>> Hi, I have a couple of Spark jobs that use a group-by query fired from
>>> hiveContext.sql(). I know group by is evil, but in my use case I can't
>>> avoid it; I have around 7-8 fields on which I need to group by. I am also
>>> using df1.except(df2), which also seems to be a heavy operation and does
>>> lots of shuffling; please see my UI snapshot:
>>> <
>>> http://apache-spark-user-list.1001560.n3.nabble.com/file/n24914/IMG_20151003_151830218.jpg
>>> >
>>>
>>> I have tried almost every optimisation, including Spark 1.5, but nothing
>>> seems to work: my job fails or hangs because the executors reach the
>>> physical memory limit and YARN kills them. I have around 1TB of data to
>>> process and it is skewed. Please guide.
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-optimize-group-by-query-fired-using-hiveContext-sql-tp24914.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>


Re: How to optimize group by query fired using hiveContext.sql?

2015-10-03 Thread Umesh Kacha
Hi, thanks. I can't share the YARN logs because of privacy at my company, but
I can tell you I have gone through them and found nothing except YARN killing
the container because it exceeds the physical memory limit.

I am using the following command line script. The job launches around 1500
ExecutorService threads from the driver with a thread pool of 15, so at any
time 15 jobs will be running, as shown in the UI.

./spark-submit --class com.xyz.abc.MySparkJob \
  --conf "spark.executor.extraJavaOptions=-XX:MaxPermSize=512M" \
  --driver-java-options -XX:MaxPermSize=512m \
  --driver-memory 4g --master yarn-client \
  --executor-memory 27G --executor-cores 2 \
  --num-executors 40 \
  --jars /path/to/others-jars \
  /path/to/spark-job.jar
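
For context, a minimal sketch (a reconstruction, not the actual code) of the
driver-side pattern described above: a fixed pool of 15 threads, each firing
a hiveContext.sql group-by. The table list and query are placeholders:

import java.util.concurrent.{Executors, TimeUnit}

import org.apache.spark.sql.hive.HiveContext

def runConcurrentGroupBys(hiveContext: HiveContext, tables: Seq[String]): Unit = {
  // ~1500 units of work pushed through 15 threads, so at most 15 Spark jobs
  // run concurrently and all of them share the same 40 executors.
  val pool = Executors.newFixedThreadPool(15)

  tables.foreach { table =>
    pool.submit(new Runnable {
      override def run(): Unit = {
        // Placeholder query -- the real job groups by 7-8 columns.
        hiveContext.sql(
          s"SELECT key1, key2, COUNT(*) AS cnt FROM $table GROUP BY key1, key2"
        ).collect()
      }
    })
  }

  pool.shutdown()
  pool.awaitTermination(24, TimeUnit.HOURS)
}

With 15 such jobs shuffling at once on the same executors, their memory
pressure is additive, which is consistent with containers being killed for
exceeding the physical limit.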


On Sat, Oct 3, 2015 at 7:11 PM, Alex Rovner 
wrote:

> Can you send over your yarn logs along with the command you are using to
> submit your job?
>
> *Alex Rovner*
> *Director, Data Engineering *
> *o:* 646.759.0052
>
>
> On Sat, Oct 3, 2015 at 9:07 AM, Umesh Kacha  wrote:
>
>> Hi Alex, thanks very much for the reply. Please read the following for more
>> details about my problem.
>>
>> http://stackoverflow.com/questions/32317285/spark-executor-oom-issue-on-yarn
>>
>> Each of my containers has 8 cores and 30 GB max memory, so I am running in
>> yarn-client mode with 40 executors of 27 GB / 2 cores each. If I use more
>> cores, my job starts losing more executors. I tried setting
>> spark.yarn.executor.memoryOverhead to around 2 GB and even 8 GB, but it does
>> not help; I lose executors no matter what. The reason is that my jobs
>> shuffle lots of data, up to 20 GB per job, as I have seen in the UI. The
>> shuffle happens because of the group by, and I can't avoid it in my case.
>>
>>
>>
>> On Sat, Oct 3, 2015 at 6:27 PM, Alex Rovner 
>> wrote:
>>
>>> This sounds like you need to increase YARN overhead settings with the 
>>> "spark.yarn.executor.memoryOverhead"
>>> parameter. See http://spark.apache.org/docs/latest/running-on-yarn.html
>>> for more information on the setting.
>>>
>>> If that does not work for you, please provide the error messages and the
>>> command line you are using to submit your jobs for further troubleshooting.
>>>
>>>
>>> *Alex Rovner*
>>> *Director, Data Engineering *
>>> *o:* 646.759.0052
>>>
>>>
>>> On Sat, Oct 3, 2015 at 6:19 AM, unk1102  wrote:
>>>
 Hi, I have a couple of Spark jobs that use a group-by query fired from
 hiveContext.sql(). I know group by is evil, but in my use case I can't avoid
 it; I have around 7-8 fields on which I need to group by. I am also using
 df1.except(df2), which also seems to be a heavy operation and does lots of
 shuffling; please see my UI snapshot:
 <
 http://apache-spark-user-list.1001560.n3.nabble.com/file/n24914/IMG_20151003_151830218.jpg
 >

 I have tried almost every optimisation, including Spark 1.5, but nothing
 seems to work: my job fails or hangs because the executors reach the
 physical memory limit and YARN kills them. I have around 1TB of data to
 process and it is skewed. Please guide.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/How-to-optimize-group-by-query-fired-using-hiveContext-sql-tp24914.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


>>>
>>
>