Re: How to verify if spark is using kryo serializer for shuffle

2016-05-08 Thread Nirav Patel
Yes, my mistake. I am using Spark 1.5.2, not 2.x.

I looked at the running Spark driver JVM process on Linux. It looks like my
settings are not being applied to the driver. We use an Oozie Spark action to
launch Spark, so I will have to investigate that further.

Hopefully Spark has replaced, or will replace, the memory-hungry Java
serializer with a better streaming serializer.
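
(For what it's worth, a quick way to confirm whether the driver actually picked
up the settings is to print them from inside the driver itself. A minimal Scala
sketch, assuming a stock SparkContext; the object and app names are just
placeholders:)

import org.apache.spark.{SparkConf, SparkContext}

object DriverConfCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("driver-conf-check"))

    // What Spark thinks it was configured with (from spark-defaults.conf,
    // spark-submit flags, or the Oozie spark action).
    println("spark.driver.memory = " + sc.getConf.get("spark.driver.memory", "<not set>"))
    println("spark.serializer    = " + sc.getConf.get("spark.serializer", "<not set>"))

    // What the driver JVM was actually started with. If this is much smaller
    // than spark.driver.memory, the setting arrived after the JVM launched.
    println(s"driver max heap = ${Runtime.getRuntime.maxMemory / (1024 * 1024)} MB")

    sc.stop()
  }
}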

Thanks

Re: How to verify if spark is using kryo serializer for shuffle

2016-05-08 Thread Ted Yu
See the following:
[SPARK-7997][CORE] Remove Akka from Spark Core and Streaming

I guess you meant you are using Spark 1.5.1

For the time being, consider increasing spark.driver.memory

Cheers

Re: How to verify if spark is using kryo serializer for shuffle

2016-05-08 Thread Nirav Patel
Yes, I am using YARN client mode, hence I specified the AM settings too.
What do you mean, Akka is moved out of the picture? I am using Spark 2.5.1.

Sent from my iPhone

Re: How to verify if spark is using kryo serializer for shuffle

2016-05-08 Thread Ted Yu
Are you using YARN client mode?

See
https://spark.apache.org/docs/latest/running-on-yarn.html

In cluster mode, spark.yarn.am.memory is not effective.

For Spark 2.0, Akka is moved out of the picture.
FYI

Re: How to verify if spark is using kryo serializer for shuffle

2016-05-07 Thread Nirav Patel
I have 20 executors with 6 cores each, and 5 stages in total. The job fails on
the 5th stage; all stages have 6474 tasks. The 5th stage is a count operation,
which also performs an aggregateByKey as part of its lazy evaluation.
I am setting:
spark.driver.memory=10G, spark.yarn.am.memory=2G and
spark.driver.maxResultSize=9G
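
(For reference, a spark-shell-style Scala sketch of how those settings would
look if applied programmatically. One caveat: in client mode, spark.driver.memory
set this way is too late, because the driver JVM is already running; it has to
be passed at launch time instead, e.g. via --driver-memory on spark-submit or
the equivalent option in the Oozie Spark action.)

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.driver.memory", "10g")       // only effective if supplied before the driver JVM starts
  .set("spark.yarn.am.memory", "2g")       // client mode only; not effective in cluster mode
  .set("spark.driver.maxResultSize", "9g") // caps results collected back to the driver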


On a side note, could it be something to do with the Java serialization
library, ByteArrayOutputStream using a byte array? Can it be replaced by some
better serializing library?

https://bugs.openjdk.java.net/browse/JDK-8055949
https://bugs.openjdk.java.net/browse/JDK-8136527

Thanks

Re: How to verify if spark is using kryo serializer for shuffle

2016-05-07 Thread Ashish Dubey
The driver maintains the complete metadata of the application (scheduling of
executors and the messaging that controls execution), and this code seems to be
failing in that code path. That said, there is JVM overhead that grows with the
number of executors, stages, and tasks in your app. Do you know your driver
heap size and application structure (number of stages and tasks)?

Ashish

Re: How to verify if spark is using kryo serializer for shuffle

2016-05-07 Thread Nirav Patel
Right, but these logs are from the Spark driver, and the driver does seem to
use Akka.
ERROR [sparkDriver-akka.actor.default-dispatcher-17]
akka.actor.ActorSystemImpl: Uncaught fatal error from thread
[sparkDriver-akka.remote.default-remote-dispatcher-5] shutting down
ActorSystem [sparkDriver]

I saw the following log line before the above happened.

2016-05-06 09:49:17,813 INFO [sparkDriver-akka.actor.default-dispatcher-17]
org.apache.spark.MapOutputTrackerMasterEndpoint: Asked to send map output
locations for shuffle 1 to hdn6.xactlycorporation.local:44503


As far as I know, the driver only coordinates the shuffle; it does not actually
do anything within its own process that should cause a memory issue. Can you
explain under what circumstances I could see this error in the driver logs? I
don't do a collect or any other driver-side operation that would cause this. It
fails during the aggregateByKey operation, but that should happen in the
executor JVMs, NOT in the driver JVM.


Thanks

Re: How to verify if spark is using kryo serializer for shuffle

2016-05-07 Thread Ted Yu
bq.   at akka.serialization.JavaSerializer.toBinary(Serializer.scala:129)

It was Akka, not the shuffle path, which used the JavaSerializer here.

Cheers


How to verify if spark is using kryo serializer for shuffle

2016-05-07 Thread Nirav Patel
Hi,

I thought I was using the Kryo serializer for shuffle. I could verify it from
the Spark UI's Environment tab:
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.kryo.registrator com.myapp.spark.jobs.conf.SparkSerializerRegistrator
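
(For reference, a minimal spark-shell-style Scala sketch of how that
configuration is typically wired up; the registrator below is only an assumed
approximation of the real com.myapp class, and MyRecord is a placeholder for
the application's own types:)

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

case class MyRecord(id: Long, value: String) // placeholder application class

class SparkSerializerRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    // Registering classes lets Kryo write compact ids instead of full class names.
    kryo.register(classOf[MyRecord])
  }
}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "com.myapp.spark.jobs.conf.SparkSerializerRegistrator")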


But when I see the following error in the driver logs, it looks like Spark is
using the JavaSerializer:

2016-05-06 09:49:26,490 ERROR [sparkDriver-akka.actor.default-dispatcher-17]
akka.actor.ActorSystemImpl: Uncaught fatal error from thread
[sparkDriver-akka.remote.default-remote-dispatcher-6] shutting down
ActorSystem [sparkDriver]

java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:2271)
        at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
        at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
        at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
        at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876)
        at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188)
        at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
        at akka.serialization.JavaSerializer$$anonfun$toBinary$1.apply$mcV$sp(Serializer.scala:129)
        at akka.serialization.JavaSerializer$$anonfun$toBinary$1.apply(Serializer.scala:129)
        at akka.serialization.JavaSerializer$$anonfun$toBinary$1.apply(Serializer.scala:129)
        at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
        at akka.serialization.JavaSerializer.toBinary(Serializer.scala:129)
        at akka.remote.MessageSerializer$.serialize(MessageSerializer.scala:36)
        at akka.remote.EndpointWriter$$anonfun$serializeMessage$1.apply(Endpoint.scala:843)
        at akka.remote.EndpointWriter$$anonfun$serializeMessage$1.apply(Endpoint.scala:843)
        at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
        at akka.remote.EndpointWriter.serializeMessage(Endpoint.scala:842)
        at akka.remote.EndpointWriter.writeSend(Endpoint.scala:743)
        at akka.remote.EndpointWriter$$anonfun$2.applyOrElse(Endpoint.scala:718)
        at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
        at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:411)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
        at akka.actor.ActorCell.invoke(ActorCell.scala:487)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
        at akka.dispatch.Mailbox.run(Mailbox.scala:220)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)



What am I missing here?

Thanks
