Re: Driver running out of memory - caused by many tasks?

2015-08-27 Thread Ashish Rangole
I suggest taking a heap dump of driver process using jmap. Then open that
dump in a tool like Visual VM to see which object(s) are taking up heap
space. It is easy to do. We did this and found out that in our case it was
the data structure that stores info about stages, jobs and tasks. There can
be other reasons as well, of course.
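In case it's useful, the two jmap invocations below are roughly what's involved
(the pid and output file name are placeholders):

  jmap -dump:live,format=b,file=driver-heap.hprof <driver-pid>   (full heap dump, load into VisualVM)
  jmap -histo:live <driver-pid>   (quick per-class histogram on stdout)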
On Aug 27, 2015 4:17 AM, andrew.row...@thomsonreuters.com wrote:

 I should have mentioned: yes I am using Kryo and have registered KeyClass
 and ValueClass.



 I guess it’s not clear to me what is actually taking up space on the
 driver heap - I can’t see how it can be data with the code that I have.

 On 27/08/2015 12:09, Ewan Leith ewan.le...@realitymine.com wrote:

 Are you using the Kryo serializer? If not, have a look at it, it can save
 a lot of memory during shuffles
 
 https://spark.apache.org/docs/latest/tuning.html
 
 I did a similar task and had various issues with the volume of data being
 parsed in one go, but Kryo helped a lot. The main difference I can see between
 your job and mine is that my input classes were just a string and a byte
 array, which I then processed once they had been read into the RDD - maybe
 your classes are memory-heavy?
 
 
 Thanks,
 Ewan
 
 -----Original Message-----
 From: andrew.row...@thomsonreuters.com [mailto:
 andrew.row...@thomsonreuters.com]
 Sent: 27 August 2015 11:53
 To: user@spark.apache.org
 Subject: Driver running out of memory - caused by many tasks?
 
 I have a Spark 1.4.1 on YARN job where the first stage has ~149,000
 tasks (it’s reading a few TB of data). The job itself is fairly simple -
 it’s just getting a list of distinct values:
 
 val days = spark
   .sequenceFile(inputDir, classOf[KeyClass], classOf[ValueClass])
   .sample(withReplacement = false, fraction = 0.01)
   .map(row => row._1.getTimestamp.toString("yyyy-MM-dd"))
   .distinct()
   .collect()
 
 The cardinality of the ‘day’ is quite small - there’s only a handful.
 However, I’m frequently running into OutOfMemory issues on the driver. I’ve
 had it fail with 24GB RAM, and am currently nudging it upwards to find out
 where it works. The ratio between input and shuffle write in the distinct
 stage is about 3TB:7MB. On a smaller dataset, it works without issue on a
 smaller (4GB) heap. In YARN cluster mode, I get a failure message similar
 to:
 
 Container
 [pid=36844,containerID=container_e15_1438040390147_4982_01_01] is
 running beyond physical memory limits. Current usage: 27.6 GB of 27 GB
 physical memory used; 29.5 GB of 56.7 GB virtual memory used. Killing
 container.
 
 
 Is the driver running out of memory simply due to the number of tasks, or
 is there something about the job program that’s causing it to put a lot of
 data into the driver heap and go oom? If the former, is there any general
 guidance about the amount of memory to give to the driver as a function of
 how many tasks there are?
 
 Andrew


Re: Driver running out of memory - caused by many tasks?

2015-08-27 Thread andrew.rowson
Thanks for this tip.

I ran it in yarn-client mode with driver-memory = 4G and took a dump once the 
heap got close to 4G.

 num     #instances         #bytes  class name
-----------------------------------------------
   1:        446169     3661137256  [J
   2:       2032795      222636720  [C
   3:          8992       57717200  [B
   4:       2031653       48759672  java.lang.String
   5:        920263       22086312  scala.collection.immutable.$colon$colon
   6:        156233       20495936  [Ljava.lang.Object;


So, somehow there's 3.5GB worth of long[] arrays ([J) in there, each one 
apparently using about 8KB of heap. Digging into these instances reveals that 
the vast majority are identical long[] of size 1024 with every element set to 
-1 (8,216 bytes each).
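
(For anyone who wants to poke at the same dump: VisualVM's OQL console should
be able to pick these out with a query along the lines of

  select a from [J a where a.length == 1024

where [J is the JVM's name for long[]; from there you can look at the incoming
references on a few instances to see what's actually holding them.)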

I’m at a bit of a loss as to what this could be. Any ideas how I can understand 
what’s happening?

Andrew



From:  Ashish Rangole
Date:  Thursday, 27 August 2015 15:24
To:  Andrew Rowson
Cc:  user, ewan.le...@realitymine.com
Subject:  Re: Driver running out of memory - caused by many tasks?


I suggest taking a heap dump of driver process using jmap. Then open that dump 
in a tool like Visual VM to see which object(s) are taking up heap space. It is 
easy to do. We did this and found out that in our case it was the data 
structure that
 stores info about stages, jobs and tasks. There can be other reasons as well, 
of course.

On Aug 27, 2015 4:17 AM, andrew.row...@thomsonreuters.com wrote:

I should have mentioned: yes I am using Kryo and have registered KeyClass and 
ValueClass.



I guess it’s not clear to me what is actually taking up space on the driver 
heap - I can’t see how it can be data with the code that I have.

On 27/08/2015 12:09, Ewan Leith ewan.le...@realitymine.com wrote:

Are you using the Kryo serializer? If not, have a look at it, it can save a 
lot of memory during shuffles

https://spark.apache.org/docs/latest/tuning.html

I did a similar task and had various issues with the volume of data being 
parsed in one go, but Kryo helped a lot. The main difference I can see between 
your job and mine is that my input classes were just a string and a byte 
array, which I then processed once they had been read into the RDD - maybe 
your classes are memory-heavy?


Thanks,
Ewan

-----Original Message-----
From: andrew.row...@thomsonreuters.com 
[mailto:andrew.row...@thomsonreuters.com]
Sent: 27 August 2015 11:53
To: user@spark.apache.org
Subject: Driver running out of memory - caused by many tasks?

I have a Spark 1.4.1 on YARN job where the first stage has ~149,000 tasks 
(it’s reading a few TB of data). The job itself is fairly simple - it’s just 
getting a list of distinct values:

val days = spark
  .sequenceFile(inputDir, classOf[KeyClass], classOf[ValueClass])
  .sample(withReplacement = false, fraction = 0.01)
  .map(row => row._1.getTimestamp.toString("yyyy-MM-dd"))
  .distinct()
  .collect()

The cardinality of the ‘day’ is quite small - there’s only a handful. However, 
I’m frequently running into OutOfMemory issues on the driver. I’ve had it fail 
with 24GB RAM, and am currently nudging it upwards to find out where it works. 
The ratio between input
 and shuffle write in the distinct stage is about 3TB:7MB. On a smaller 
dataset, it works without issue on a smaller (4GB) heap. In YARN cluster mode, 
I get a failure message similar to:

Container 
 [pid=36844,containerID=container_e15_1438040390147_4982_01_01] is running 
 beyond physical memory limits. Current usage: 27.6 GB of 27 GB physical 
 memory used; 29.5 GB of 56.7 GB virtual memory used. Killing container.


Is the driver running out of memory simply due to the number of tasks, or is 
there something about the job program that’s causing it to put a lot of data 
into the driver heap and go oom? If the former, is there any general guidance 
about the amount of memory
 to give to the driver as a function of how many tasks there are?

Andrew



RE: Driver running out of memory - caused by many tasks?

2015-08-27 Thread Ewan Leith
Are you using the Kryo serializer? If not, have a look at it, it can save a lot 
of memory during shuffles

https://spark.apache.org/docs/latest/tuning.html
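
If you do switch, it's only a couple of lines on the SparkConf - a minimal
sketch, reusing the KeyClass/ValueClass from your snippet:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  // use Kryo rather than Java serialization for shuffled/cached data
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // registering classes up front avoids writing full class names into every record
  .registerKryoClasses(Array(classOf[KeyClass], classOf[ValueClass]))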

I did a similar task and had various issues with the volume of data being 
parsed in one go, but Kryo helped a lot. The main difference I can see between 
your job and mine is that my input classes were just a string and a byte 
array, which I then processed once they had been read into the RDD - maybe 
your classes are memory-heavy?
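
One more thought, and this is a guess on my part: the YARN message below is the
container hitting its physical memory cap, and for the driver in yarn-cluster
mode that cap is roughly --driver-memory plus the spark.yarn.driver.memoryOverhead
cushion. So it may be worth raising the overhead alongside the heap, e.g.
(the 4096 MB figure is purely illustrative):

spark-submit --master yarn-cluster \
  --driver-memory 24g \
  --conf spark.yarn.driver.memoryOverhead=4096 \
  <rest of the submit command as before>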


Thanks,
Ewan

-----Original Message-----
From: andrew.row...@thomsonreuters.com 
[mailto:andrew.row...@thomsonreuters.com] 
Sent: 27 August 2015 11:53
To: user@spark.apache.org
Subject: Driver running out of memory - caused by many tasks?

I have a Spark 1.4.1 on YARN job where the first stage has ~149,000 tasks 
(it’s reading a few TB of data). The job itself is fairly simple - it’s just 
getting a list of distinct values:

val days = spark
  .sequenceFile(inputDir, classOf[KeyClass], classOf[ValueClass])
  .sample(withReplacement = false, fraction = 0.01)
  .map(row => row._1.getTimestamp.toString("yyyy-MM-dd"))
  .distinct()
  .collect()

The cardinality of the ‘day’ is quite small - there’s only a handful. However, 
I’m frequently running into OutOfMemory issues on the driver. I’ve had it fail 
with 24GB RAM, and am currently nudging it upwards to find out where it works. 
The ratio between input and shuffle write in the distinct stage is about 
3TB:7MB. On a smaller dataset, it works without issue on a smaller (4GB) heap. 
In YARN cluster mode, I get a failure message similar to:

Container 
[pid=36844,containerID=container_e15_1438040390147_4982_01_01] is running 
beyond physical memory limits. Current usage: 27.6 GB of 27 GB physical memory 
used; 29.5 GB of 56.7 GB virtual memory used. Killing container.


Is the driver running out of memory simply due to the number of tasks, or is 
there something about the job program that’s causing it to put a lot of data 
into the driver heap and go oom? If the former, is there any general guidance 
about the amount of memory to give to the driver as a function of how many 
tasks there are?

Andrew


Re: Driver running out of memory - caused by many tasks?

2015-08-27 Thread andrew.rowson
I should have mentioned: yes I am using Kryo and have registered KeyClass and 
ValueClass.



I guess it’s not clear to me what is actually taking up space on the driver 
heap - I can’t see how it can be data with the code that I have.

On 27/08/2015 12:09, Ewan Leith ewan.le...@realitymine.com wrote:

Are you using the Kryo serializer? If not, have a look at it, it can save a 
lot of memory during shuffles

https://spark.apache.org/docs/latest/tuning.html

I did a similar task and had various issues with the volume of data being 
parsed in one go, but Kryo helped a lot. The main difference I can see between 
your job and mine is that my input classes were just a string and a byte 
array, which I then processed once they had been read into the RDD - maybe 
your classes are memory-heavy?


Thanks,
Ewan

-----Original Message-----
From: andrew.row...@thomsonreuters.com 
[mailto:andrew.row...@thomsonreuters.com] 
Sent: 27 August 2015 11:53
To: user@spark.apache.org
Subject: Driver running out of memory - caused by many tasks?

I have a Spark 1.4.1 on YARN job where the first stage has ~149,000 tasks 
(it’s reading a few TB of data). The job itself is fairly simple - it’s just 
getting a list of distinct values:

val days = spark
  .sequenceFile(inputDir, classOf[KeyClass], classOf[ValueClass])
  .sample(withReplacement = false, fraction = 0.01)
  .map(row => row._1.getTimestamp.toString("yyyy-MM-dd"))
  .distinct()
  .collect()

The cardinality of the ‘day’ is quite small - there’s only a handful. However, 
I’m frequently running into OutOfMemory issues on the driver. I’ve had it fail 
with 24GB RAM, and am currently nudging it upwards to find out where it works. 
The ratio between input and shuffle write in the distinct stage is about 
3TB:7MB. On a smaller dataset, it works without issue on a smaller (4GB) heap. 
In YARN cluster mode, I get a failure message similar to:

Container 
 [pid=36844,containerID=container_e15_1438040390147_4982_01_01] is running 
 beyond physical memory limits. Current usage: 27.6 GB of 27 GB physical 
 memory used; 29.5 GB of 56.7 GB virtual memory used. Killing container.


Is the driver running out of memory simply due to the number of tasks, or is 
there something about the job program that’s causing it to put a lot of data 
into the driver heap and go oom? If the former, is there any general guidance 
about the amount of memory to give to the driver as a function of how many 
tasks there are?

Andrew
