I suggest taking a heap dump of the driver process using jmap, then opening
that dump in a tool like VisualVM to see which object(s) are taking up heap
space. It is easy to do. We did this and found that in our case it was the
data structure that stores info about stages, jobs and tasks. There can be
other reasons as well, of course.
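
For example (the dump file name is a placeholder; run jmap as the same user
as the driver JVM):

    # -dump:live triggers a full GC first and writes only live objects
    jmap -dump:live,format=b,file=driver-heap.hprof <driver pid>

Open the resulting .hprof in VisualVM and sort by retained size. If it does
turn out to be the job/stage/task bookkeeping, one thing to try is lowering
the UI retention settings - the values below are only illustrative:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      // keep fewer completed jobs/stages in the driver's listener state
      .set("spark.ui.retainedJobs", "200")
      .set("spark.ui.retainedStages", "200")
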
On Aug 27, 2015 4:17 AM, <andrew.row...@thomsonreuters.com> wrote:

> I should have mentioned: yes I am using Kryo and have registered KeyClass
> and ValueClass.
>
>
>
> I guess it’s not clear to me what is actually taking up space on the
> driver heap - I can’t see how it can be data with the code that I have.
>
> On 27/08/2015 12:09, "Ewan Leith" <ewan.le...@realitymine.com> wrote:
>
> >Are you using the Kryo serializer? If not, have a look at it; it can save
> >a lot of memory during shuffles.
> >
> >https://spark.apache.org/docs/latest/tuning.html
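> >
> >Turning it on is just a SparkConf setting, something like this (a sketch
> >only - adjust to your own setup):
> >
> >    import org.apache.spark.SparkConf
> >
> >    val conf = new SparkConf()
> >      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")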
> >
> >I did a similar task and had various issues with the volume of data being
> >parsed in one go, but Kryo helped a lot. It looks like the main difference
> >between what you're doing and what I did is that my input classes were just
> >a string and a byte array, which I then processed once they were read into
> >the RDD - maybe your classes are memory heavy?
> >
> >
> >Thanks,
> >Ewan
> >
> >-----Original Message-----
> >From: andrew.row...@thomsonreuters.com [mailto:
> andrew.row...@thomsonreuters.com]
> >Sent: 27 August 2015 11:53
> >To: user@spark.apache.org
> >Subject: Driver running out of memory - caused by many tasks?
> >
> >I have a Spark 1.4.1 on YARN job where the first stage has ~149,000
> tasks (it’s reading a few TB of data). The job itself is fairly simple -
> it’s just getting a list of distinct values:
> >
> >    val days = spark
> >      .sequenceFile(inputDir, classOf[KeyClass], classOf[ValueClass])
> >      .sample(withReplacement = false, fraction = 0.01)
> >      .map(row => row._1.getTimestamp.toString("yyyy-MM-dd"))
> >      .distinct()
> >      .collect()
> >
> >The cardinality of the ‘day’ is quite small - there’s only a handful.
> However, I’m frequently running into OutOfMemory issues on the driver. I’ve
> had it fail with 24GB RAM, and am currently nudging it upwards to find out
> where it works. The ratio between input and shuffle write in the distinct
> stage is about 3TB:7MB. On a smaller dataset, it works without issue on a
> smaller (4GB) heap. In YARN cluster mode, I get a failure message similar
> to:
> >
> >    Container
> [pid=36844,containerID=container_e15_1438040390147_4982_01_000001] is
> running beyond physical memory limits. Current usage: 27.6 GB of 27 GB
> physical memory used; 29.5 GB of 56.7 GB virtual memory used. Killing
> container.
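> >
> >(For context: that 27 GB limit is what YARN enforces for the whole driver
> >container, i.e. spark.driver.memory plus spark.yarn.driver.memoryOverhead,
> >so the container gets killed when the entire JVM process - heap plus
> >off-heap - goes over it. My submit looks roughly like this; the class/jar
> >names are placeholders and the exact numbers vary as I nudge the driver
> >memory up:)
> >
> >    spark-submit \
> >      --master yarn-cluster \
> >      --driver-memory 24g \
> >      --conf spark.yarn.driver.memoryOverhead=3072 \
> >      --class com.example.DistinctDays \
> >      myjob.jar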
> >
> >
> >Is the driver running out of memory simply due to the number of tasks, or
> is there something about the job program that’s causing it to put a lot of
> data into the driver heap and go OOM? If the former, is there any general
> guidance about the amount of memory to give to the driver as a function of
> how many tasks there are?
> >
> >Andrew
