Error communicating with MapOutputTracker

2015-03-04 Thread Thomas Gerber
Hello,

We are using spark 1.2.1 on a very large cluster (100 c3.8xlarge workers).
We use spark-submit to start an application.

We got the following error which leads to a failed stage:

Job aborted due to stage failure: Task 3095 in stage 140.0 failed 4
times, most recent failure: Lost task 3095.3 in stage 140.0 (TID
308697, ip-10-0-12-88.ec2.internal): org.apache.spark.SparkException:
Error communicating with MapOutputTracker


We tried the whole application again, and it failed on the same stage (but
it got more tasks completed on that stage) with the same error.

We then looked at executors stderr, and all show similar logs, on both runs
(see below). As far as we can tell, executors and master have disk space
left.

*Any suggestion on where to look to understand why the communication with
the MapOutputTracker fails?*

Thanks
Thomas

In case it matters, our akka settings:
spark.akka.frameSize 50
spark.akka.threads 8
// those below are 10* the default, to cope with large GCs
spark.akka.timeout 1000
spark.akka.heartbeat.pauses 6
spark.akka.failure-detector.threshold 3000.0
spark.akka.heartbeat.interval 1

Appendix: executor logs, where it starts going awry

15/03/04 11:45:00 INFO CoarseGrainedExecutorBackend: Got assigned task 298525
15/03/04 11:45:00 INFO Executor: Running task 3083.0 in stage 140.0 (TID 298525)
15/03/04 11:45:00 INFO MemoryStore: ensureFreeSpace(1473) called with
curMem=5543008799, maxMem=18127202549
15/03/04 11:45:00 INFO MemoryStore: Block broadcast_339_piece0 stored
as bytes in memory (estimated size 1473.0 B, free 11.7 GB)
15/03/04 11:45:00 INFO BlockManagerMaster: Updated info of block
broadcast_339_piece0
15/03/04 11:45:00 INFO TorrentBroadcast: Reading broadcast variable
339 took 224 ms
15/03/04 11:45:00 INFO MemoryStore: ensureFreeSpace(2536) called with
curMem=5543010272, maxMem=18127202549
15/03/04 11:45:00 INFO MemoryStore: Block broadcast_339 stored as
values in memory (estimated size 2.5 KB, free 11.7 GB)
15/03/04 11:45:00 INFO MapOutputTrackerWorker: Don't have map outputs
for shuffle 18, fetching them
15/03/04 11:45:00 INFO MapOutputTrackerWorker: Doing the fetch;
tracker actor =
Actor[akka.tcp://sparkDriver@ip-10-0-10-17.ec2.internal:52380/user/MapOutputTracker#-2057016370]
15/03/04 11:45:00 INFO MapOutputTrackerWorker: Don't have map outputs
for shuffle 18, fetching them
15/03/04 11:45:00 INFO MapOutputTrackerWorker: Don't have map outputs
for shuffle 18, fetching them
15/03/04 11:45:00 INFO MapOutputTrackerWorker: Don't have map outputs
for shuffle 18, fetching them
15/03/04 11:45:00 INFO MapOutputTrackerWorker: Don't have map outputs
for shuffle 18, fetching them
15/03/04 11:45:00 INFO MapOutputTrackerWorker: Don't have map outputs
for shuffle 18, fetching them
15/03/04 11:45:00 INFO MapOutputTrackerWorker: Don't have map outputs
for shuffle 18, fetching them
15/03/04 11:45:00 INFO MapOutputTrackerWorker: Don't have map outputs
for shuffle 18, fetching them
15/03/04 11:45:00 INFO MapOutputTrackerWorker: Don't have map outputs
for shuffle 18, fetching them
15/03/04 11:45:00 INFO MapOutputTrackerWorker: Don't have map outputs
for shuffle 18, fetching them
15/03/04 11:45:00 INFO MapOutputTrackerWorker: Don't have map outputs
for shuffle 18, fetching them
15/03/04 11:45:00 INFO MapOutputTrackerWorker: Don't have map outputs
for shuffle 18, fetching them
15/03/04 11:45:00 INFO MapOutputTrackerWorker: Don't have map outputs
for shuffle 18, fetching them
15/03/04 11:45:00 INFO MapOutputTrackerWorker: Don't have map outputs
for shuffle 18, fetching them
15/03/04 11:45:00 INFO MapOutputTrackerWorker: Don't have map outputs
for shuffle 18, fetching them
15/03/04 11:45:00 INFO MapOutputTrackerWorker: Don't have map outputs
for shuffle 18, fetching them
15/03/04 11:45:00 INFO MapOutputTrackerWorker: Don't have map outputs
for shuffle 18, fetching them
15/03/04 11:45:00 INFO MapOutputTrackerWorker: Don't have map outputs
for shuffle 18, fetching them
15/03/04 11:45:00 INFO MapOutputTrackerWorker: Don't have map outputs
for shuffle 18, fetching them
15/03/04 11:45:00 INFO MapOutputTrackerWorker: Don't have map outputs
for shuffle 18, fetching them
15/03/04 11:45:00 INFO MapOutputTrackerWorker: Don't have map outputs
for shuffle 18, fetching them
15/03/04 11:45:00 INFO MapOutputTrackerWorker: Don't have map outputs
for shuffle 18, fetching them
15/03/04 11:45:00 INFO MapOutputTrackerWorker: Don't have map outputs
for shuffle 18, fetching them
15/03/04 11:45:00 INFO MapOutputTrackerWorker: Don't have map outputs
for shuffle 18, fetching them
15/03/04 11:45:00 INFO MapOutputTrackerWorker: Don't have map outputs
for shuffle 18, fetching them
15/03/04 11:45:00 INFO MapOutputTrackerWorker: Don't have map outputs
for shuffle 18, fetching them
15/03/04 11:45:00 INFO MapOutputTrackerWorker: Don't have map outputs
for shuffle 18, f

Re: Error communicating with MapOutputTracker

2015-05-15 Thread Thomas Gerber
Hi Imran,

Thanks for the advice; tweaking some akka parameters helped. See below.

Now, we noticed that we get Java heap OOM exceptions in the map output tracker
when we have too many tasks. I wonder:
1. where does the map output tracker live? The driver? The master (when
those are not the same)?
2. how can we increase the heap for it? Especially when using spark-submit?

Thanks,
Thomas

PS: akka parameters that one might want to increase:
# akka timeouts/heartbeats settings multiplied by 10 to avoid problems
spark.akka.timeout 1000
spark.akka.heartbeat.pauses 6
spark.akka.failure-detector.threshold 3000.0
spark.akka.heartbeat.interval 1

# Hidden akka conf to avoid MapOutputTracker timeouts
# See
https://github.com/apache/spark/blob/branch-1.3/core/src/main/scala/org/apache/spark/util/AkkaUtils.scala
spark.akka.askTimeout 300
spark.akka.lookupTimeout 300
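
For reference, a minimal sketch of applying these settings programmatically on
the SparkConf before the SparkContext is created (illustrative only -- the
application name is hypothetical, and the same keys can equally go in
spark-defaults.conf or be passed with --conf to spark-submit):

import org.apache.spark.{SparkConf, SparkContext}

// values are the ones listed above; the heartbeat/failure-detector settings
// from the earlier list can be set the same way
val conf = new SparkConf()
  .setAppName("my-app")                    // hypothetical name
  .set("spark.akka.timeout", "1000")
  .set("spark.akka.askTimeout", "300")
  .set("spark.akka.lookupTimeout", "300")
val sc = new SparkContext(conf)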

On Fri, Mar 20, 2015 at 9:18 AM, Imran Rashid  wrote:

> Hi Thomas,
>
> sorry for such a late reply.  I don't have any super-useful advice, but
> this seems like something that is important to follow up on.  to answer
> your immediate question, no, there should not be any hard limit to the
> number of tasks that MapOutputTracker can handle.  Though of course as
> things get bigger, the overheads increase, which is why you might hit
> timeouts.
>
> Two other minor suggestions:
> (1) increase spark.akka.askTimeout -- that's the timeout you are running
> into; it defaults to 30 seconds
> (2) as you've noted, you've needed to play w/ other timeouts b/c of long
> GC pauses -- it's possible some GC tuning might help, though it's a bit of a
> black art so it's hard to say what you can try.  You could always try
> Concurrent Mark Sweep (CMS) to avoid the long pauses, but of course that will
> probably hurt overall performance.
>
> can you share any more details of what you are trying to do?
>
> Since you're fetching shuffle blocks in a shuffle map task, I guess you've
> got two shuffles back-to-back, eg.
> someRDD.reduceByKey{...}.map{...}.filter{...}.combineByKey{...}.  Do you
> expect to be doing a lot of GC in between the two shuffles?? -eg., in the
> little example I have, if there were lots of objects being created in the
> map & filter steps that will make it out of the eden space.  One possible
> solution to this would be to force the first shuffle to complete, before
> running any of the subsequent transformations, eg. by forcing
> materialization to the cache first
>
> // requires: import org.apache.spark.storage.StorageLevel
> val intermediateRDD = someRDD.reduceByKey{...}.persist(StorageLevel.DISK_ONLY)
> // force the shuffle to complete, without trying to do our complicated
> // downstream logic at the same time
> intermediateRDD.count()
>
> val finalResult = intermediateRDD.map{...}.filter{...}.combineByKey{...}
>
> Also, can you share your data size?  Do you expect the shuffle to be
> skewed, or do you think it will be well-balanced?  Not that I'll have any
> suggestions for you based on the answer, but it may help us reproduce it
> and try to fix whatever the root cause is.
>
> thanks,
> Imran
>
>
>
> On Wed, Mar 4, 2015 at 12:30 PM, Thomas Gerber 
> wrote:
>
>> I meant spark.default.parallelism of course.
>>
>> On Wed, Mar 4, 2015 at 10:24 AM, Thomas Gerber 
>> wrote:
>>
>>> Follow up:
>>> We re-retried, this time after *decreasing* spark.parallelism. It was
>>> set to 16000 before, (5 times the number of cores in our cluster). It is
>>> now down to 6400 (2 times the number of cores).
>>>
>>> And it got past the point where it failed before.
>>>
>>> Does the MapOutputTracker have a limit on the number of tasks it can
>>> track?
>>>
>>>
>>> On Wed, Mar 4, 2015 at 8:15 AM, Thomas Gerber 
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> We are using spark 1.2.1 on a very large cluster (100 c3.8xlarge
>>>> workers). We use spark-submit to start an application.
>>>>
>>>> We got the following error which leads to a failed stage:
>>>>
>>>> Job aborted due to stage failure: Task 3095 in stage 140.0 failed 4 times, 
>>>> most recent failure: Lost task 3095.3 in stage 140.0 (TID 308697, 
>>>> ip-10-0-12-88.ec2.internal): org.apache.spark.SparkException: Error 
>>>> communicating with MapOutputTracker
>>>>
>>>>
>>>> We tried the whole application again, and it failed on the same stage
>>>> (but it got more tasks completed on that stage) with the same error.
>>>>
>>>> We then looked at executors stderr, and all show similar logs, on both
>>>> runs (see below). As far as we can tell, executors and master hav

Re: Error communicating with MapOutputTracker

2015-05-18 Thread Imran Rashid
On Fri, May 15, 2015 at 5:09 PM, Thomas Gerber 
wrote:

> Now, we noticed that we get java heap OOM exceptions on the output tracker
> when we have too many tasks. I wonder:
> 1. where does the map output tracker live? The driver? The master (when
> those are not the same)?
> 2. how can we increase the heap for it? Especially when using spark-submit?
>

It does not live on the master -- the master only exists in a standalone
cluster, and it does very little work.  (Though there are *Master and *Worker
variants of the class, it's really running on the driver and the executors.)  If you
are getting OOMs in the MapOutputTrackerMaster (which lives on the driver),
then you can increase the memory for the driver via the normal args for
controlling driver memory, with "--driver-memory 10G" or whatever.
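
As a quick illustrative sanity check, you can confirm from inside the driver
program that the --driver-memory value actually took effect on the driver JVM:

// minimal sketch, executed in the driver program
val maxHeapGiB = Runtime.getRuntime.maxMemory.toDouble / (1024L * 1024 * 1024)
println(f"Driver JVM max heap: $maxHeapGiB%.2f GiB")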

Just to be clear, if you hit an OOM from somewhere in the MapOutputTracker
code, it just means that code is what pushed things over the top.  Of
course you could have 99% of your memory used by something else, perhaps
your own data structures, which perhaps could be trimmed down.  You could
get a heap dump on the driver to see where the memory is really getting
used.
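
The usual tool for that is jmap against the driver PID; as a sketch (assuming
a HotSpot JVM, and an illustrative output path), a dump can also be triggered
from inside the driver program and analyzed offline:

import java.lang.management.ManagementFactory
import com.sun.management.HotSpotDiagnosticMXBean

// sketch: write a heap dump of the driver JVM to a local file, then inspect
// it with e.g. Eclipse MAT or jhat
val diag = ManagementFactory.newPlatformMXBeanProxy(
  ManagementFactory.getPlatformMBeanServer,
  "com.sun.management:type=HotSpotDiagnostic",
  classOf[HotSpotDiagnosticMXBean])
diag.dumpHeap("/tmp/driver-heap.hprof", true)  // true = live objects only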

Do you mind sharing the details of how you hit these OOMs?  How much
memory, how many partitions on each side of the shuffle?  Sort based
shuffle I assume?

thanks,
Imran


Re: Error communicating with MapOutputTracker

2015-03-04 Thread Thomas Gerber
Follow up:
We retried again, this time after *decreasing* spark.parallelism. It was set
to 16000 before (5 times the number of cores in our cluster); it is now down
to 6400 (2 times the number of cores).

And it got past the point where it failed before.

Does the MapOutputTracker have a limit on the number of tasks it can track?


On Wed, Mar 4, 2015 at 8:15 AM, Thomas Gerber 
wrote:

> Hello,
>
> We are using spark 1.2.1 on a very large cluster (100 c3.8xlarge workers).
> We use spark-submit to start an application.
>
> We got the following error which leads to a failed stage:
>
> Job aborted due to stage failure: Task 3095 in stage 140.0 failed 4 times, 
> most recent failure: Lost task 3095.3 in stage 140.0 (TID 308697, 
> ip-10-0-12-88.ec2.internal): org.apache.spark.SparkException: Error 
> communicating with MapOutputTracker
>
>
> We tried the whole application again, and it failed on the same stage (but
> it got more tasks completed on that stage) with the same error.
>
> We then looked at executors stderr, and all show similar logs, on both
> runs (see below). As far as we can tell, executors and master have disk
> space left.
>
> *Any suggestion on where to look to understand why the communication with
> the MapOutputTracker fails?*
>
> Thanks
> Thomas
> 
> In case it matters, our akka settings:
> spark.akka.frameSize 50
> spark.akka.threads 8
> // those below are 10* the default, to cope with large GCs
> spark.akka.timeout 1000
> spark.akka.heartbeat.pauses 6
> spark.akka.failure-detector.threshold 3000.0
> spark.akka.heartbeat.interval 1

Re: Error communicating with MapOutputTracker

2015-03-04 Thread Thomas Gerber
I meant spark.default.parallelism of course.
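
For reference, a minimal sketch of where that setting goes (the 6400 value is
just the one from this thread; it could equally be passed to spark-submit with
--conf spark.default.parallelism=6400):

import org.apache.spark.SparkConf

// lower the default number of partitions used when no parallelism is given
val conf = new SparkConf()
  .set("spark.default.parallelism", "6400")   // was 16000 before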

On Wed, Mar 4, 2015 at 10:24 AM, Thomas Gerber 
wrote:

> Follow up:
> We re-retried, this time after *decreasing* spark.parallelism. It was set
> to 16000 before, (5 times the number of cores in our cluster). It is now
> down to 6400 (2 times the number of cores).
>
> And it got past the point where it failed before.
>
> Does the MapOutputTracker have a limit on the number of tasks it can track?
>
>
> On Wed, Mar 4, 2015 at 8:15 AM, Thomas Gerber 
> wrote:
>
>> Hello,
>>
>> We are using spark 1.2.1 on a very large cluster (100 c3.8xlarge
>> workers). We use spark-submit to start an application.
>>
>> We got the following error which leads to a failed stage:
>>
>> Job aborted due to stage failure: Task 3095 in stage 140.0 failed 4 times, 
>> most recent failure: Lost task 3095.3 in stage 140.0 (TID 308697, 
>> ip-10-0-12-88.ec2.internal): org.apache.spark.SparkException: Error 
>> communicating with MapOutputTracker
>>
>>
>> We tried the whole application again, and it failed on the same stage
>> (but it got more tasks completed on that stage) with the same error.
>>
>> We then looked at executors stderr, and all show similar logs, on both
>> runs (see below). As far as we can tell, executors and master have disk
>> space left.
>>
>> *Any suggestion on where to look to understand why the communication with
>> the MapOutputTracker fails?*
>>
>> Thanks
>> Thomas
>> 
>> In case it matters, our akka settings:
>> spark.akka.frameSize 50
>> spark.akka.threads 8
>> // those below are 10* the default, to cope with large GCs
>> spark.akka.timeout 1000
>> spark.akka.heartbeat.pauses 6
>> spark.akka.failure-detector.threshold 3000.0
>> spark.akka.heartbeat.interval 1

Re: Error communicating with MapOutputTracker

2015-03-20 Thread Imran Rashid
Hi Thomas,

sorry for such a late reply.  I don't have any super-useful advice, but
this seems like something that is important to follow up on.  to answer
your immediate question, no, there should not be any hard limit to the
number of tasks that MapOutputTracker can handle.  Though of course as
things get bigger, the overheads increase, which is why you might hit
timeouts.

Two other minor suggestions:
(1) increase spark.akka.askTimeout -- that's the timeout you are running
into; it defaults to 30 seconds
(2) as you've noted, you've needed to play w/ other timeouts b/c of long GC
pauses -- it's possible some GC tuning might help, though it's a bit of a
black art so it's hard to say what you can try.  You could always try
Concurrent Mark Sweep (CMS) to avoid the long pauses, but of course that will
probably hurt overall performance.
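
A minimal sketch of what (2) could look like (assumption: the GC flags are
applied to the executors via spark.executor.extraJavaOptions; the driver JVM's
own flags would have to be supplied when it is launched, e.g. through
spark-submit):

import org.apache.spark.SparkConf

// illustrative flags only -- enable CMS and GC logging on the executors
val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions",
       "-XX:+UseConcMarkSweepGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")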

can you share any more details of what you are trying to do?

Since you're fetching shuffle blocks in a shuffle map task, I guess you've
got two shuffles back-to-back, eg.
someRDD.reduceByKey{...}.map{...}.filter{...}.combineByKey{...}.  Do you
expect to be doing a lot of GC in between the two shuffles?? -eg., in the
little example I have, if there were lots of objects being created in the
map & filter steps that will make it out of the eden space.  One possible
solution to this would be to force the first shuffle to complete, before
running any of the subsequent transformations, eg. by forcing
materialization to the cache first

// requires: import org.apache.spark.storage.StorageLevel
val intermediateRDD = someRDD.reduceByKey{...}.persist(StorageLevel.DISK_ONLY)
// force the shuffle to complete, without trying to do our complicated
// downstream logic at the same time
intermediateRDD.count()

val finalResult = intermediateRDD.map{...}.filter{...}.combineByKey{...}

Also, can you share your data size?  Do you expect the shuffle to be
skewed, or do you think it will be well-balanced?  Not that I'll have any
suggestions for you based on the answer, but it may help us reproduce it
and try to fix whatever the root cause is.

thanks,
Imran



On Wed, Mar 4, 2015 at 12:30 PM, Thomas Gerber 
wrote:

> I meant spark.default.parallelism of course.
>
> On Wed, Mar 4, 2015 at 10:24 AM, Thomas Gerber 
> wrote:
>
>> Follow up:
>> We re-retried, this time after *decreasing* spark.parallelism. It was set
>> to 16000 before, (5 times the number of cores in our cluster). It is now
>> down to 6400 (2 times the number of cores).
>>
>> And it got past the point where it failed before.
>>
>> Does the MapOutputTracker have a limit on the number of tasks it can
>> track?
>>
>>
>> On Wed, Mar 4, 2015 at 8:15 AM, Thomas Gerber 
>> wrote:
>>
>>> Hello,
>>>
>>> We are using spark 1.2.1 on a very large cluster (100 c3.8xlarge
>>> workers). We use spark-submit to start an application.
>>>
>>> We got the following error which leads to a failed stage:
>>>
>>> Job aborted due to stage failure: Task 3095 in stage 140.0 failed 4 times, 
>>> most recent failure: Lost task 3095.3 in stage 140.0 (TID 308697, 
>>> ip-10-0-12-88.ec2.internal): org.apache.spark.SparkException: Error 
>>> communicating with MapOutputTracker
>>>
>>>
>>> We tried the whole application again, and it failed on the same stage
>>> (but it got more tasks completed on that stage) with the same error.
>>>
>>> We then looked at executors stderr, and all show similar logs, on both
>>> runs (see below). As far as we can tell, executors and master have disk
>>> space left.
>>>
>>> *Any suggestion on where to look to understand why the communication
>>> with the MapOutputTracker fails?*
>>>
>>> Thanks
>>> Thomas
>>> 
>>> In case it matters, our akka settings:
>>> spark.akka.frameSize 50
>>> spark.akka.threads 8
>>> // those below are 10* the default, to cope with large GCs
>>> spark.akka.timeout 1000
>>> spark.akka.heartbeat.pauses 6
>>> spark.akka.failure-detector.threshold 3000.0
>>> spark.akka.heartbeat.interval 1

[spark upgrade] Error communicating with MapOutputTracker when running test cases in latest spark

2014-09-10 Thread Adrian Mocanu
I use
https://github.com/apache/spark/blob/master/streaming/src/test/scala/org/apache/spark/streaming/TestSuiteBase.scala
to help me with testing.

In spark 0.9.1 my tests depending on TestSuiteBase worked fine. As soon as I 
switched to the latest (1.0.1) all tests fail. My sbt import is: "org.apache.spark" 
%% "spark-core" % "1.1.0-SNAPSHOT" % "provided"

One exception I get is:
Error communicating with MapOutputTracker
org.apache.spark.SparkException: Error communicating with MapOutputTracker 

How can I fix this?

Found a thread on this error but not very helpful: 
http://mail-archives.apache.org/mod_mbox/spark-dev/201405.mbox/%3ctencent_6b37d69c54f76819509a5...@qq.com%3E

-Adrian