I've tried various ideas, but I'm really just shooting in the dark.

I have an 8-node cluster of r3.8xlarge machines. The RDD I'm trying to save to 
S3 has 1024 partitions and is approximately 1 TB in size, with the partitions 
fairly evenly sized.
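
For context, the save itself is essentially a single action. A minimal sketch, 
assuming a plain saveAsTextFile to an s3n:// path (the bucket and output path 
are placeholders, and the AWS credential configuration is omitted):

    // Minimal sketch of the save; "my-bucket" and the path are placeholders.
    // rdd is the ~1 TB, 1024-partition RDD described above.
    rdd.saveAsTextFile("s3n://my-bucket/output/")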

I just tried a test that dialed the job back from using the entire cluster 
(256 cores) down to 128. Things seemed to get a bit further (maybe) before the 
wheels came off again, but the job still always fails, and all I'm trying to do 
is save the 1 TB RDD to S3.
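
For reference, the kind of setting involved in capping the job at 128 of the 
256 cores under the standalone master is spark.cores.max. A minimal sketch, 
with the app name and exact values as placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    // Illustrative only: cap this application at 128 total cores cluster-wide.
    val conf = new SparkConf()
      .setAppName("save-rdd-to-s3")        // placeholder app name
      .set("spark.cores.max", "128")       // half of the cluster's 256 cores
    val sc = new SparkContext(conf)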

I see the following in my master log file.

15/01/23 19:01:54 WARN master.Master: Removing worker-20150123172316 because we 
got no heartbeat in 60 seconds
15/01/23 19:01:54 INFO master.Master: Removing worker worker-20150123172316 on 
15/01/23 19:01:54 INFO master.Master: Telling app of lost executor: 3
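
For what it's worth, that 60-second window matches the standalone master's 
worker timeout (spark.worker.timeout, which defaults to 60 seconds). Purely as 
an illustration, it could be raised on the master node along these lines (the 
120 is an arbitrary example value):

    # In spark-env.sh on the master -- illustrative sketch only.
    # Raises the timeout behind the "no heartbeat in 60 seconds" message above.
    SPARK_MASTER_OPTS="-Dspark.worker.timeout=120"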

For the stage that eventually fails, I see the following summary information.

Summary Metrics for 729 Completed Tasks

                         Min       25th pct  Median    75th pct  Max
Duration                 2.5 min   4.8 min   5.5 min   6.3 min   9.2 min
GC Time                  0 ms      0.3 s     0.4 s     0.5 s     5 s
Shuffle Read (Remote)    309.3 MB  321.7 MB  325.4 MB  329.6 MB  350.6 MB

So the maximum GC time across the 729 completed tasks was only 5 s, which 
sounds reasonable. GC pressure is the reason people usually give for losing 
executors, but that does not appear to be the case here.

Here is a typical snapshot of some completed tasks. You can see that they tend 
to complete in approximately 6 minutes, so it takes about 6 minutes to write 
one partition (roughly 1 GB) to S3, i.e. on the order of 3 MB/s per task.

Index  Task ID  Attempt  Status   Locality  Executor/Host  Launch Time          Duration  GC Time  Shuffle Read
65     23619    0        SUCCESS  ANY       5 /            2015/01/23 18:30:32  5.8 min   0.9 s    344.6 MB
59     23613    0        SUCCESS  ANY       7 /            2015/01/23 18:30:32  6.0 min   0.4 s    324.1 MB
68     23622    0        SUCCESS  ANY       1 /            2015/01/23 18:30:32  5.7 min   0.5 s    329.9 MB
62     23616    0        SUCCESS  ANY       6 /            2015/01/23 18:30:32  5.8 min   0.7 s    326.4 MB
61     23615    0        SUCCESS  ANY       3 /            2015/01/23 18:30:32  5.5 min   1 s      335.7 MB
64     23618    0        SUCCESS  ANY       2 /            2015/01/23 18:30:32  5.6 min   2 s      328.1 MB

Then towards the end, when things start heading south, I see the following. 
These tasks never complete, but you can see they had already been running for 
more than 47 minutes when the job finally failed. I'm not really sure why.

Index  Task ID  Attempt  Status   Locality  Executor/Host  Launch Time          Duration
671    24225    0        RUNNING  ANY       1 /            2015/01/23 18:59:14  47 min
672    24226    0        RUNNING  ANY       1 /            2015/01/23 18:59:14  47 min
673    24227    0        RUNNING  ANY       1 /            2015/01/23 18:59:14  47 min
674    24228    0        RUNNING  ANY       1 /            2015/01/23 18:59:14  47 min
675    24229    0        RUNNING  ANY       1 /            2015/01/23 18:59:14  47 min
676    24230    0        RUNNING  ANY       1 /            2015/01/23 18:59:14  47 min
677    24231    0        RUNNING  ANY       1 /            2015/01/23 18:59:14  47 min
678    24232    0        RUNNING  ANY       1 /            2015/01/23 18:59:14  47 min
679    24233    0        RUNNING  ANY       1 /            2015/01/23 18:59:14  47 min
680    24234    0        RUNNING  ANY       1 /            2015/01/23 18:59:17  47 min
681    24235    0        RUNNING  ANY       1 /            2015/01/23 18:59:18  47 min
682    24236    0        RUNNING  ANY       1 /            2015/01/23 18:59:18  47 min
683    24237    0        RUNNING  ANY       5 /            2015/01/23 18:59:20  47 min
684    24238    0        RUNNING  ANY       5 /            2015/01/23 18:59:20  47 min
685    24239    0        RUNNING  ANY       5 /            2015/01/23 18:59:20  47 min
686    24240    0        RUNNING  ANY       5 /            2015/01/23 18:59:20  47 min
687    24241    0        RUNNING  ANY       5 /            2015/01/23 18:59:20  47 min
688    24242    0        RUNNING  ANY       5 /            2015/01/23 18:59:20  47 min
689    24243    0        RUNNING  ANY       5 /            2015/01/23 18:59:20  47 min
690    24244    0        RUNNING  ANY       5 /            2015/01/23 18:59:20  47 min
691    24245    0        RUNNING  ANY       5 /            2015/01/23 18:59:21  47 min

What's odd is that even on the same machine (see below), some tasks are still 
completing in less than 5 minutes while other tasks on that same machine appear 
to be hung after 46 minutes. Keep in mind all I'm doing is saving the RDD to 
S3, so one would think the amount of work per task/partition would be fairly 
equal.

Index  Task ID  Attempt  Status   Locality  Executor/Host  Launch Time          Duration  GC Time  Shuffle Read
694    24248    0        SUCCESS  ANY       0 /            2015/01/23 18:59:32  4.5 min   0.3 s    326.5 MB
695    24249    0        SUCCESS  ANY       0 /            2015/01/23 18:59:32  4.5 min   0.3 s    330.8 MB
696    24250    0        RUNNING  ANY       0 /            2015/01/23 18:59:32  46 min
697    24251    0        RUNNING  ANY       0 /            2015/01/23 18:59:32  46 min
698    24252    0        SUCCESS  ANY       0 /            2015/01/23 18:59:32  4.5 min   0.3 s    325.8 MB
699    24253    0        SUCCESS  ANY       0 /            2015/01/23 18:59:32  4.5 min   0.3 s    325.2 MB
700    24254    0        SUCCESS  ANY       0 /            2015/01/23 18:59:32  4.5 min   0.3 s    323.4 MB
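
A rough way to double-check that the partitions really are evenly sized is 
sketched below; it just sums the string length of each record per partition, 
so it is approximate and purely illustrative:

    // Illustrative sketch: approximate per-partition sizes by summing the length
    // of each record's string form, to confirm the partitions are roughly even.
    val partitionSizes = rdd
      .mapPartitionsWithIndex { (idx, iter) =>
        Iterator((idx, iter.map(_.toString.length.toLong).sum))
      }
      .collect()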

If anyone has any suggestions, please let me know. I've tried playing around 
with various configuration options, but nothing I've found so far fixes the 
underlying issue.

Thanks.

Darin.
