I've tried various ideas, but I'm really just shooting in the dark. I have an 8-node cluster of r3.8xlarge machines. The RDD I'm trying to save to S3 has 1024 partitions and is approximately 1 TB in size, with the partitions fairly evenly sized.
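For context, the job itself boils down to something like this (the bucket and path are placeholders, and it's saveAsTextFile here, though the shape is the same for the other save variants):

    import org.apache.spark.{SparkConf, SparkContext}

    object SaveToS3 {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("SaveToS3"))
        // Stand-in for the real ~1 TB, 1024-partition RDD, which comes
        // out of an upstream job (hence the shuffle reads in the stage).
        val rdd = sc.parallelize(1 to 1000000, 1024).map(_.toString)
        // The step that fails: write one file per partition to S3.
        rdd.saveAsTextFile("s3n://my-bucket/output") // placeholder bucket/prefix
        sc.stop()
      }
    }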
I just tried a test dialing back the number of executor cores from the entire cluster (256 cores) down to 128. Things seemed to get a bit farther (maybe) before the wheels came off again, but the job always fails, and all I'm trying to do is save the 1 TB RDD to S3.

I see the following in my master log file:

15/01/23 19:01:54 WARN master.Master: Removing worker-20150123172316 because we got no heartbeat in 60 seconds
15/01/23 19:01:54 INFO master.Master: Removing worker worker-20150123172316 on
15/01/23 19:01:54 INFO master.Master: Telling app of lost executor: 3

For the stage that eventually fails, I see the following summary information:

Summary Metrics for 729 Completed Tasks
                        Min       25th pct  Median    75th pct  Max
Duration                2.5 min   4.8 min   5.5 min   6.3 min   9.2 min
GC Time                 0 ms      0.3 s     0.4 s     0.5 s     5 s
Shuffle Read (Remote)   309.3 MB  321.7 MB  325.4 MB  329.6 MB  350.6 MB

So the maximum GC time across 729 completed tasks was only 5 s, which sounds reasonable. People tend to point to GC as the reason executors get lost, but that does not appear to be the case here.

Here is a typical snapshot of some completed tasks. They tend to finish in approximately 6 minutes, so it takes about 6 minutes to write one partition (roughly 1 GB) to S3, i.e. under 3 MB/s per task:

Index  ID     Attempt  Status   Locality  Executor  Launch Time          Duration  GC Time  Shuffle Read
65     23619  0        SUCCESS  ANY       5 /       2015/01/23 18:30:32  5.8 min   0.9 s    344.6 MB
59     23613  0        SUCCESS  ANY       7 /       2015/01/23 18:30:32  6.0 min   0.4 s    324.1 MB
68     23622  0        SUCCESS  ANY       1 /       2015/01/23 18:30:32  5.7 min   0.5 s    329.9 MB
62     23616  0        SUCCESS  ANY       6 /       2015/01/23 18:30:32  5.8 min   0.7 s    326.4 MB
61     23615  0        SUCCESS  ANY       3 /       2015/01/23 18:30:32  5.5 min   1 s      335.7 MB
64     23618  0        SUCCESS  ANY       2 /       2015/01/23 18:30:32  5.6 min   2 s      328.1 MB

Then, towards the end, when things start heading south, I see the following. These tasks never complete; they had been running for more than 47 minutes when the job finally failed, and I'm not really sure why:

Index  ID     Attempt  Status   Locality  Executor  Launch Time          Duration
671    24225  0        RUNNING  ANY       1 /       2015/01/23 18:59:14  47 min
672    24226  0        RUNNING  ANY       1 /       2015/01/23 18:59:14  47 min
673    24227  0        RUNNING  ANY       1 /       2015/01/23 18:59:14  47 min
674    24228  0        RUNNING  ANY       1 /       2015/01/23 18:59:14  47 min
675    24229  0        RUNNING  ANY       1 /       2015/01/23 18:59:14  47 min
676    24230  0        RUNNING  ANY       1 /       2015/01/23 18:59:14  47 min
677    24231  0        RUNNING  ANY       1 /       2015/01/23 18:59:14  47 min
678    24232  0        RUNNING  ANY       1 /       2015/01/23 18:59:14  47 min
679    24233  0        RUNNING  ANY       1 /       2015/01/23 18:59:14  47 min
680    24234  0        RUNNING  ANY       1 /       2015/01/23 18:59:17  47 min
681    24235  0        RUNNING  ANY       1 /       2015/01/23 18:59:18  47 min
682    24236  0        RUNNING  ANY       1 /       2015/01/23 18:59:18  47 min
683    24237  0        RUNNING  ANY       5 /       2015/01/23 18:59:20  47 min
684    24238  0        RUNNING  ANY       5 /       2015/01/23 18:59:20  47 min
685    24239  0        RUNNING  ANY       5 /       2015/01/23 18:59:20  47 min
686    24240  0        RUNNING  ANY       5 /       2015/01/23 18:59:20  47 min
687    24241  0        RUNNING  ANY       5 /       2015/01/23 18:59:20  47 min
688    24242  0        RUNNING  ANY       5 /       2015/01/23 18:59:20  47 min
689    24243  0        RUNNING  ANY       5 /       2015/01/23 18:59:20  47 min
690    24244  0        RUNNING  ANY       5 /       2015/01/23 18:59:20  47 min
691    24245  0        RUNNING  ANY       5 /       2015/01/23 18:59:21  47 min

What's odd is that even on the same machine (see below), some tasks are still completing in less than 5 minutes while other tasks on that machine appear to be hung after 46 minutes. Keep in mind that all I'm doing is saving the file to S3, so one would think the amount of work per task/partition would be fairly equal.
Index  ID     Attempt  Status   Locality  Executor  Launch Time          Duration  GC Time  Shuffle Read
694    24248  0        SUCCESS  ANY       0 /       2015/01/23 18:59:32  4.5 min   0.3 s    326.5 MB
695    24249  0        SUCCESS  ANY       0 /       2015/01/23 18:59:32  4.5 min   0.3 s    330.8 MB
696    24250  0        RUNNING  ANY       0 /       2015/01/23 18:59:32  46 min
697    24251  0        RUNNING  ANY       0 /       2015/01/23 18:59:32  46 min
698    24252  0        SUCCESS  ANY       0 /       2015/01/23 18:59:32  4.5 min   0.3 s    325.8 MB
699    24253  0        SUCCESS  ANY       0 /       2015/01/23 18:59:32  4.5 min   0.3 s    325.2 MB
700    24254  0        SUCCESS  ANY       0 /       2015/01/23 18:59:32  4.5 min   0.3 s    323.4 MB

If anyone has any suggestions, please let me know. I've tried playing around with various configuration options, but I've found nothing yet that fixes the underlying issue.

Thanks.
Darin.
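P.S. For what it's worth, these are the kinds of settings I've been experimenting with (the exact values below are illustrative; none of them has fixed it). The "no heartbeat in 60 seconds" message corresponds to spark.worker.timeout, whose standalone default is 60 s, so that one has to be raised on the master rather than in the application. On the application side, it's been along these lines:

    // Illustrative SparkConf tweaks (Spark 1.x property names)
    val conf = new SparkConf()
      .set("spark.akka.timeout", "300")                      // control-plane message timeout, in seconds
      .set("spark.core.connection.ack.wait.timeout", "600")  // wait longer for acks before assuming a peer is dead
      .set("spark.speculation", "true")                      // re-launch copies of straggler tasks on other executors
      .set("spark.speculation.multiplier", "3")              // only speculate tasks 3x slower than the median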