Hi folks,

We can successfully run Giraph on 1B vertices and around 20B edges in our cluster, which is great. But when we run it on the actual data, with 5B vertices and around 50B edges, we see issues in the final step while offloading the partitions.

Since the dataset is huge for our cluster, we are using giraph.useOutOfCoreGraph and giraph.useOutOfCoreMessages to spill the data to disk when overloaded. With this setup, all the supersteps finished within around 4 hours. But in the final step, after the task status reports that it is saving vertices, the job hangs after writing a few partitions. This happens consistently in our case.

I have played with all the config params and nothing has helped. Any suggestions would be really helpful. Thanks a lot.
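For reference, here is a minimal sketch of how the two out-of-core options named above could be turned into job arguments. It assumes the job is launched through GiraphRunner, which accepts custom configuration as `-ca key=value` pairs. The helper name `giraph_custom_args` is hypothetical and only for illustration; it is not part of Giraph.

```python
import shlex


def giraph_custom_args(options):
    """Turn a dict of Giraph options into a flat list of -ca key=value arguments.

    Hypothetical helper: assumes the job is started via GiraphRunner.
    """
    args = []
    for key, value in sorted(options.items()):
        # Hadoop-style configs expect lowercase booleans ("true"/"false").
        if isinstance(value, bool):
            value = str(value).lower()
        args += ["-ca", f"{key}={value}"]
    return args


# The two options mentioned in this mail.
out_of_core = {
    "giraph.useOutOfCoreGraph": True,
    "giraph.useOutOfCoreMessages": True,
}

# Print the fragment to append to the rest of the launch command.
print(shlex.join(giraph_custom_args(out_of_core)))
```

Keeping the options in one dict makes it easier to record exactly which combination was used for a given run when trying different settings.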
The log snippet:

2013-10-14 10:24:20,144 INFO org.apache.giraph.worker.BspServiceWorker: saveVertices: Starting to save 26146422 vertices
2013-10-14 10:24:20,183 INFO org.apache.giraph.partition.DiskBackedPartitionStore: offloadPartition: writing partition vertices 1922 to /mnt/diskg/mapred/local/taskTracker/sundi133/jobcache/job_201310130212_0013/attempt_201310130212_0013_m_000060_0/work/_bsp/_partitions/job_201310130212_0013/partition-1922_vertices
2013-10-14 10:24:20,307 WARN org.apache.giraph.bsp.BspService: process: Unknown and unprocessed event (path=/_hadoopBsp/job_201310130212_0013/_applicationAttemptsDir/0/_superstepDir/15/_addressesAndPartitions, type=NodeDeleted, state=SyncConnected)
2013-10-14 10:24:20,431 WARN org.apache.giraph.bsp.BspService: process: Unknown and unprocessed event (path=/_hadoopBsp/job_201310130212_0013/_applicationAttemptsDir/0/_superstepDir/15/_superstepFinished, type=NodeDeleted, state=SyncConnected)
2013-10-14 10:24:20,555 INFO org.apache.giraph.worker.BspServiceWorker: processEvent: Job state changed, checking to see if it needs to restart
2013-10-14 10:24:20,640 INFO org.apache.giraph.bsp.BspService: getJobState: Job state already exists (/_hadoopBsp/job_201310130212_0013/_masterJobState)
2013-10-14 10:24:22,928 INFO org.apache.giraph.partition.DiskBackedPartitionStore: offloadPartition: writing partition vertices 13762 to /mnt/diskg/mapred/local/taskTracker/sundi133/jobcache/job_201310130212_0013/attempt_201310130212_0013_m_000060_0/work/_bsp/_partitions/job_201310130212_0013/partition-13762_vertices
2013-10-14 10:24:27,648 INFO org.apache.giraph.partition.DiskBackedPartitionStore: offloadPartition: writing partition vertices 23682 to /mnt/diskg/mapred/local/taskTracker/sundi133/jobcache/job_201310130212_0013/attempt_201310130212_0013_m_000060_0/work/_bsp/_partitions/job_201310130212_0013/partition-23682_vertices
2013-10-14 10:24:30,557 INFO org.apache.giraph.partition.DiskBackedPartitionStore: offloadPartition: writing partition vertices 14882 to /mnt/diskg/mapred/local/taskTracker/sundi133/jobcache/job_201310130212_0013/attempt_201310130212_0013_m_000060_0/work/_bsp/_partitions/job_201310130212_0013/partition-14882_vertices
2013-10-14 10:24:32,935 INFO org.apache.giraph.partition.DiskBackedPartitionStore: offloadPartition: writing partition vertices 11842 to /mnt/diskg/mapred/local/taskTracker/sundi133/jobcache/job_201310130212_0013/attempt_201310130212_0013_m_000060_0/work/_bsp/_partitions/job_201310130212_0013/partition-11842_vertices
2013-10-14 10:24:33,714 INFO org.apache.giraph.partition.DiskBackedPartitionStore: offloadPartition: writing partition vertices 962 to /mnt/diskg/mapred/local/taskTracker/sundi133/jobcache/job_201310130212_0013/attempt_201310130212_0013_m_000060_0/work/_bsp/_partitions/job_201310130212_0013/partition-962_vertices
2013-10-14 10:24:35,184 INFO org.apache.giraph.worker.BspServiceWorker: saveVertices: Saved 978047 out of 26146422 vertices, on partition 5 out of 160
2013-10-14 10:24:35,187 INFO org.apache.giraph.partition.DiskBackedPartitionStore: offloadPartition: writing partition vertices 22722 to /mnt/diskg/mapred/local/taskTracker/sundi133/jobcache/job_201310130212_0013/attempt_201310130212_0013_m_000060_0/work/_bsp/_partitions/job_201310130212_0013/partition-22722_vertices
2013-10-14 10:24:37,276 INFO org.apache.giraph.partition.DiskBackedPartitionStore: offloadPartition: writing partition vertices 21762 to /mnt/diskg/mapred/local/taskTracker/sundi133/jobcache/job_201310130212_0013/attempt_201310130212_0013_m_000060_0/work/_bsp/_partitions/job_201310130212_0013/partition-21762_vertices
2013-10-14 10:24:39,868 INFO org.apache.giraph.partition.DiskBackedPartitionStore: offloadPartition: writing partition vertices 11362 to /mnt/diskg/mapred/local/taskTracker/sundi133/jobcache/job_201310130212_0013/attempt_201310130212_0013_m_000060_0/work/_bsp/_partitions/job_201310130212_0013/partition-11362_vertices
2013-10-14 10:24:41,391 INFO org.apache.giraph.partition.DiskBackedPartitionStore: offloadPartition: writing partition vertices 482 to /mnt/diskg/mapred/local/taskTracker/sundi133/jobcache/job_201310130212_0013/attempt_201310130212_0013_m_000060_0/work/_bsp/_partitions/job_201310130212_0013/partition-482_vertices

------------------------------

*The error shown on the job failure page for each attempt:*

FAILED
Task attempt_201310130212_0013_m_000001_0 failed to report status for 7200 seconds. Killing!

--
Best Regards,
Jyotirmoy Sundi
Data Engineer, Admobius
San Francisco, CA 94158
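To put a number on how far the save phase got before stalling, the last saveVertices progress line in the worker log can be parsed. This is a rough sketch that assumes the log format shown in the snippet above; the function name `save_progress` is made up for illustration. The regex scans the whole text, so it does not depend on how the log lines are wrapped.

```python
import re
from datetime import datetime

# Matches the "Saved X out of Y vertices, on partition P out of Q" progress line.
SAVED = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} INFO \S+BspServiceWorker: "
    r"saveVertices: Saved (\d+) out of (\d+) vertices, "
    r"on partition (\d+) out of (\d+)"
)


def save_progress(log_text):
    """Return (timestamp, vertex_fraction, partition_fraction) for the
    last progress line found in log_text, or None if there is none."""
    last = None
    for match in SAVED.finditer(log_text):
        last = match
    if last is None:
        return None
    when = datetime.strptime(last.group(1), "%Y-%m-%d %H:%M:%S")
    saved, total, part, parts = (int(g) for g in last.groups()[1:])
    return when, saved / total, part / parts


# Progress line copied from the log snippet in this mail.
sample = (
    "2013-10-14 10:24:35,184 INFO org.apache.giraph.worker.BspServiceWorker: "
    "saveVertices: Saved 978047 out of 26146422 vertices, on partition 5 out of 160"
)

when, vertex_frac, partition_frac = save_progress(sample)
print(f"{when}: {vertex_frac:.1%} of vertices, {partition_frac:.1%} of partitions")
```

For the line above this comes to under 4% of the vertices on this worker, which matches the description of the job hanging after only a few partitions are written.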