Hi Ganesh,

For some reason, some of your workers are dying. When that happens, Giraph detects in "barrierOnWorkerList" that the number of workers is below what is necessary and looks for a checkpoint to restart from (a checkpoint is a backup of the state of a Giraph application). You apparently don't have checkpointing enabled, so the entire job is being killed. I recommend that you look in your container logs and try to find out why one or more workers are dying when you have bigger files.
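To know which container log to open first, it helps to pull the missing worker's host and task ID out of the master's "Missing chosen workers" message. Here is a minimal sketch of that, assuming the message keeps the `Worker(hostname=..., MRtaskID=..., port=...)` shape seen in your log. The class and method names (`MissingWorkerParser`, `parse`) are made up for illustration and are not part of Giraph.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Giraph: extracts the workers listed in a
// "Missing chosen workers" log message so you know whose logs to inspect.
class MissingWorkerParser {

    // Assumes entries look like:
    // Worker(hostname=comet-10-68.sdsc.edu, MRtaskID=2, port=30002)
    private static final Pattern WORKER = Pattern.compile(
        "Worker\\(hostname=([^,\\s]+),\\s*MRtaskID=(\\d+),\\s*port=(\\d+)\\)");

    // Plain value holder for one missing worker.
    static final class Worker {
        final String hostname;
        final int taskId;
        final int port;

        Worker(String hostname, int taskId, int port) {
            this.hostname = hostname;
            this.taskId = taskId;
            this.port = port;
        }

        @Override
        public String toString() {
            return "host=" + hostname + " task=" + taskId + " port=" + port;
        }
    }

    // Returns every worker entry found in the message; empty list if none.
    static List<Worker> parse(String logMessage) {
        List<Worker> workers = new ArrayList<>();
        Matcher m = WORKER.matcher(logMessage);
        while (m.find()) {
            workers.add(new Worker(
                m.group(1),
                Integer.parseInt(m.group(2)),
                Integer.parseInt(m.group(3))));
        }
        return workers;
    }

    public static void main(String[] args) {
        // Message text taken from the log quoted in this thread.
        String message = "barrierOnWorkerList: Missing chosen workers "
            + "[Worker(hostname=comet-10-68.sdsc.edu, MRtaskID=2, port=30002)] "
            + "on superstep 2";
        for (Worker w : parse(message)) {
            System.out.println(w);
        }
    }
}
```

With the host and task ID in hand, you can go to that node's container log. If log aggregation is enabled on your cluster, `yarn logs -applicationId <application id>` should also return the container logs in one place.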
Bye!

--
*José Luis Larroque*
Analista Programador Universitario - Facultad de Informática - UNLP
Desarrollador Java y .NET en LIFIA

2017-02-25 3:24 GMT-03:00 Sai Ganesh Muthuraman <saiganesh...@gmail.com>:

> Hi,
>
> I used one worker per node and that worked for smaller files. When the
> file size was more than 25 MB, I got this strange exception. I tried using
> 2 nodes and 3 nodes, the result is the same.
>
> *ERROR* [org.apache.giraph.master.MasterThread] master.BspServiceMaster (BspServiceMaster.java:barrierOnWorkerList(1415)) - barrierOnWorkerList: *Missing chosen workers* [Worker(hostname=comet-10-68.sdsc.edu, MRtaskID=2, port=30002)] on superstep 2
> *FATAL* [org.apache.giraph.master.MasterThread] master.BspServiceMaster (BspServiceMaster.java:getLastGoodCheckpoint(1291)) - getLastGoodCheckpoint: No last good checkpoints can be found, killing the job.
> java.io.FileNotFoundException: File hdfs://comet-10-33.ibnet:54310/user/saiganes/_bsp/_checkpoints/giraph_yarn_application_1488002378889_0001 does not exist.
>     at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:697)
>     at org.apache.hadoop.hdfs.DistributedFileSystem.access$600(DistributedFileSystem.java:105)
>     at org.apache.hadoop.hdfs.DistributedFileSystem$15.doCall(DistributedFileSystem.java:755)
>     at org.apache.hadoop.hdfs.DistributedFileSystem$15.doCall(DistributedFileSystem.java:751)
>     at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>     at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:751)
>     at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1485)
>     at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1525)
>     at org.apache.giraph.utils.CheckpointingUtils.getLastCheckpointedSuperstep(CheckpointingUtils.java:107)
>     at org.apache.giraph.bsp.BspService.getLastCheckpointedSuperstep(BspService.java:1196)
>     at org.apache.giraph.master.BspServiceMaster.getLastGoodCheckpoint(BspServiceMaster.java:1289)
>     at org.apache.giraph.master.MasterThread.run(MasterThread.java:149)
>
> - Sai Ganesh