There are lots of suggestions for dealing with that problem. Some first ones:

- Decrease the number of workers to 1 per node, to maximize the amount of RAM each worker has. Xmx and Xms should be the same; as far as I know this is good practice in any Java environment.
- Post the exact command you are using to invoke the Giraph algorithm, so everyone here can help you.
- I recommend you check how much memory is left once your whole graph is loaded into the workers. In superstep -1 (minus one) your graph is loaded into memory. You should look at how much memory loading the graph takes, and check what memory is left for the rest of the supersteps.
- Increase the logging level of your application. You can get more detailed information using giraph.logLevel=Debug (the default Giraph logging level is info).
- Enable the isStaticGraph option, to stop graph mutations that can make your memory problems worse. At least until you have a clue of what is going on.
- You should be using Giraph 1.2, which has better support for out-of-core, instead of the previous 1.1 version. Are you using it?
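Pulling the suggestions above into one sketch, a GiraphRunner invocation might look roughly like this. Everything here is a placeholder (jar name, computation class, input/output paths, property values), not taken from your job; check the option names against your Giraph version:

```shell
# Hypothetical example -- adjust jar, computation class, and HDFS paths.
# -Dmapred.child.java.opts sets Xms == Xmx, as recommended above.
hadoop jar giraph-examples.jar org.apache.giraph.GiraphRunner \
    -Dmapred.child.java.opts="-Xms8g -Xmx8g" \
    org.apache.giraph.examples.SimpleShortestPathsComputation \
    -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat \
    -vip /user/me/input/graph.txt \
    -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
    -op /user/me/output/sssp \
    -w 1 \
    -ca giraph.logLevel=Debug \
    -ca giraph.isStaticGraph=true \
    -ca giraph.useOutOfCoreGraph=true
```

Here `-w 1` gives one worker per node (with one node per container), and the `-ca` custom arguments enable debug logging, the static-graph optimization, and Giraph 1.2's out-of-core graph support.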
Check these tips and tell us any new information.

Bye

--
*José Luis Larroque*
Analista Programador Universitario - Facultad de Informática - UNLP
Desarrollador Java y .NET en LIFIA

2017-03-02 7:01 GMT-03:00 Sai Ganesh Muthuraman <saiganesh...@gmail.com>:

> Hi Jose,
>
> I went through the container logs and found that the following error was
> happening:
> *java.lang.OutOfMemoryError: Java heap space*
> This was probably causing the missing chosen workers error.
>
> This happens only when the graph size exceeds 50k vertices and 100k edges.
> I enabled out-of-core messaging and out-of-core computation, and the heap
> size is 8GB, which is reasonable given that I have 128 GB RAM in every node.
> I tried increasing the number of workers and the number of nodes to 6.
> Still the same result.
> This happens in superstep 2 itself.
> Any suggestions?
>
> Sai Ganesh
>
>
> On Feb 27, 2017, at 10:27 PM, José Luis Larroque <user@giraph.apache.org>
> wrote:
>
> It could be a lot of different reasons: memory problems, algorithm
> problems, etc.
>
> I recommend you focus on getting to the logs instead of guessing why the
> workers are dying. Maybe you are looking in the wrong place; maybe you can
> access them through the web UI instead of the command line.
>
> From a terminal, running yarn logs -applicationId "id" will be enough
> to see them. If you want to access the physical files on your nodes,
> you should go to every node and search for the different containers of
> your application in the directory where they are stored.
>
> Another link with help:
> http://stackoverflow.com/questions/32713587/how-to-keep-yarns-log-files
>
> Maybe you could test the algorithm locally instead of running it on the
> cluster, for a better understanding of the relation between YARN and Giraph.
>
> Bye
>
> --
> *José Luis Larroque*
> Analista Programador Universitario - Facultad de Informática - UNLP
> Desarrollador Java y .NET en LIFIA
>
> 2017-02-27 12:27 GMT-03:00 Sai Ganesh Muthuraman <saiganesh...@gmail.com>:
>
> Hi,
>
> The first container in the application logs usually contains the gam logs.
> But the first container logs are not available, hence no gam logs.
> What could be the possible reasons for the dying of some workers?
>
> Sai Ganesh
>
>
> On Feb 25, 2017, at 9:30 PM, José Luis Larroque <user@giraph.apache.org>
> wrote:
>
> You are probably looking at your Giraph application master (gam) logs.
> You should look for your workers' logs; each one has its own (the
> container's logs). If you can't find them, you should look at your YARN
> configuration to find out where they are, see this:
> http://stackoverflow.com/questions/21621755/where-does-hadoop-store-the-logs-of-yarn-applications
>
> I don't recommend enabling checkpointing until you know the specific
> error that you are facing. If you are facing out-of-memory errors, for
> example, checkpointing won't be helpful in my experience; the same error
> will happen over and over.
>
> --
> *José Luis Larroque*
> Analista Programador Universitario - Facultad de Informática - UNLP
> Desarrollador Java y .NET en LIFIA
>
> 2017-02-25 12:38 GMT-03:00 Sai Ganesh Muthuraman <saiganesh...@gmail.com>:
>
> Hi Jose,
>
> Which logs do I have to look into exactly? In the application logs, I
> found the error message that I mentioned, and it was also mentioned
> that there was *No good last checkpoint.*
> I am not able to figure out the reason for the failure of a worker on
> bigger files. What do I have to look for in the logs?
> Also, how do I enable checkpointing?
>
>
> - Sai Ganesh Muthuraman
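For the two recurring questions in the thread (where are the worker logs, and how is checkpointing enabled), a minimal sketch follows. The application ID is a placeholder, and the checkpoint property names are as I remember them from GiraphConstants; verify both against your own cluster and Giraph version:

```shell
# List recent applications to find your ID (placeholder below), then
# aggregate all container logs -- workers included, not just the AM --
# into one file and search it for the real failure.
yarn application -list -appStates FINISHED,FAILED,KILLED
yarn logs -applicationId application_1488216548676_0012 > app.log
grep -n "OutOfMemoryError\|Exception" app.log | head

# Checkpointing is enabled per job via custom arguments, e.g.:
#   -ca giraph.checkpointFrequency=2              # every 2 supersteps (0 = disabled)
#   -ca giraph.checkpointDirectory=_bsp/_checkpoints/
```

As noted earlier in the thread, though, enabling checkpointing will not help with a deterministic OutOfMemoryError: the restarted superstep will simply fail the same way.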