There are several things you can try to deal with that problem.

To start with:
- Decrease the number of workers to 1 per node, to maximize the amount of
RAM that each worker has. Xmx and Xms should be set to the same value; as
far as I know, this is good practice in every Java environment.
- Post the exact command that you use to launch the Giraph algorithm, so
everyone here can help you.
- I recommend checking how much memory is left once your whole graph is
loaded into the workers. The graph is loaded into memory in superstep -1
(minus one). Look at how much memory loading the graph takes, and check
how much is left for the rest of the supersteps.
- Increase the logging level of your application. You can get more detailed
information with giraph.logLevel=Debug (the default Giraph logging level is
info).
- Enable the isStaticGraph option to stop graph mutations that can make
your memory problems worse, at least until you have a clue about what is
going on.
- You should be using Giraph 1.2, which has better support for out-of-core
than the previous 1.1 version. Are you using it?
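
For the memory check suggested above, here is a minimal sketch of how a
worker JVM could report its own heap usage. It uses only the standard
java.lang.management API. HeapReport and its methods are hypothetical
names, not part of Giraph, and where you call it from (for example, some
hook that runs on each worker before a superstep) is an assumption you
would need to adapt to your own job.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.util.Locale;

/** Hypothetical helper (not part of Giraph) for reporting JVM heap usage. */
public final class HeapReport {
    private static final long MIB = 1024L * 1024L;

    private HeapReport() { }

    /** Converts bytes to MiB; a negative value means "undefined" and maps to -1. */
    public static long toMiB(long bytes) {
        return bytes < 0 ? -1 : bytes / MIB;
    }

    /** Heap still available below the maximum, in MiB, or -1 if the maximum is undefined. */
    public static long headroomMiB(MemoryUsage heap) {
        long max = heap.getMax();
        return max < 0 ? -1 : (max - heap.getUsed()) / MIB;
    }

    /** One log-friendly line; init and max normally reflect -Xms and -Xmx. */
    public static String describe(String label, MemoryUsage heap) {
        return String.format(Locale.ROOT,
            "%s: init=%d MiB, used=%d MiB, max=%d MiB, headroom=%d MiB",
            label, toMiB(heap.getInit()), toMiB(heap.getUsed()),
            toMiB(heap.getMax()), headroomMiB(heap));
    }

    public static void main(String[] args) {
        // Snapshot of the heap of the JVM this code runs in.
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        System.out.println(describe("current heap", heap));
    }
}
```

Logging one such line right after the graph is loaded and again at the
start of each superstep should show whether the heap is already almost
full after superstep -1 or whether it fills up later. If init and max
differ, Xms and Xms are probably not set to the same value.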

Check these tips and tell us about anything new you find.

Bye

-- 
*José Luis Larroque*
Analista Programador Universitario - Facultad de Informática - UNLP
Desarrollador Java y .NET  en LIFIA

2017-03-02 7:01 GMT-03:00 Sai Ganesh Muthuraman <saiganesh...@gmail.com>:

> Hi Jose,
>
> I went through the container logs and found that the following error was
> happening
> *java.lang.OutOfMemoryError: Java heap space*
> This was probably causing the missing chosen workers error.
>
> This happens only when the graph size exceeds 50k vertices and
> 100k edges.
> I enabled out of core messaging, out of core computation and heap size is
> 8GB which is reasonable given that I have 128 GB RAM in every node.
> I tried increasing the number of workers and the number of nodes to 6.
> Still the same result.
> This happens in superstep 2 itself.
> Any suggestions?
>
> Sai Ganesh
>
>
> On Feb 27, 2017, at 10:27 PM, José Luis Larroque <user@giraph.apache.org>
> wrote:
>
> Could be a lot of different reasons. Memory problems, algorithm problems,
> etc.
>
> I recommend that you focus on getting to the logs instead of guessing why
> the workers are dying. Maybe you are looking in the wrong place; maybe you
> can access them through the web UI instead of the command line.
>
> From the terminal, running yarn logs -applicationId "id" will be enough
> to see them. If you want to access the physical files on your nodes,
> you have to go to every node and search for the different containers of
> your application in the directory where they are stored.
>
> Another link with help:
> http://stackoverflow.com/questions/32713587/how-to-keep-yarns-log-files.
>
> Maybe you could test the algorithm locally instead of running it on the
> cluster, for a better understanding of the relation between yarn and Giraph.
>
> Bye
>
> --
> *José Luis Larroque*
> Analista Programador Universitario - Facultad de Informática - UNLP
> Desarrollador Java y .NET  en LIFIA
>
> 2017-02-27 12:27 GMT-03:00 Sai Ganesh Muthuraman <saiganesh...@gmail.com>:
>
> Hi,
>
> The first container in the application logs usually contains the gam logs.
> But the first container logs are not available. Hence no gam logs.
> What could be the possible reasons for the dying of some workers?
>
>
> Sai Ganesh
>
>
>
> On Feb 25, 2017, at 9:30 PM, José Luis Larroque <user@giraph.apache.org>
> wrote:
>
> You are probably looking at your Giraph application manager (gam) logs.
> You should look for your workers' logs; each worker has its own log (the
> container's logs). If you can't find them, look at your yarn configuration
> to find out where they are; see this:
> http://stackoverflow.com/questions/21621755/where-does-hadoop-store-the-logs-of-yarn-applications.
>
> I don't recommend enabling checkpointing until you know the specific
> error that you are facing. If you are facing out of memory errors, for
> example, checkpointing won't be helpful in my experience; the same error
> will happen over and over.
>
> --
> *José Luis Larroque*
> Analista Programador Universitario - Facultad de Informática - UNLP
> Desarrollador Java y .NET  en LIFIA
>
> 2017-02-25 12:38 GMT-03:00 Sai Ganesh Muthuraman <saiganesh...@gmail.com>:
>
> Hi Jose,
>
> Which logs do I have to look into exactly, because in the application
> logs, I found the error message that I mentioned and it was also mentioned
> that there was *No good last checkpoint.*
> I am not able to figure out the reason for the failure of a worker for
> bigger files. What do I have to look for in the logs?
> Also, How do I enable Checkpointing?
>
>
> - Sai Ganesh Muthuraman