On 23.1.2015 00:40, Walaa Eldin Moustafa wrote:
Hi,
I am experimenting with a memory-intensive Giraph application on top
of a large graph (50 million nodes), on a 14-node cluster.
When I set the number of workers to a large value (500 in this
example), I get errors saying that the requested number of workers
could not be fulfilled (please see the log excerpt below). To my
understanding, this contradicts how YARN/MR map tasks operate: if
the number of map tasks exceeds what the currently available
resources allow, only a subset of the maps is started, and new
ones are assigned as slots become available. In other words, as
many map tasks as possible run concurrently, and the rest run
as resources become available. Is this not the case with Giraph
workers? I would expect it to be, since workers are basically map
tasks, so the same should apply to them. However, the log below
suggests otherwise: based on my resources, 37 map tasks (workers)
could be created, but the application could not proceed without
creating all 500 workers. Could you please help explain what is
causing this?
Hi,
Giraph is not a standard M/R job. It needs all mappers to run at the
same time. No computation starts before all mappers are running.
It's hard to tell why it does not work. I guess you have already raised
the timeout. Check whether there are enough slots in the queue where the
job is running, and give Giraph a higher priority.
Lukas
Thanks,
Walaa.
Only found 37 responses of 500 needed to start superstep -1. Reporting
every 30000 msecs, 296929 more msecs left before giving up.
2015-01-20 01:29:49,007 ERROR [org.apache.giraph.master.MasterThread]
org.apache.giraph.master.BspServiceMaster: checkWorkers: Did not
receive enough processes in time (only 37 of 500 required) after
waiting 600000msecs). This occurs if you do not have enough map tasks
available simultaneously on your Hadoop instance to fulfill the number
of requested workers.
2015-01-20 01:29:49,015 FATAL [org.apache.giraph.master.MasterThread]
org.apache.giraph.master.BspServiceMaster: failJob: Killing job
job_1421703431598_0006
2015-01-20 01:29:49,015 FATAL [org.apache.giraph.master.MasterThread]
org.apache.giraph.master.BspServiceMaster: failJob: exception
java.lang.IllegalStateException: Not enough healthy workers to create
input splits
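
To illustrate the difference discussed in this thread, here is a minimal Python sketch contrasting wave-style scheduling of ordinary map tasks with the all-at-once start that Giraph workers need. The function names mapreduce_waves and giraph_can_start are hypothetical and are not part of Giraph or Hadoop. The numbers 500 and 37 come from the log excerpt above. The model is simplified: it treats each worker as one slot and ignores any extra task used by the master.

```python
import math


def mapreduce_waves(num_tasks: int, concurrent_slots: int) -> int:
    """Ordinary MapReduce: tasks run in waves as slots free up.

    Returns how many waves are needed to run every task.
    """
    if concurrent_slots <= 0:
        raise ValueError("need at least one slot")
    return math.ceil(num_tasks / concurrent_slots)


def giraph_can_start(requested_workers: int, concurrent_slots: int) -> bool:
    """Giraph-style start (simplified assumption): all requested workers
    must be running at the same time before superstep -1 can begin."""
    return concurrent_slots >= requested_workers


# Numbers from the log excerpt: 500 workers requested, 37 responded.
print(mapreduce_waves(500, 37))   # 14 waves would finish a plain M/R job
print(giraph_can_start(500, 37))  # False: the job waits, then times out
print(giraph_can_start(37, 37))   # True: request no more than fits at once
```

Under these assumptions, a plain M/R job with 500 map tasks would simply take more waves, while a Giraph job asking for 500 workers can never start on a cluster that fits only 37 at a time.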