Hi all;

I have a 10 machine cluster; 8 Tasktrackers, 1 namenode and 1 Jobtracker. All of them have 8GB memory. When I run a giraph job with 16 workers on a hbase table with 15 millions of rows (12GB of data) my job does not finish due to 2 (sometimes 3) workers stuck at superstep -1. Below you can see the logs of these workers. Any helps/ideas?
Thanks
Sertug

*_syslog logs_*
2013-11-08 19:00:06,576 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded 
the native-hadoop library
2013-11-08 19:00:07,449 INFO org.apache.zookeeper.ZooKeeper: Client 
environment:zookeeper.version=3.3.3-1073969, built on 02/23/2011 22:27 GMT
2013-11-08 19:00:07,449 INFO org.apache.zookeeper.ZooKeeper: Client 
environment:host.name=host-10-6-149-104
2013-11-08 19:00:07,449 INFO org.apache.zookeeper.ZooKeeper: Client 
environment:java.version=1.7.0_21
2013-11-08 19:00:07,449 INFO org.apache.zookeeper.ZooKeeper: Client 
environment:java.vendor=Oracle Corporation
2013-11-08 19:00:07,449 INFO org.apache.zookeeper.ZooKeeper: Client 
environment:java.home=/usr/java/jdk1.7.0_21/jre
2013-11-08 19:00:07,449 INFO org.apache.zookeeper.ZooKeeper: Client 
environment:java.class.path=/hadoop/mapred/taskTracker/root/jobcache/job_201311081157_0003/jars/classes:/hadoop/mapred/taskTracker/root/jobcache/job_201311081157_0003/jars:/hadoop/mapred/taskTracker/root/jobcache/job_201311081157_0003/attempt_201311081157_0003_m_000009_0/work:/usr/local/hadoop-1.2.1/libexec/../conf:/usr/java/default/lib/tools.jar:/usr/local/hadoop-1.2.1/libexec/..:/usr/local/hadoop-1.2.1/libexec/../hadoop-core-1.2.1.jar:/usr/local/hadoop-1.2.1/libexec/../lib/asm-3.2.jar:/usr/local/hadoop-1.2.1/libexec/../lib/aspectjrt-1.6.11.jar:/usr/local/hadoop-1.2.1/libexec/../lib/aspectjtools-1.6.11.jar:/usr/local/hadoop-1.2.1/libexec/../lib/commons-beanutils-1.7.0.jar:/usr/local/hadoop-1.2.1/libexec/../lib/commons-beanutils-core-1.8.0.jar:/usr/local/hadoop-1.2.1/libexec/../lib/commons-cli-1.2.jar:/usr/local/hadoop-1.2.1/libexec/../lib/commons-codec-1.4.jar:/usr/local/hadoop-1.2.1/libexec/../lib/commons-collections-3.2.1.jar:/usr/local/hadoop-1.2.1/libexec/../lib/commons-configuration-1.6.jar:/usr/local/hadoop-1.2.1/libexec/../lib/commons-daemon-1.0.1.jar:/usr/local/hadoop-1.2.1/libexec/../lib/commons-digester-1.8.jar:/usr/local/hadoop-1.2.1/libexec/../lib/commons-el-1.0.jar:/usr/local/hadoop-1.2.1/libexec/../lib/commons-httpclient-3.0.1.jar:/usr/local/hadoop-1.2.1/libexec/../lib/commons-io-2.1.jar:/usr/local/hadoop-1.2.1/libexec/../lib/commons-lang-2.4.jar:/usr/local/hadoop-1.2.1/libexec/../lib/commons-logging-1.1.1.jar:/usr/local/hadoop-1.2.1/libexec/../lib/commons-logging-api-1.0.4.jar:/usr/local/hadoop-1.2.1/libexec/../lib/commons-math-2.1.jar:/usr/local/hadoop-1.2.1/libexec/../lib/commons-net-3.1.jar:/usr/local/hadoop-1.2.1/libexec/../lib/core-3.1.1.jar:/usr/local/hadoop-1.2.1/libexec/../lib/giraph-nutch-1.1.0-SNAPSHOT-for-hadoop-0.20.203.0-jar-with-dependencies.jar:/usr/local/hadoop-1.2.1/libexec/../lib/hadoop-capacity-scheduler-1.2.1.jar:/usr/local/hadoop-1.2.1/libexec/../lib/hadoop-fairscheduler-1.2.1.jar:/usr/local/hadoop-1.2.1/libexec/../lib/hadoop-thriftfs-1.2.1.jar:/usr/local/hadoop-1.2.1/libexec/../lib/hsqldb-1.8.0.10.jar:/usr/local/hadoop-1.2.1/libexec/../lib/jackson-core-asl-1.8.8.jar:/usr/local/hadoop-1.2.1/libexec/../lib/jackson-mapper-asl-1.8.8.jar:/usr/local/hadoop-1.2.1/libexec/../lib/jasper-compiler-5.5.12.jar:/usr/local/hadoop-1.2.1/libexec/../lib/jasper-runtime-5.5.12.jar:/usr/local/hadoop-1.2.1/libexec/../lib/jdeb-0.8.jar:/usr/local/hadoop-1.2.1/libexec/../lib/jersey-core-1.8.jar:/usr/local/hadoop-1.2.1/libexec/../lib/jersey-json-1.8.jar:/usr/local/hadoop-1.2.1/libexec/../lib/jersey-server-1.8.jar:/usr/local/hadoop-1.2.1/libexec/../lib/jets3t-0.6.1.jar:/usr/local/hadoop-1.2.1/libexec/../lib/jetty-6.1.26.jar:/usr/local/hadoop-1.2.1/libexec/../lib/jetty-util-6.1.26.jar:/usr/local/hadoop-1.2.1/libexec/../lib/jsch-0.1.42.jar:/usr/local/hadoop-1.2.1/libexec/../lib/junit-4.5.jar:/usr/local/hadoop-1.2.1/libexec/../lib/kfs-0.2.2.jar:/usr/local/hadoop-1.2.1/libexec/../lib/log4j-1.2.15.jar:/usr/local/hadoop-1.2.1/libexec/../lib/mockito-all-1.8.5.jar:/usr/local/hadoop-1.2.1/libexec/../lib/oro-2.0.8.jar:/usr/local/hadoop-1.2.1/libexec/../lib/servlet-api-2.5-20081211.jar:/usr/local/hadoop-1.2.1/libexec/../lib/slf4j-api-1.4.3.jar:/usr/local/hadoop-1.2.1/libexec/../lib/slf4j-log4j12-1.4.3.jar:/usr/local/hadoop-1.2.1/libexec/../lib/xmlenc-0.52.jar:/usr/local/hadoop-1.2.1/libexec/../lib/jsp-2.1/jsp-2.1.jar:/usr/local/hadoop-1.2.1/libexec/../lib/jsp-2.1/jsp-api-2.1.jar
2013-11-08 19:00:07,449 INFO org.apache.zookeeper.ZooKeeper: Client 
environment:java.library.path=/usr/local/hadoop-1.2.1/libexec/../lib/native/Linux-amd64-64:/hadoop/mapred/taskTracker/root/jobcache/job_201311081157_0003/attempt_201311081157_0003_m_000009_0/work
2013-11-08 19:00:07,449 INFO org.apache.zookeeper.ZooKeeper: Client 
environment:java.io.tmpdir=/hadoop/mapred/taskTracker/root/jobcache/job_201311081157_0003/attempt_201311081157_0003_m_000009_0/work/tmp
2013-11-08 19:00:07,449 INFO org.apache.zookeeper.ZooKeeper: Client 
environment:java.compiler=<NA>
2013-11-08 19:00:07,449 INFO org.apache.zookeeper.ZooKeeper: Client 
environment:os.name=Linux
2013-11-08 19:00:07,449 INFO org.apache.zookeeper.ZooKeeper: Client 
environment:os.arch=amd64
2013-11-08 19:00:07,449 INFO org.apache.zookeeper.ZooKeeper: Client 
environment:os.version=2.6.32-358.18.1.el6.x86_64
2013-11-08 19:00:07,449 INFO org.apache.zookeeper.ZooKeeper: Client 
environment:user.name=root
2013-11-08 19:00:07,449 INFO org.apache.zookeeper.ZooKeeper: Client 
environment:user.home=/root
2013-11-08 19:00:07,450 INFO org.apache.zookeeper.ZooKeeper: Client 
environment:user.dir=/hadoop/mapred/taskTracker/root/jobcache/job_201311081157_0003/attempt_201311081157_0003_m_000009_0/work
2013-11-08 19:00:07,450 INFO org.apache.zookeeper.ZooKeeper: Initiating client 
connection, connectString=host-10-6-149-98:2181 sessionTimeout=180000 
watcher=hconnection
2013-11-08 19:00:07,465 INFO org.apache.zookeeper.ClientCnxn: Opening socket 
connection to server host-10-6-149-98/10.6.149.98:2181
2013-11-08 19:00:07,469 INFO org.apache.zookeeper.ClientCnxn: Socket connection 
established to host-10-6-149-98/10.6.149.98:2181, initiating session
2013-11-08 19:00:07,475 INFO org.apache.zookeeper.ClientCnxn: Session 
establishment complete on server host-10-6-149-98/10.6.149.98:2181, sessionid = 
0x142380291760046, negotiated timeout = 40000
2013-11-08 19:00:07,663 INFO 
org.apache.hadoop.hbase.mapreduce.TableOutputFormat: Created table instance for 
host
2013-11-08 19:00:07,674 INFO org.apache.hadoop.util.ProcessTree: setsid exited 
with exit code 0
2013-11-08 19:00:07,683 INFO org.apache.hadoop.mapred.Task:  Using 
ResourceCalculatorPlugin : 
org.apache.hadoop.util.LinuxResourceCalculatorPlugin@271b72b5
2013-11-08 19:00:07,762 INFO org.apache.hadoop.mapred.MapTask: Processing 
split: 'org.apache.giraph.bsp.BspInputSplit, index=-1, num=-1
2013-11-08 19:00:07,806 INFO org.apache.giraph.graph.GraphTaskManager: setup: 
Log level remains at info
2013-11-08 19:00:07,891 INFO org.apache.giraph.graph.GraphTaskManager: 
Distributed cache is empty. Assuming fatjar.
2013-11-08 19:00:07,892 INFO org.apache.giraph.graph.GraphTaskManager: setup: 
classpath @ 
/hadoop/mapred/taskTracker/root/jobcache/job_201311081157_0003/jars/job.jar for 
job org.apache.nutch.scoring.HostRankGiraphJob
2013-11-08 19:00:07,892 INFO org.apache.giraph.graph.GraphTaskManager: setup: 
Starting up BspServiceWorker...
2013-11-08 19:00:07,906 INFO org.apache.giraph.bsp.BspService: BspService: Path 
to create to halt is /_hadoopBsp/job_201311081157_0003/_haltComputation
2013-11-08 19:00:07,906 INFO org.apache.giraph.bsp.BspService: BspService: 
Connecting to ZooKeeper with job job_201311081157_0003, 9 on 
host-10-6-149-98:2181
2013-11-08 19:00:07,907 INFO org.apache.zookeeper.ZooKeeper: Initiating client 
connection, connectString=host-10-6-149-98:2181 sessionTimeout=60000 
watcher=org.apache.giraph.worker.BspServiceWorker@7d766a1d
2013-11-08 19:00:07,918 INFO org.apache.zookeeper.ClientCnxn: Opening socket 
connection to server host-10-6-149-98/10.6.149.98:2181
2013-11-08 19:00:07,919 INFO org.apache.zookeeper.ClientCnxn: Socket connection 
established to host-10-6-149-98/10.6.149.98:2181, initiating session
2013-11-08 19:00:07,922 INFO org.apache.zookeeper.ClientCnxn: Session 
establishment complete on server host-10-6-149-98/10.6.149.98:2181, sessionid = 
0x142380291760055, negotiated timeout = 40000
2013-11-08 19:00:07,922 INFO org.apache.giraph.bsp.BspService: process: 
Asynchronous connection complete.
2013-11-08 19:00:08,004 INFO org.apache.giraph.comm.netty.NettyServer: 
NettyServer: Using execution handler with 8 threads after requestFrameDecoder.
2013-11-08 19:00:08,042 INFO org.apache.giraph.comm.netty.NettyServer: start: 
Started server communication server: host-10-6-149-104/10.6.149.104:30009 with 
up to 16 threads on bind attempt 0 with sendBufferSize = 32768 
receiveBufferSize = 524288 backlog = 15
2013-11-08 19:00:08,047 INFO org.apache.giraph.comm.netty.NettyClient: 
NettyClient: Using execution handler with 8 threads after requestEncoder.
2013-11-08 19:00:08,058 INFO org.apache.giraph.graph.GraphTaskManager: setup: 
Registering health of this worker...
2013-11-08 19:00:08,078 INFO org.apache.giraph.bsp.BspService: getJobState: Job 
state already exists (/_hadoopBsp/job_201311081157_0003/_masterJobState)
2013-11-08 19:00:08,083 INFO org.apache.giraph.bsp.BspService: 
getApplicationAttempt: Node 
/_hadoopBsp/job_201311081157_0003/_applicationAttemptsDir already exists!
2013-11-08 19:00:08,087 INFO org.apache.giraph.bsp.BspService: 
getApplicationAttempt: Node 
/_hadoopBsp/job_201311081157_0003/_applicationAttemptsDir already exists!
2013-11-08 19:00:08,092 INFO org.apache.giraph.worker.BspServiceWorker: 
registerHealth: Created my health node for attempt=0, superstep=-1 with 
/_hadoopBsp/job_201311081157_0003/_applicationAttemptsDir/0/_superstepDir/-1/_workerHealthyDir/host-10-6-149-104_9
 and workerInfo= Worker(hostname=host-10-6-149-104, MRtaskID=9, port=30009)
2013-11-08 19:00:08,100 INFO org.apache.giraph.worker.BspServiceWorker: 
startSuperstep: Master(hostname=host-10-6-149-103, MRtaskID=0, port=30000)
2013-11-08 19:00:08,100 INFO org.apache.giraph.worker.BspServiceWorker: 
startSuperstep: Ready for computation on superstep -1 since worker selection 
and vertex range assignments are done in 
/_hadoopBsp/job_201311081157_0003/_applicationAttemptsDir/0/_superstepDir/-1/_addressesAndPartitions
2013-11-08 19:00:08,101 INFO org.apache.giraph.comm.netty.NettyClient: Using 
Netty without authentication.
2013-11-08 19:00:08,115 INFO org.apache.giraph.comm.netty.NettyClient: Using 
Netty without authentication.
2013-11-08 19:00:08,120 INFO org.apache.giraph.comm.netty.NettyClient: Using 
Netty without authentication.
2013-11-08 19:00:08,121 INFO org.apache.giraph.comm.netty.NettyClient: Using 
Netty without authentication.
2013-11-08 19:00:08,121 INFO org.apache.giraph.comm.netty.NettyClient: Using 
Netty without authentication.
2013-11-08 19:00:08,127 INFO org.apache.giraph.comm.netty.NettyClient: Using 
Netty without authentication.
2013-11-08 19:00:08,129 INFO org.apache.giraph.comm.netty.NettyClient: Using 
Netty without authentication.
2013-11-08 19:00:08,131 INFO org.apache.giraph.comm.netty.NettyClient: Using 
Netty without authentication.
2013-11-08 19:00:08,131 INFO org.apache.giraph.comm.netty.NettyClient: Using 
Netty without authentication.
2013-11-08 19:00:08,131 INFO org.apache.giraph.comm.netty.NettyClient: Using 
Netty without authentication.
2013-11-08 19:00:08,132 INFO org.apache.giraph.comm.netty.NettyClient: Using 
Netty without authentication.
2013-11-08 19:00:08,132 INFO org.apache.giraph.comm.netty.NettyClient: Using 
Netty without authentication.
2013-11-08 19:00:08,132 INFO org.apache.giraph.comm.netty.NettyClient: Using 
Netty without authentication.
2013-11-08 19:00:08,132 INFO org.apache.giraph.comm.netty.NettyClient: Using 
Netty without authentication.
2013-11-08 19:00:08,139 INFO org.apache.giraph.comm.netty.NettyClient: 
connectAllAddresses: Successfully added 14 connections, (14 total connected) 0 
failed, 0 failures total.
2013-11-08 19:00:18,154 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: 
waitForRequiredPermits: Waiting for 1 more tasks to send their aggregator data, 
task ids: [-1]
2013-11-08 19:00:38,155 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: 
waitForRequiredPermits: Waiting for 1 more tasks to send their aggregator data, 
task ids: [-1]
2013-11-08 19:00:58,155 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: 
waitForRequiredPermits: Waiting for 1 more tasks to send their aggregator data, 
task ids: [-1]
2013-11-08 19:01:08,156 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: 
waitForRequiredPermits: Waiting for 1 more tasks to send their aggregator data, 
task ids: [-1]
2013-11-08 19:01:28,156 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: 
waitForRequiredPermits: Waiting for 1 more tasks to send their aggregator data, 
task ids: [-1]
2013-11-08 19:01:38,157 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: 
waitForRequiredPermits: Waiting for 1 more tasks to send their aggregator data, 
task ids: [-1]
2013-11-08 19:01:58,157 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: 
waitForRequiredPermits: Waiting for 1 more tasks to send their aggregator data, 
task ids: [-1]
2013-11-08 19:02:08,158 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: 
waitForRequiredPermits: Waiting for 1 more tasks to send their aggregator data, 
task ids: [-1]
2013-11-08 19:02:28,158 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: 
waitForRequiredPermits: Waiting for 1 more tasks to send their aggregator data, 
task ids: [-1]
2013-11-08 19:02:32,555 WARN org.apache.giraph.bsp.BspService: process: Unknown 
and unprocessed event 
(path=/_hadoopBsp/job_201311081157_0003/_applicationAttemptsDir/0/_superstepDir,
 type=NodeChildrenChanged, state=SyncConnected)
2013-11-08 19:02:38,159 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: 
waitForRequiredPermits: Waiting for 1 more tasks to send their aggregator data, 
task ids: [-1]
2013-11-08 19:02:58,159 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: 
waitForRequiredPermits: Waiting for 1 more tasks to send their aggregator data, 
task ids: [-1]
2013-11-08 19:03:08,160 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: 
waitForRequiredPermits: Waiting for 1 more tasks to send their aggregator data, 
task ids: [-1]
2013-11-08 19:03:28,160 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: 
waitForRequiredPermits: Waiting for 1 more tasks to send their aggregator data, 
task ids: [-1]
2013-11-08 19:03:38,161 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: 
waitForRequiredPermits: Waiting for 1 more tasks to send their aggregator data, 
task ids: [-1]
2013-11-08 19:03:58,161 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: 
waitForRequiredPermits: Waiting for 1 more tasks to send their aggregator data, 
task ids: [-1]
2013-11-08 19:04:08,162 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: 
waitForRequiredPermits: Waiting for 1 more tasks to send their aggregator data, 
task ids: [-1]
2013-11-08 19:04:28,162 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: 
waitForRequiredPermits: Waiting for 1 more tasks to send their aggregator data, 
task ids: [-1]
2013-11-08 19:04:38,163 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: 
waitForRequiredPermits: Waiting for 1 more tasks to send their aggregator data, 
task ids: [-1]
2013-11-08 19:04:58,163 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: 
waitForRequiredPermits: Waiting for 1 more tasks to send their aggregator data, 
task ids: [-1]
2013-11-08 19:05:08,164 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: 
waitForRequiredPermits: Waiting for 1 more tasks to send their aggregator data, 
task ids: [-1]
2013-11-08 19:05:28,164 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: 
waitForRequiredPermits: Waiting for 1 more tasks to send their aggregator data, 
task ids: [-1]
2013-11-08 19:05:38,165 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: 
waitForRequiredPermits: Waiting for 1 more tasks to send their aggregator data, 
task ids: [-1]
2013-11-08 19:05:58,165 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: 
waitForRequiredPermits: Waiting for 1 more tasks to send their aggregator data, 
task ids: [-1]
2013-11-08 19:06:08,166 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: 
waitForRequiredPermits: Waiting for 1 more tasks to send their aggregator data, 
task ids: [-1]
2013-11-08 19:06:28,166 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: 
waitForRequiredPermits: Waiting for 1 more tasks to send their aggregator data, 
task ids: [-1]
2013-11-08 19:06:38,167 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: 
waitForRequiredPermits: Waiting for 1 more tasks to send their aggregator data, 
task ids: [-1]
2013-11-08 19:06:58,167 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: 
waitForRequiredPermits: Waiting for 1 more tasks to send their aggregator data, 
task ids: [-1]
2013-11-08 19:07:08,168 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: 
waitForRequiredPermits: Waiting for 1 more tasks to send their aggregator data, 
task ids: [-1]
2013-11-08 19:07:26,593 INFO org.apache.giraph.worker.BspServiceWorker: 
processEvent: Job state changed, checking to see if it needs to restart
2013-11-08 19:07:26,605 INFO org.apache.giraph.bsp.BspService: getJobState: Job 
state already exists (/_hadoopBsp/job_201311081157_0003/_masterJobState)
2013-11-08 19:07:28,168 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: 
waitForRequiredPermits: Waiting for 1 more tasks to send their aggregator data, 
task ids: [-1]
2013-11-08 19:07:38,169 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: 
waitForRequiredPermits: Waiting for 1 more tasks to send their aggregator data, 
task ids: [-1]
2013-11-08 19:07:58,169 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: 
waitForRequiredPermits: Waiting for 1 more tasks to send their aggregator data, 
task ids: [-1]
2013-11-08 19:08:08,170 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: 
waitForRequiredPermits: Waiting for 1 more tasks to send their aggregator data, 
task ids: [-1]
2013-11-08 19:08:28,170 INFO org.apache.giraph.utils.TaskIdsPermitsBarrier: 
waitForRequiredPermits: Waiting for 1 more tasks to send their aggregator data, 
task ids: [-1]

Reply via email to