Sorry, I should have sent this to the hadoop list. We have resolved the issue. Earlier, hadoop was picking up <hadoop.tmp.dir>/dfs/data as the dfs data dir, which is the default when dfs.data.dir is not set. Later, when we specified the dfs.data.dir property in the config, hadoop used the configured paths exactly as given, without appending /dfs/data, so the datanode was looking for blocks directly in <dfs.data.dir> rather than in the /dfs/data subdirectory where they had been written. We changed the paths to include /dfs/data and it worked fine.
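
For the archives, here is a rough sketch of what the corrected hdfs-site.xml entry might look like. The directory names are an assumption: they are the four paths shown in the DataNode thread name in the quoted dump, with /dfs/data appended, and may not match the exact values we ended up with.

```xml
<?xml version="1.0"?>
<!-- hdfs-site.xml (Hadoop 1.x): illustrative fragment only.
     The four base paths are taken from the DataNode thread name in the
     quoted thread dump; appending /dfs/data to each is an assumption
     about the final config, based on the fix described above. -->
<configuration>
  <property>
    <name>dfs.data.dir</name>
    <!-- Comma-separated list; each entry is used exactly as written,
         so it must point at the directory that already holds the blocks. -->
    <value>/data/hadoopfs/dfs/data,/data1/hadoopfs/dfs/data,/data2/hadoopfs/dfs/data,/data3/hadoopfs/dfs/data</value>
  </property>
</configuration>
```

The point is only that an explicit dfs.data.dir has to name the full block directory, since nothing is appended to it the way /dfs/data is appended to hadoop.tmp.dir in the default.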
regards,
./Prem

On Mon, Jan 7, 2013 at 2:53 PM, prem yadav <ipremya...@gmail.com> wrote:

> Hi,
>
> We have been running hadoop without many issues for some time. Today we
> had a problem where the datanodes had their disks full and the cluster
> stopped working.
> We fixed things, modified the config to add directories to dfs.data.dir,
> and restarted.
>
> The hadoop version is 1.0.4.
>
> The issue is:
> the datanodes are not sending any block reports. No errors in the logs.
> The namenode shows there are 6 datanodes but never leaves safe mode, and
> the report ratio never goes up from 0.000.
>
> On one of the slaves, the jstack output is:
>
> 2013-01-07 09:13:04
> Full thread dump Java HotSpot(TM) 64-Bit Server VM (23.5-b02 mixed mode):
>
> "Attach Listener" daemon prio=10 tid=0x00007f40f0766800 nid=0x6268 waiting on condition [0x0000000000000000]
>    java.lang.Thread.State: RUNNABLE
>
> "org.apache.hadoop.hdfs.server.datanode.DataBlockScanner@207a0c69" daemon prio=10 tid=0x00007f40e001a000 nid=0x5f52 waiting on condition [0x00007f40d9219000]
>    java.lang.Thread.State: TIMED_WAITING (sleeping)
>         at java.lang.Thread.sleep(Native Method)
>         at org.apache.hadoop.hdfs.server.datanode.DataBlockScanner.run(DataBlockScanner.java:620)
>         at java.lang.Thread.run(Thread.java:722)
>
> "IPC Server handler 2 on 50020" daemon prio=10 tid=0x00007f40e0017800 nid=0x5f51 waiting on condition [0x00007f40d931a000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for <0x00000000eedc95b8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
>         at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1364)
>
> "IPC Server handler 1 on 50020" daemon prio=10 tid=0x00007f40e0015000 nid=0x5f50 waiting on condition [0x00007f40d941b000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for <0x00000000eedc95b8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
>         at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1364)
>
> "IPC Server handler 0 on 50020" daemon prio=10 tid=0x00007f40e0013000 nid=0x5f4f waiting on condition [0x00007f40d951c000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for <0x00000000eedc95b8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2043)
>         at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1364)
>
> "IPC Server listener on 50020" daemon prio=10 tid=0x00007f40e000a000 nid=0x5f4e runnable [0x00007f40d961d000]
>    java.lang.Thread.State: RUNNABLE
>         at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>         at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:228)
>         at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:81)
>         at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
>         - locked <0x00000000eeda0720> (a sun.nio.ch.Util$2)
>         - locked <0x00000000eeda0710> (a java.util.Collections$UnmodifiableSet)
>         - locked <0x00000000eeda04d0> (a sun.nio.ch.EPollSelectorImpl)
>         at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
>         at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:102)
>         at org.apache.hadoop.ipc.Server$Listener.run(Server.java:439)
>
> "IPC Server Responder" daemon prio=10 tid=0x00007f40e0008800 nid=0x5f4d runnable [0x00007f40d971e000]
>    java.lang.Thread.State: RUNNABLE
>         at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>         at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:228)
>         at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:81)
>         at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
>         - locked <0x00000000eedc99e0> (a sun.nio.ch.Util$2)
>         - locked <0x00000000eedc99d0> (a java.util.Collections$UnmodifiableSet)
>         - locked <0x00000000eedc97b0> (a sun.nio.ch.EPollSelectorImpl)
>         at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
>         at org.apache.hadoop.ipc.Server$Responder.run(Server.java:605)
>
> "org.apache.hadoop.hdfs.server.datanode.DataXceiverServer@75a61582" daemon prio=10 tid=0x00007f40e0007000 nid=0x5f4c runnable [0x00007f40d981f000]
>    java.lang.Thread.State: RUNNABLE
>         at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
>         at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:226)
>         - locked <0x00000000eeddb870> (a java.lang.Object)
>         at sun.nio.ch.ServerSocketAdaptor.accept(ServerSocketAdaptor.java:99)
>         - locked <0x00000000eeddb838> (a java.lang.Object)
>         at org.apache.hadoop.hdfs.server.datanode.DataXceiverServer.run(DataXceiverServer.java:131)
>         at java.lang.Thread.run(Thread.java:722)
>
> "DataNode: [/data/hadoopfs,/data1/hadoopfs,/data2/hadoopfs,/data3/hadoopfs]" daemon prio=10 tid=0x00007f40f0761000 nid=0x5f4b in Object.wait() [0x00007f40d9920000]
>    java.lang.Thread.State: TIMED_WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         - waiting on <0x00000000eeddb4f8> (a java.util.LinkedList)
>         at org.apache.hadoop.hdfs.server.datanode.DataNode.offerService(DataNode.java:1023)
>         - locked <0x00000000eeddb4f8> (a java.util.LinkedList)
>         at org.apache.hadoop.hdfs.server.datanode.DataNode.run(DataNode.java:1458)
>         at java.lang.Thread.run(Thread.java:722)
>
> "pool-1-thread-1" prio=10 tid=0x00007f40f075d800 nid=0x5f4a runnable [0x00007f40d9a21000]
>    java.lang.Thread.State: RUNNABLE
>         at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>         at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:228)
>         at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:81)
>         at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
>         - locked <0x00000000eeda0d40> (a sun.nio.ch.Util$2)
>         - locked <0x00000000eeda0d30> (a java.util.Collections$UnmodifiableSet)
>         - locked <0x00000000eeda0b00> (a sun.nio.ch.EPollSelectorImpl)
>         at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
>         at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:102)
>         at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:333)
>         - locked <0x00000000eeda0ae8> (a org.apache.hadoop.ipc.Server$Listener$Reader)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>         at java.lang.Thread.run(Thread.java:722)
>
> "Timer-0" daemon prio=10 tid=0x00007f40f019c800 nid=0x5f49 in Object.wait() [0x00007f40d9d69000]
>    java.lang.Thread.State: TIMED_WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         - waiting on <0x00000000eede50c0> (a java.util.TaskQueue)
>         at java.util.TimerThread.mainLoop(Timer.java:552)
>         - locked <0x00000000eede50c0> (a java.util.TaskQueue)
>         at java.util.TimerThread.run(Timer.java:505)
>
> "611753678@qtp-1701186867-1 - Acceptor0 SelectChannelConnector@0.0.0.0:50075" prio=10 tid=0x00007f40f0653000 nid=0x5f48 runnable [0x00007f40d9e6a000]
>    java.lang.Thread.State: RUNNABLE
>         at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
>         at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:228)
>         at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:81)
>         at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
>         - locked <0x00000000eee000f0> (a sun.nio.ch.Util$2)
>         - locked <0x00000000eee00100> (a java.util.Collections$UnmodifiableSet)
>         - locked <0x00000000eee000a8> (a sun.nio.ch.EPollSelectorImpl)
>         at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
>         at org.mortbay.io.nio.SelectorManager$SelectSet.doSelect(SelectorManager.java:498)
>         at org.mortbay.io.nio.SelectorManager.doSelect(SelectorManager.java:192)
>         at org.mortbay.jetty.nio.SelectChannelConnector.accept(SelectChannelConnector.java:124)
>         at org.mortbay.jetty.AbstractConnector$Acceptor.run(AbstractConnector.java:708)
>         at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
>
> "1261953562@qtp-1701186867-0" prio=10 tid=0x00007f40f0651800 nid=0x5f47 in Object.wait() [0x00007f40d9f6b000]
>    java.lang.Thread.State: TIMED_WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         - waiting on <0x00000000eede8068> (a org.mortbay.thread.QueuedThreadPool$PoolThread)
>         at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:626)
>         - locked <0x00000000eede8068> (a org.mortbay.thread.QueuedThreadPool$PoolThread)
>
> "Async Block Report Generator" daemon prio=10 tid=0x00007f40f05ec000 nid=0x5f46 in Object.wait() [0x00007f40da06c000]
>    java.lang.Thread.State: TIMED_WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         - waiting on <0x00000000eeddaed0> (a org.apache.hadoop.hdfs.server.datanode.FSDataset$AsyncBlockReport)
>         at org.apache.hadoop.hdfs.server.datanode.FSDataset$AsyncBlockReport.waitForReportRequest(FSDataset.java:2254)
>         - locked <0x00000000eeddaed0> (a org.apache.hadoop.hdfs.server.datanode.FSDataset$AsyncBlockReport)
>         at org.apache.hadoop.hdfs.server.datanode.FSDataset$AsyncBlockReport.run(FSDataset.java:2224)
>         at java.lang.Thread.run(Thread.java:722)
>
> "refreshUsed-/data3/hadoopfs" daemon prio=10 tid=0x00007f40f05e7000 nid=0x5f45 waiting on condition [0x00007f40da16d000]
>    java.lang.Thread.State: TIMED_WAITING (sleeping)
>         at java.lang.Thread.sleep(Native Method)
>         at org.apache.hadoop.fs.DU$DURefreshThread.run(DU.java:80)
>         at java.lang.Thread.run(Thread.java:722)
>
> "refreshUsed-/data2/hadoopfs" daemon prio=10 tid=0x00007f40f05e5800 nid=0x5f42 waiting on condition [0x00007f40e41d7000]
>    java.lang.Thread.State: TIMED_WAITING (sleeping)
>         at java.lang.Thread.sleep(Native Method)
>         at org.apache.hadoop.fs.DU$DURefreshThread.run(DU.java:80)
>         at java.lang.Thread.run(Thread.java:722)
>
> "refreshUsed-/data1/hadoopfs" daemon prio=10 tid=0x00007f40f05e4800 nid=0x5f3f waiting on condition [0x00007f40e42d8000]
>    java.lang.Thread.State: TIMED_WAITING (sleeping)
>         at java.lang.Thread.sleep(Native Method)
>         at org.apache.hadoop.fs.DU$DURefreshThread.run(DU.java:80)
>         at java.lang.Thread.run(Thread.java:722)
>
> "refreshUsed-/data/hadoopfs" daemon prio=10 tid=0x00007f40f05df000 nid=0x5f3c waiting on condition [0x00007f40e43d9000]
>    java.lang.Thread.State: TIMED_WAITING (sleeping)
>         at java.lang.Thread.sleep(Native Method)
>         at org.apache.hadoop.fs.DU$DURefreshThread.run(DU.java:80)
>         at java.lang.Thread.run(Thread.java:722)
>
> "IPC Client (47) connection to master:54310 from hadoop" daemon prio=10 tid=0x00007f40f05bd000 nid=0x5f39 in Object.wait() [0x00007f40e44da000]
>    java.lang.Thread.State: TIMED_WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         - waiting on <0x00000000eedca5f0> (a org.apache.hadoop.ipc.Client$Connection)
>         at org.apache.hadoop.ipc.Client$Connection.waitForWork(Client.java:706)
>         - locked <0x00000000eedca5f0> (a org.apache.hadoop.ipc.Client$Connection)
>         at org.apache.hadoop.ipc.Client$Connection.run(Client.java:748)
>
> "Timer for 'DataNode' metrics system" daemon prio=10 tid=0x00007f40f0509800 nid=0x5f27 in Object.wait() [0x00007f40e4804000]
>    java.lang.Thread.State: TIMED_WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         - waiting on <0x00000000eedf86d0> (a java.util.TaskQueue)
>         at java.util.TimerThread.mainLoop(Timer.java:552)
>         - locked <0x00000000eedf86d0> (a java.util.TaskQueue)
>         at java.util.TimerThread.run(Timer.java:505)
>
> "ganglia" daemon prio=10 tid=0x00007f40f0507000 nid=0x5f26 in Object.wait() [0x00007f40e4905000]
>    java.lang.Thread.State: WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         - waiting on <0x00000000eedf8790> (a org.apache.hadoop.metrics2.impl.SinkQueue)
>         at java.lang.Object.wait(Object.java:503)
>         at org.apache.hadoop.metrics2.impl.SinkQueue.waitForData(SinkQueue.java:109)
>         - locked <0x00000000eedf8790> (a org.apache.hadoop.metrics2.impl.SinkQueue)
>         at org.apache.hadoop.metrics2.impl.SinkQueue.consumeAll(SinkQueue.java:78)
>         at org.apache.hadoop.metrics2.impl.MetricsSinkAdapter.publishMetricsFromQueue(MetricsSinkAdapter.java:113)
>         at org.apache.hadoop.metrics2.impl.MetricsSinkAdapter$2.run(MetricsSinkAdapter.java:89)
>
> "RMI TCP Accept-0" daemon prio=10 tid=0x00007f40f0350000 nid=0x5f23 runnable [0x00007f40e4d0d000]
>    java.lang.Thread.State: RUNNABLE
>         at java.net.PlainSocketImpl.socketAccept(Native Method)
>         at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:398)
>         at java.net.ServerSocket.implAccept(ServerSocket.java:522)
>         at java.net.ServerSocket.accept(ServerSocket.java:490)
>         at sun.management.jmxremote.LocalRMIServerSocketFactory$1.accept(LocalRMIServerSocketFactory.java:52)
>         at sun.rmi.transport.tcp.TCPTransport$AcceptLoop.executeAcceptLoop(TCPTransport.java:387)
>         at sun.rmi.transport.tcp.TCPTransport$AcceptLoop.run(TCPTransport.java:359)
>         at java.lang.Thread.run(Thread.java:722)
>
> "Service Thread" daemon prio=10 tid=0x00007f40f00f1000 nid=0x5f22 runnable [0x0000000000000000]
>    java.lang.Thread.State: RUNNABLE
>
> "C2 CompilerThread1" daemon prio=10 tid=0x00007f40f00ee800 nid=0x5f21 waiting on condition [0x0000000000000000]
>    java.lang.Thread.State: RUNNABLE
>
> "C2 CompilerThread0" daemon prio=10 tid=0x00007f40f00eb800 nid=0x5f20 waiting on condition [0x0000000000000000]
>    java.lang.Thread.State: RUNNABLE
>
> "Signal Dispatcher" daemon prio=10 tid=0x00007f40f00e9800 nid=0x5f1f runnable [0x0000000000000000]
>    java.lang.Thread.State: RUNNABLE
>
> "Finalizer" daemon prio=10 tid=0x00007f40f009c800 nid=0x5f1e in Object.wait() [0x00007f40e5d2d000]
>    java.lang.Thread.State: WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         - waiting on <0x00000000eecd1208> (a java.lang.ref.ReferenceQueue$Lock)
>         at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:135)
>         - locked <0x00000000eecd1208> (a java.lang.ref.ReferenceQueue$Lock)
>         at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:151)
>         at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:177)
>
> "Reference Handler" daemon prio=10 tid=0x00007f40f009a800 nid=0x5f1d in Object.wait() [0x00007f40e5e2e000]
>    java.lang.Thread.State: WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         - waiting on <0x00000000eecd0d90> (a java.lang.ref.Reference$Lock)
>         at java.lang.Object.wait(Object.java:503)
>         at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:133)
>         - locked <0x00000000eecd0d90> (a java.lang.ref.Reference$Lock)
>
> "main" prio=10 tid=0x00007f40f0009800 nid=0x5f17 in Object.wait() [0x00007f40f5dce000]
>    java.lang.Thread.State: WAITING (on object monitor)
>         at java.lang.Object.wait(Native Method)
>         - waiting on <0x00000000eedf8570> (a java.lang.Thread)
>         at java.lang.Thread.join(Thread.java:1258)
>         - locked <0x00000000eedf8570> (a java.lang.Thread)
>         at java.lang.Thread.join(Thread.java:1332)
>         at org.apache.hadoop.hdfs.server.datanode.DataNode.join(DataNode.java:1547)
>         at org.apache.hadoop.hdfs.server.datanode.DataNode.secureMain(DataNode.java:1667)
>         at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1682)
>
> "VM Thread" prio=10 tid=0x00007f40f0093000 nid=0x5f1c runnable
>
> "GC task thread#0 (ParallelGC)" prio=10 tid=0x00007f40f0017800 nid=0x5f18 runnable
>
> "GC task thread#1 (ParallelGC)" prio=10 tid=0x00007f40f0019000 nid=0x5f19 runnable
>
> "GC task thread#2 (ParallelGC)" prio=10 tid=0x00007f40f001b000 nid=0x5f1a runnable
>
> "GC task thread#3 (ParallelGC)" prio=10 tid=0x00007f40f001d000 nid=0x5f1b runnable
>
> "VM Periodic Task Thread" prio=10 tid=0x00007f40f0376000 nid=0x5f24 waiting on condition
>
> JNI global references: 216
>
> Any help would be great. Right now, I am not even sure where to look for
> issues.
>
> regards.