-Nick
2009-09-29 09:15:53
Full thread dump Java HotSpot(TM) 64-Bit Server VM (14.2-b01 mixed mode):
"263851...@qtp0-1" prio=10 tid=0x00002aaaf846a000 nid=0x226b in Object.wait() [0x0000000041d24000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x00002aaade3587f8> (a org.mortbay.thread.QueuedThreadPool$PoolThread)
        at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:565)
        - locked <0x00002aaade3587f8> (a org.mortbay.thread.QueuedThreadPool$PoolThread)
"1837007...@qtp0-0" prio=10 tid=0x00002aaaf84d4000 nid=0x226a in Object.wait() [0x0000000041b22000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x00002aaade3592b8> (a org.mortbay.thread.QueuedThreadPool$PoolThread)
        at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:565)
        - locked <0x00002aaade3592b8> (a org.mortbay.thread.QueuedThreadPool$PoolThread)
"refreshUsed-/tmp/hadoop-root/dfs/data" daemon prio=10 tid=0x00002aaaf8456000 nid=0x2269 waiting on condition [0x0000000041c23000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
        at java.lang.Thread.sleep(Native Method)
        at org.apache.hadoop.fs.DU$DURefreshThread.run(DU.java:80)
        at java.lang.Thread.run(Thread.java:619)
"RMI TCP Accept-0" daemon prio=10 tid=0x00002aaaf834d800 nid=0x225a runnable [0x000000004171e000]
   java.lang.Thread.State: RUNNABLE
        at java.net.PlainSocketImpl.socketAccept(Native Method)
        at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:390)
        - locked <0x00002aaade358040> (a java.net.SocksSocketImpl)
        at java.net.ServerSocket.implAccept(ServerSocket.java:453)
        at java.net.ServerSocket.accept(ServerSocket.java:421)
        at sun.management.jmxremote.LocalRMIServerSocketFactory$1.accept(LocalRMIServerSocketFactory.java:34)
        at sun.rmi.transport.tcp.TCPTransport$AcceptLoop.executeAcceptLoop(TCPTransport.java:369)
        at sun.rmi.transport.tcp.TCPTransport$AcceptLoop.run(TCPTransport.java:341)
        at java.lang.Thread.run(Thread.java:619)
"Low Memory Detector" daemon prio=10 tid=0x00000000535f5000 nid=0x2259 runnable [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE
"CompilerThread1" daemon prio=10 tid=0x00000000535f1800 nid=0x2258 waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE
"CompilerThread0" daemon prio=10 tid=0x00000000535ef000 nid=0x2257 waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE
"Signal Dispatcher" daemon prio=10 tid=0x00000000535ec800 nid=0x2256 waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE
"Finalizer" daemon prio=10 tid=0x00000000535cf800 nid=0x2255 in Object.wait() [0x0000000041219000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x00002aaade3472f0> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118)
        - locked <0x00002aaade3472f0> (a java.lang.ref.ReferenceQueue$Lock)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134)
        at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)
"Reference Handler" daemon prio=10 tid=0x00000000535c8000 nid=0x2254 in Object.wait() [0x0000000041118000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x00002aaade3a2018> (a java.lang.ref.Reference$Lock)
        at java.lang.Object.wait(Object.java:485)
        at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
        - locked <0x00002aaade3a2018> (a java.lang.ref.Reference$Lock)
"main" prio=10 tid=0x0000000053554000 nid=0x2245 runnable [0x0000000040208000]
   java.lang.Thread.State: RUNNABLE
        at java.io.FileInputStream.readBytes(Native Method)
        at java.io.FileInputStream.read(FileInputStream.java:199)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
        - locked <0x00002aaade1e5870> (a java.io.BufferedInputStream)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
        - locked <0x00002aaade1e29f8> (a java.io.BufferedInputStream)
        at sun.security.provider.SeedGenerator$URLSeedGenerator.getSeedByte(SeedGenerator.java:453)
        at sun.security.provider.SeedGenerator.getSeedBytes(SeedGenerator.java:123)
        at sun.security.provider.SeedGenerator.generateSeed(SeedGenerator.java:118)
        at sun.security.provider.SecureRandom.engineGenerateSeed(SecureRandom.java:114)
        at sun.security.provider.SecureRandom.engineNextBytes(SecureRandom.java:171)
        - locked <0x00002aaade1e2500> (a sun.security.provider.SecureRandom)
        at java.security.SecureRandom.nextBytes(SecureRandom.java:433)
        - locked <0x00002aaade1e2830> (a java.security.SecureRandom)
        at java.security.SecureRandom.next(SecureRandom.java:455)
        at java.util.Random.nextLong(Random.java:284)
        at org.mortbay.jetty.servlet.HashSessionIdManager.doStart(HashSessionIdManager.java:139)
        at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
        - locked <0x00002aaade1e21c0> (a java.lang.Object)
        at org.mortbay.jetty.servlet.AbstractSessionManager.doStart(AbstractSessionManager.java:168)
        at org.mortbay.jetty.servlet.HashSessionManager.doStart(HashSessionManager.java:67)
        at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
        - locked <0x00002aaade334c00> (a java.lang.Object)
        at org.mortbay.jetty.servlet.SessionHandler.doStart(SessionHandler.java:115)
        at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
        - locked <0x00002aaade334b18> (a java.lang.Object)
        at org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:130)
        at org.mortbay.jetty.handler.ContextHandler.startContext(ContextHandler.java:537)
        at org.mortbay.jetty.servlet.Context.startContext(Context.java:136)
        at org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1234)
        at org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:517)
        at org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:460)
        at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
        - locked <0x00002aaade334ab0> (a java.lang.Object)
        at org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:152)
        at org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:156)
        at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
        - locked <0x00002aaade332c30> (a java.lang.Object)
        at org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:130)
        at org.mortbay.jetty.Server.doStart(Server.java:222)
        at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
        - locked <0x00002aaab44191a0> (a java.lang.Object)
        at org.apache.hadoop.http.HttpServer.start(HttpServer.java:460)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:375)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.<init>(DataNode.java:216)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1283)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1238)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1246)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1368)
"VM Thread" prio=10 tid=0x00000000535c1800 nid=0x2253 runnable
"GC task thread#0 (ParallelGC)" prio=10 tid=0x000000005355e000 nid=0x2246 runnable
"GC task thread#1 (ParallelGC)" prio=10 tid=0x0000000053560000 nid=0x2247 runnable
"GC task thread#2 (ParallelGC)" prio=10 tid=0x0000000053561800 nid=0x2248 runnable
"GC task thread#3 (ParallelGC)" prio=10 tid=0x0000000053563800 nid=0x2249 runnable
"GC task thread#4 (ParallelGC)" prio=10 tid=0x0000000053565800 nid=0x224a runnable
"GC task thread#5 (ParallelGC)" prio=10 tid=0x0000000053567000 nid=0x224b runnable
"GC task thread#6 (ParallelGC)" prio=10 tid=0x0000000053569000 nid=0x224c runnable
"GC task thread#7 (ParallelGC)" prio=10 tid=0x000000005356b000 nid=0x224d runnable
"GC task thread#8 (ParallelGC)" prio=10 tid=0x000000005356c800 nid=0x224e runnable
"GC task thread#9 (ParallelGC)" prio=10 tid=0x000000005356e800 nid=0x224f runnable
"GC task thread#10 (ParallelGC)" prio=10 tid=0x0000000053570800 nid=0x2250 runnable
"GC task thread#11 (ParallelGC)" prio=10 tid=0x0000000053572000 nid=0x2251 runnable
"GC task thread#12 (ParallelGC)" prio=10 tid=0x0000000053574000 nid=0x2252 runnable
"VM Periodic Task Thread" prio=10 tid=0x00002aaaf835f800 nid=0x225b waiting on condition
JNI global references: 715

Heap
 PSYoungGen      total 5312K, used 5185K [0x00002aaaddde0000, 0x00002aaade5a0000, 0x00002aaaf2b30000)
  eden space 4416K, 97% used [0x00002aaaddde0000,0x00002aaade210688,0x00002aaade230000)
  from space 896K, 100% used [0x00002aaade320000,0x00002aaade400000,0x00002aaade400000)
  to   space 960K, 0% used [0x00002aaade230000,0x00002aaade230000,0x00002aaade320000)
 PSOldGen        total 5312K, used 1172K [0x00002aaab4330000, 0x00002aaab4860000, 0x00002aaaddde0000)
  object space 5312K, 22% used [0x00002aaab4330000,0x00002aaab44550b8,0x00002aaab4860000)
 PSPermGen       total 21248K, used 13354K [0x00002aaaaef30000, 0x00002aaab03f0000, 0x00002aaab4330000)
  object space 21248K, 62% used [0x00002aaaaef30000,0x00002aaaafc3a818,0x00002aaab03f0000)
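One thing worth noting in the dump: the "main" thread is blocked inside sun.security.provider.SeedGenerator reading seed bytes for SecureRandom while Jetty's HashSessionIdManager starts up. On headless or netbooted Linux nodes the default seed source is typically /dev/random, which blocks when the kernel entropy pool is empty. A quick check on a stuck node (a sketch, assuming Linux):

```shell
# Sketch: inspect the kernel entropy pool on a stuck node (Linux only).
# A value near zero means reads from /dev/random will block, which is
# consistent with "main" being stuck in SeedGenerator above.
cat /proc/sys/kernel/random/entropy_avail
```

If entropy turns out to be the problem, a common workaround (an assumption here, not something tested in this thread) is to point the JVM at the non-blocking device, e.g. -Djava.security.egd=file:/dev/./urandom.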
Brian Bockelman wrote:
Hey Nick,
I believe the mailing list stripped out your attachment.
Brian
On Sep 29, 2009, at 10:22 AM, Nick Rathke wrote:
Hi,
Here is the dump. I looked it over and unfortunately it is pretty
meaningless to me at this point. Any help deciphering it would be
greatly appreciated.
I have also now disabled the IB interface on my 2 test systems;
unfortunately, that had no impact.
-Nick
Todd Lipcon wrote:
Hi Nick,
Figure out the pid of the DataNode process using either "jps" or straight
"ps auxw | grep DataNode", and then kill -QUIT <pid>. That should cause
the node to dump its stack to its stdout. That'll either end up in the
.out file in your logs directory, or on your console, depending on how
you started the daemon.
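The steps above can be sketched as follows (the grep/awk patterns are illustrative):

```shell
# Sketch: find the DataNode pid, then send SIGQUIT so the JVM dumps
# all thread stacks to its stdout (usually the .out file in the logs
# directory).
pid=$(jps | awk '/DataNode/ {print $1}')
# Fallback if jps is unavailable; the bracket trick excludes the grep
# process itself:
# pid=$(ps auxw | grep '[D]ataNode' | awk '{print $2}')
kill -QUIT "$pid"
```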
-Todd
On Mon, Sep 28, 2009 at 9:11 PM, Nick Rathke <n...@sci.utah.edu> wrote:
Hi Todd,
Unfortunately it never returns. It gives good info on a running node.
-bash-3.2# curl http://127.0.0.1:50075/stacks
If I do a stop-all on the master I get
curl: (52) Empty reply from server
on the stuck node.
If I do this in a browser I can see that it is **trying** to connect.
If I kill the java process I get "Server not found", but as long as the
java processes are running I just get a blank page.
Should I try a TCP dump and see if I can see packets flowing? Would
that be of any help?
-Nick
Todd Lipcon wrote:
Hi Nick,
Can you curl http://127.0.0.1:50075/stacks on one of the stuck nodes
and paste the result? Sometimes that can give an indication as to where
things are getting stuck.
-Todd
On Mon, Sep 28, 2009 at 7:21 PM, Nick Rathke <n...@sci.utah.edu> wrote:
FYI I get the same hanging behavior if I follow the Hadoop quick start
for a single-node baseline configuration (no modified conf files).
-Nick
Brian Bockelman wrote:
Hey Nick,
Do you have any error messages appearing in the log files?
Brian
On Sep 28, 2009, at 2:06 PM, Nick Rathke wrote:
Ted Dunning wrote:
I think that the last time you asked this question, the suggestion was
to look at DNS and make sure that everything is exactly correct in the
net-boot configuration. Hadoop is very sensitive to network routing and
naming details.
So,
a) in your net-boot, how are IP addresses assigned?
We assign static IPs based on a node's MAC address via DHCP, so that
when a node is netbooted or booted with a local OS it gets the same IP
and hostname.
b) how are DNS names propagated?
Cluster DNS names are mixed in with our facility DNS servers. All nodes
have proper forward and reverse DNS lookups.
c) how have you guaranteed that (a) and (b) are exactly consistent?
By host MAC address. I have also manually confirmed this.
d) how have you guaranteed that every node can talk to every other node
both by name and IP address?
Local cluster DNS / DHCP, plus all nodes have all other nodes' host
names and IPs in /etc/hosts. I have compared all the config files for
DNS / DHCP / /etc/hosts to make sure all information is the same.
e) have you assured yourself that any reverse mapping that exists is
correct?
Yes, and tested.
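A consistency check along the lines described above might look like this (node names are illustrative; getent consults both /etc/hosts and DNS):

```shell
# Sketch: confirm forward and reverse lookups agree for each node.
# Hostnames here are illustrative placeholders.
for h in node01 node02; do
  ip=$(getent hosts "$h" | awk '{print $1}')
  rev=$(getent hosts "$ip" | awk '{print $2}')
  echo "$h -> $ip -> $rev"
done
```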
One more bit of information: the system boots on a 1Gb network; all
other network traffic, i.e. MPI and NFS to data volumes, is via IB.
The IB network also has proper forward/backward DNS entries. IB IP
addresses are set up at boot time via a script that takes the host IP
and a fixed offset to calculate the address for the IB interface. I
have also confirmed that the IB IP addresses match our DNS.
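The offset scheme described above could be sketched like this (the offset value and the octet it applies to are assumptions for illustration, not the actual script):

```shell
# Sketch: derive the IB interface IP from the host's 1Gb IP plus a
# fixed offset applied to the second octet (offset and octet choice
# are illustrative).
host_ip=10.0.1.23   # node's 1Gb address (illustrative)
offset=128          # fixed offset (illustrative)
ib_ip=$(echo "$host_ip" | awk -F. -v o="$offset" '{print $1 "." $2+o "." $3 "." $4}')
echo "$ib_ip"       # -> 10.128.1.23
```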
-Nick
On Mon, Sep 28, 2009 at 9:45 AM, Nick Rathke <n...@sci.utah.edu> wrote:
I am hoping that someone can help with this issue. I have a 64 node
cluster that we would like to run Hadoop on; most of the nodes are
netbooted via NFS.

Hadoop runs fine on nodes IF the node uses a local OS install, but
doesn't work when nodes are netbooted. Under netboot I can see that the
slaves have the correct Java processes running, but the Hadoop web
pages never show the nodes as available. The Hadoop logs on the nodes
also show that everything is running and started up correctly.

On the few nodes that have a local OS installed everything works just
fine and I can run the test jobs without issue (so far).

I am using the identical Hadoop install and configuration between
netbooted nodes and non-netbooted nodes.

Has anyone encountered this type of issue?
--
Nick Rathke
Scientific Computing and Imaging Institute
Sr. Systems Administrator
n...@sci.utah.edu
www.sci.utah.edu
801-587-9933
801-557-3832
"I came I saw I made it possible" Royal Bliss - Here They Come