Hi Brian / Todd,

-bash-3.2# cat /proc/sys/kernel/random/entropy_avail
128

So I did

rngd -r /dev/urandom -o /dev/random -f -t 1 &

and it **seems** to be working. The web page now shows the nodes as present, and the logs indicate that the clients have started correctly, but I have not yet tried to run any jobs.
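In case it helps anyone else hitting this, here is roughly how I'm double-checking it and keeping it across reboots. The rc.local bit is just a guess at where it belongs on our netboot image (and I drop -f there so rngd daemonizes):

# quick sanity check that the pool is actually refilling now
cat /proc/sys/kernel/random/entropy_avail

# hypothetical: make the workaround survive a reboot of the netbooted image
echo 'rngd -r /dev/urandom -o /dev/random -t 1' >> /etc/rc.local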

Thanks for all your help!!

-Nick

Brian Bockelman wrote:
Hey Nick,

Try this:
cat /proc/sys/kernel/random/entropy_avail

Is it a small number (<300)?

Basically, one way Linux generates entropy is via input from the keyboard. So, as soon as you log into the NFS booted server, you've given it enough entropy for HDFS to start up.

Here's a relevant-looking link:

http://rackerhacker.com/2007/07/01/check-available-entropy-in-linux/
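If you want to watch it live, something like the following on a stuck node while the datanode starts should show the pool sitting near zero (assuming watch is on your image):

watch -n 1 cat /proc/sys/kernel/random/entropy_avail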

Brian

On Sep 29, 2009, at 1:27 PM, Nick Rathke wrote:

Great. I'll look at this fix. Here is what I got based on Brian's info.

lsof -p gave me:

java 12739 root 50r CHR 1,8 3335 /dev/random
java 12739 root 51r CHR 1,9 3325 /dev/urandom

.
.
.
.

java 12739 root 66r CHR 1,8 3335 /dev/random

Both do exist in /dev, and securerandom.source was already set to

securerandom.source=file:/dev/urandom

I have also checked that the permissions on said file are the same between the NFS-booted nodes and the local-OS nodes.
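For reference, this is roughly how I compared the two kinds of nodes; $JAVA_HOME here is just whatever JRE the datanode actually launches with, so treat the path as an assumption:

grep '^securerandom.source' $JAVA_HOME/jre/lib/security/java.security
ls -l /dev/random /dev/urandom    # run on an NFS-booted node and a local-OS node and compare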
-Nick



Todd Lipcon wrote:
Yep, this is a common problem. The fix that Brian outlined helps a lot, but if you are *really* strapped for random bits, you'll still block. This is because even if you've set the random source, it still uses the real /dev/random to grab a seed for the PRNG, at least on my system.

On systems where I know I don't care about true randomness, I also use this
trick:

http://www.chrissearle.org/blog/technical/increase_entropy_26_kernel_linux_box

It's very handy for boxes running Hudson that start and stop multi-node pseudo-distributed Hadoop clusters regularly.
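Another knob that gets suggested a lot (I haven't verified it on every JVM) is the java.security.egd system property; the odd-looking /dev/./urandom spelling is the commonly reported way to stop the Sun JVM from special-casing /dev/urandom and falling back to the blocking seed generator. A sketch, assuming a stock 0.20-style conf/hadoop-env.sh:

# force SecureRandom seeding from urandom for the Hadoop daemons only
export HADOOP_OPTS="$HADOOP_OPTS -Djava.security.egd=file:/dev/./urandom"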

-Todd

On Tue, Sep 29, 2009 at 10:16 AM, Brian Bockelman <bbock...@cse.unl.edu> wrote:


Hey Nick,

Strange. It appears that the Jetty server has stalled while trying to read from /dev/random. Is it possible that some part of /dev isn't initialized before the datanode is launched?

Can you confirm this using "lsof -p <process ID>" ?
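Something along these lines should do it; the grep is only there to cut the noise, and <process ID> is the DataNode pid from jps or ps:

lsof -p <process ID> | grep -E '/dev/u?random'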

I've copied and pasted a solution I found in a forum via Google below.

Brian

Edit $JAVA_HOME/jre/lib/security/java.security and change the property:

securerandom.source=file:/dev/random

to:

securerandom.source=file:/dev/urandom
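If you want to roll that out in one shot, here is a rough sketch, assuming the stock java.security location; back the file up first, and note this changes the default for every Java app using that JRE:

cp $JAVA_HOME/jre/lib/security/java.security{,.bak}
sed -i 's|^securerandom.source=file:/dev/random|securerandom.source=file:/dev/urandom|' $JAVA_HOME/jre/lib/security/java.security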


On Sep 29, 2009, at 11:26 AM, Nick Rathke wrote:

Thanks. Here it is in all of its glory...

-Nick


2009-09-29 09:15:53
Full thread dump Java HotSpot(TM) 64-Bit Server VM (14.2-b01 mixed mode):

"263851...@qtp0-1" prio=10 tid=0x00002aaaf846a000 nid=0x226b in
Object.wait() [0x0000000041d24000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00002aaade3587f8> (a
org.mortbay.thread.QueuedThreadPool$PoolThread)
at
org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:565)
- locked <0x00002aaade3587f8> (a
org.mortbay.thread.QueuedThreadPool$PoolThread)

"1837007...@qtp0-0" prio=10 tid=0x00002aaaf84d4000 nid=0x226a in
Object.wait() [0x0000000041b22000]
java.lang.Thread.State: TIMED_WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00002aaade3592b8> (a
org.mortbay.thread.QueuedThreadPool$PoolThread)
at
org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:565)
- locked <0x00002aaade3592b8> (a
org.mortbay.thread.QueuedThreadPool$PoolThread)

"refreshUsed-/tmp/hadoop-root/dfs/data" daemon prio=10
tid=0x00002aaaf8456000 nid=0x2269 waiting on condition [0x0000000041c23000]
java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at org.apache.hadoop.fs.DU$DURefreshThread.run(DU.java:80)
at java.lang.Thread.run(Thread.java:619)

"RMI TCP Accept-0" daemon prio=10 tid=0x00002aaaf834d800 nid=0x225a
runnable [0x000000004171e000]
java.lang.Thread.State: RUNNABLE
at java.net.PlainSocketImpl.socketAccept(Native Method)
at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:390)
- locked <0x00002aaade358040> (a java.net.SocksSocketImpl)
at java.net.ServerSocket.implAccept(ServerSocket.java:453)
at java.net.ServerSocket.accept(ServerSocket.java:421)
at
sun.management.jmxremote.LocalRMIServerSocketFactory$1.accept(LocalRMIServerSocketFactory.java:34)
at
sun.rmi.transport.tcp.TCPTransport$AcceptLoop.executeAcceptLoop(TCPTransport.java:369)
at
sun.rmi.transport.tcp.TCPTransport$AcceptLoop.run(TCPTransport.java:341)
at java.lang.Thread.run(Thread.java:619)

"Low Memory Detector" daemon prio=10 tid=0x00000000535f5000 nid=0x2259
runnable [0x0000000000000000]
java.lang.Thread.State: RUNNABLE

"CompilerThread1" daemon prio=10 tid=0x00000000535f1800 nid=0x2258 waiting
on condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE

"CompilerThread0" daemon prio=10 tid=0x00000000535ef000 nid=0x2257 waiting
on condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE

"Signal Dispatcher" daemon prio=10 tid=0x00000000535ec800 nid=0x2256
waiting on condition [0x0000000000000000]
java.lang.Thread.State: RUNNABLE

"Finalizer" daemon prio=10 tid=0x00000000535cf800 nid=0x2255 in
Object.wait() [0x0000000041219000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00002aaade3472f0> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:118)
- locked <0x00002aaade3472f0> (a java.lang.ref.ReferenceQueue$Lock)
at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:134)
at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)

"Reference Handler" daemon prio=10 tid=0x00000000535c8000 nid=0x2254 in
Object.wait() [0x0000000041118000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00002aaade3a2018> (a java.lang.ref.Reference$Lock)
at java.lang.Object.wait(Object.java:485)
at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)
- locked <0x00002aaade3a2018> (a java.lang.ref.Reference$Lock)

"main" prio=10 tid=0x0000000053554000 nid=0x2245 runnable
[0x0000000040208000]
java.lang.Thread.State: RUNNABLE
at java.io.FileInputStream.readBytes(Native Method)
at java.io.FileInputStream.read(FileInputStream.java:199)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
- locked <0x00002aaade1e5870> (a java.io.BufferedInputStream)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:258)
at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
- locked <0x00002aaade1e29f8> (a java.io.BufferedInputStream)
at
sun.security.provider.SeedGenerator$URLSeedGenerator.getSeedByte(SeedGenerator.java:453)
at
sun.security.provider.SeedGenerator.getSeedBytes(SeedGenerator.java:123)
at
sun.security.provider.SeedGenerator.generateSeed(SeedGenerator.java:118)
at
sun.security.provider.SecureRandom.engineGenerateSeed(SecureRandom.java:114)
at
sun.security.provider.SecureRandom.engineNextBytes(SecureRandom.java:171)
- locked <0x00002aaade1e2500> (a sun.security.provider.SecureRandom)
at java.security.SecureRandom.nextBytes(SecureRandom.java:433)
- locked <0x00002aaade1e2830> (a java.security.SecureRandom)
at java.security.SecureRandom.next(SecureRandom.java:455)
at java.util.Random.nextLong(Random.java:284)
at
org.mortbay.jetty.servlet.HashSessionIdManager.doStart(HashSessionIdManager.java:139)
at
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
- locked <0x00002aaade1e21c0> (a java.lang.Object)
at
org.mortbay.jetty.servlet.AbstractSessionManager.doStart(AbstractSessionManager.java:168)
at
org.mortbay.jetty.servlet.HashSessionManager.doStart(HashSessionManager.java:67)
at
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
- locked <0x00002aaade334c00> (a java.lang.Object)
at
org.mortbay.jetty.servlet.SessionHandler.doStart(SessionHandler.java:115)
at
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
- locked <0x00002aaade334b18> (a java.lang.Object)
at
org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:130)
at
org.mortbay.jetty.handler.ContextHandler.startContext(ContextHandler.java:537)
at org.mortbay.jetty.servlet.Context.startContext(Context.java:136)
at
org.mortbay.jetty.webapp.WebAppContext.startContext(WebAppContext.java:1234)
at
org.mortbay.jetty.handler.ContextHandler.doStart(ContextHandler.java:517) at org.mortbay.jetty.webapp.WebAppContext.doStart(WebAppContext.java:460)
at
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
- locked <0x00002aaade334ab0> (a java.lang.Object)
at
org.mortbay.jetty.handler.HandlerCollection.doStart(HandlerCollection.java:152)
at
org.mortbay.jetty.handler.ContextHandlerCollection.doStart(ContextHandlerCollection.java:156)
at
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
- locked <0x00002aaade332c30> (a java.lang.Object)
at
org.mortbay.jetty.handler.HandlerWrapper.doStart(HandlerWrapper.java:130)
at org.mortbay.jetty.Server.doStart(Server.java:222)
at
org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:50)
- locked <0x00002aaab44191a0> (a java.lang.Object)
at org.apache.hadoop.http.HttpServer.start(HttpServer.java:460)
at
org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:375)
at
org.apache.hadoop.hdfs.server.datanode.DataNode.<init>(DataNode.java:216)
at
org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1283)
at
org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1238)
at
org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1246)
at
org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1368)

"VM Thread" prio=10 tid=0x00000000535c1800 nid=0x2253 runnable

"GC task thread#0 (ParallelGC)" prio=10 tid=0x000000005355e000 nid=0x2246
runnable

"GC task thread#1 (ParallelGC)" prio=10 tid=0x0000000053560000 nid=0x2247
runnable

"GC task thread#2 (ParallelGC)" prio=10 tid=0x0000000053561800 nid=0x2248
runnable

"GC task thread#3 (ParallelGC)" prio=10 tid=0x0000000053563800 nid=0x2249
runnable

"GC task thread#4 (ParallelGC)" prio=10 tid=0x0000000053565800 nid=0x224a
runnable

"GC task thread#5 (ParallelGC)" prio=10 tid=0x0000000053567000 nid=0x224b
runnable

"GC task thread#6 (ParallelGC)" prio=10 tid=0x0000000053569000 nid=0x224c
runnable

"GC task thread#7 (ParallelGC)" prio=10 tid=0x000000005356b000 nid=0x224d
runnable

"GC task thread#8 (ParallelGC)" prio=10 tid=0x000000005356c800 nid=0x224e
runnable

"GC task thread#9 (ParallelGC)" prio=10 tid=0x000000005356e800 nid=0x224f
runnable

"GC task thread#10 (ParallelGC)" prio=10 tid=0x0000000053570800 nid=0x2250
runnable

"GC task thread#11 (ParallelGC)" prio=10 tid=0x0000000053572000 nid=0x2251
runnable

"GC task thread#12 (ParallelGC)" prio=10 tid=0x0000000053574000 nid=0x2252
runnable

"VM Periodic Task Thread" prio=10 tid=0x00002aaaf835f800 nid=0x225b
waiting on condition

JNI global references: 715

Heap
 PSYoungGen      total 5312K, used 5185K [0x00002aaaddde0000, 0x00002aaade5a0000, 0x00002aaaf2b30000)
  eden space 4416K, 97% used [0x00002aaaddde0000,0x00002aaade210688,0x00002aaade230000)
  from space 896K, 100% used [0x00002aaade320000,0x00002aaade400000,0x00002aaade400000)
  to   space 960K, 0% used [0x00002aaade230000,0x00002aaade230000,0x00002aaade320000)
 PSOldGen        total 5312K, used 1172K [0x00002aaab4330000, 0x00002aaab4860000, 0x00002aaaddde0000)
  object space 5312K, 22% used [0x00002aaab4330000,0x00002aaab44550b8,0x00002aaab4860000)
 PSPermGen       total 21248K, used 13354K [0x00002aaaaef30000, 0x00002aaab03f0000, 0x00002aaab4330000)
  object space 21248K, 62% used [0x00002aaaaef30000,0x00002aaaafc3a818,0x00002aaab03f0000)




Brian Bockelman wrote:


Hey Nick,

I believe the mailing list stripped out your attachment.

Brian

On Sep 29, 2009, at 10:22 AM, Nick Rathke wrote:

Hi,

Here is the dump. I looked it over and unfortunately it is pretty meaningless to me at this point. Any help deciphering it would be greatly appreciated.

I have also now disabled the IB interface on my two test systems; unfortunately that had no impact.

-Nick


Todd Lipcon wrote:


Hi Nick,

Figure out the pid of the DataNode process using either "jps" or straight "ps auxw | grep DataNode", and then kill -QUIT <pid>. That should cause the node to dump its stack to its stdout. That'll either end up in the .out file in your logs directory, or on your console, depending on how you started the daemon.
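For instance, something like this usually works for me; the log file name pattern is just what a stock 0.20-style install produces, so adjust for your layout:

# find the DataNode pid and ask the JVM for a thread dump
DN_PID=$(jps | awk '/DataNode/ {print $1}')
kill -QUIT "$DN_PID"
# the dump typically lands in the daemon's .out file (name will vary)
tail -n 200 $HADOOP_HOME/logs/hadoop-*-datanode-*.out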

-Todd

On Mon, Sep 28, 2009 at 9:11 PM, Nick Rathke <n...@sci.utah.edu>
wrote:


Hi Todd,

Unfortunately it never returns on the stuck node, though it gives good info on a running node.

-bash-3.2# curl http://127.0.0.1:50075/stacks

If I do a stop-all on the master I get

curl: (52) Empty reply from server

on the stuck node.

If I do this in a browser I can see that it is **trying** to connect; if I kill the java processes I get "Server not found", but as long as the java processes are running I just get a blank page.

Should I try a TCP dump and see if I can see packets flowing? Would that be of any help?
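Here is roughly what I had in mind if it's worth running; eth0 and the 50010 data-transfer default are assumptions on my part (50075 is the web UI port from the curl above):

tcpdump -i eth0 -nn 'port 50075 or port 50010'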

-Nick



Todd Lipcon wrote:


Hi Nick,

Can you curl http://127.0.0.1:50075/stacks on one of the stuck nodes and paste the result?

Sometimes that can give an indication as to where things are getting stuck.

-Todd

On Mon, Sep 28, 2009 at 7:21 PM, Nick Rathke <n...@sci.utah.edu>
wrote:




FYI I get the same hanging behavior if I follow the Hadoop quick start for a single-node baseline configuration (no modified conf files).

-Nick



Brian Bockelman wrote:




Hey Nick,

Do you have any error messages appearing in the log files?

Brian

On Sep 28, 2009, at 2:06 PM, Nick Rathke wrote:

Ted Dunning wrote:



I think that the last time you asked this question, the suggestion was to look at DNS and make sure that everything is exactly correct in the net-boot configuration. Hadoop is very sensitive to network routing and naming details.

So,

a) in your net-boot, how are IP addresses assigned?

We assign static IPs based on a node's MAC address via DHCP, so that when a node is netbooted or booted with a local OS it gets the same IP and hostname.

b) how are DNS names propagated?

Cluster DNS names are mixed in with our facility DNS servers. All nodes have proper forward and reverse DNS lookups.

c) how have you guaranteed that (a) and (b) are exactly consistent?

By host MAC address. I have also manually confirmed this.

d) how have you guaranteed that every node can talk to every other node both by name and IP address?

Local cluster DNS / DHCP, plus all nodes have all other nodes' host names and IPs in /etc/hosts. I have compared all the config files for DNS, DHCP, and /etc/hosts to make sure all the information is the same.

e) have you assured yourself that any reverse mapping that exists is correct?

Yes, and tested.



One more bit of information: the system boots on a 1Gb network; all other network traffic, i.e. MPI and NFS to data volumes, is via IB.

The IB network also has proper forward/reverse DNS entries. IB IP addresses are set up at boot time via a script that takes the host IP and a fixed offset to calculate the address for the IB interface. I have also confirmed that the IB IP addresses match our DNS.

-Nick


On Mon, Sep 28, 2009 at 9:45 AM, Nick Rathke <n...@sci.utah.edu>
wrote:



I am hoping that someone can help with this issue. I have a 64-node cluster that we would like to run Hadoop on; most of the nodes are netbooted via NFS.

Hadoop runs fine on nodes IF the node uses a local OS install, but it doesn't work when nodes are netbooted. Under netboot I can see that the slaves have the correct Java processes running, but the Hadoop web pages never show the nodes as available. The Hadoop logs on the nodes also show that everything is running and started up correctly.

On the few nodes that have a local OS installed everything works just fine and I can run the test jobs without issue (so far).

I am using the identical Hadoop install and configuration between netbooted nodes and non-netbooted nodes.

Has anyone encountered this type of issue?












--
Nick Rathke
Scientific Computing and Imaging Institute
Sr. Systems Administrator
n...@sci.utah.edu
www.sci.utah.edu
801-587-9933
801-557-3832

"I came I saw I made it possible" Royal Bliss - Here They Come
