Hi Mike,

Thanks for trying to help out.

I had a talk with our networking guys this afternoon. According to them (and 
this is way out of my area of expertise, so excuse any mistakes), multiple 
interfaces shouldn't be a problem. We could set up a nameserver that resolves 
hostnames to addresses in our private space when the request comes from one of 
the nodes, and route that traffic over a single interface. Any other request 
would be resolved to an address in the public space, which is bound to another 
interface. In our current setup we're not even resolving hostnames in our 
private address space through a nameserver - we do it with an ugly hack in 
/etc/hosts. And it seems to work alright.
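
To make that concrete, here is a rough sketch of what such a split-horizon 
setup might look like with BIND views (the zone and file names below are made 
up; the address range is our private one):

  view "internal" {
      match-clients { 192.168.28.0/24; };  // requests coming from the nodes
      zone "hadoop.example.nl" {
          type master;
          file "db.hadoop.internal";       // resolves to 192.168.28.x
      };
  };
  view "external" {
      match-clients { any; };              // everyone else
      zone "hadoop.example.nl" {
          type master;
          file "db.hadoop.external";       // resolves to public addresses
      };
  };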

Having said that, our problems are still not completely gone even after 
adjusting the maximum allowed RAM for tasks - although things are much better. 
While writing this mail, three out of five DNs were marked as dead. There is 
still some swapping going on, but the cores are not spending any time in I/O 
wait, so this shouldn't be the cause. Below is a trace from a dead DN - any 
thoughts are appreciated!
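
For reference, these are the knobs I mean, in mapred-site.xml (the values here 
are illustrative, not our exact numbers):

  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx512m</value>  <!-- heap cap for each task JVM -->
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value>         <!-- concurrent map slots per node -->
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>         <!-- concurrent reduce slots per node -->
  </property>

With settings like these, worst-case task memory per node is roughly 
(map slots + reduce slots) x heap, on top of the DN and TT daemons - and that 
total has to stay under physical RAM, or the DN gets swapped out and starts 
missing heartbeats.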

Cheers,
Evert

2011-05-13 23:13:27,716 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Received block blk_-9131821326787012529_2915672 src: /192.168.28.211:60136 
dest: /192.168.28.214:50050 of size 382425
2011-05-13 23:13:27,915 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Exception in receiveBlock for block blk_-9132067116195286882_130888 
java.io.EOFException: while trying to read 3744913 bytes
2011-05-13 23:13:27,925 INFO 
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: 
/192.168.28.214:50050, dest: /192.168.28.214:35139, bytes: 0, op: HDFS_READ, 
cliID: DFSClient_attempt_201105131125_0025_m_001437_0, offset: 196608, srvID: 
DS-443352839-145.100.2.183-50050-1291128673616, blockid: 
blk_-9163184839986480695_4112368, duration: 6254000
2011-05-13 23:13:28,032 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Received block blk_-9149862728087355005_3793421 src: /192.168.28.210:41197 
dest: /192.168.28.214:50050 of size 245767
2011-05-13 23:13:28,033 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
Block blk_-9132067116195286882_130888 unfinalized and removed. 
2011-05-13 23:13:28,033 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
writeBlock blk_-9132067116195286882_130888 received exception 
java.io.EOFException: while trying to read 3744913 bytes
2011-05-13 23:13:28,033 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: 
DatanodeRegistration(192.168.28.214:50050, 
storageID=DS-443352839-145.100.2.183-50050-1291128673616, infoPort=50075, 
ipcPort=50020):DataXceiver
java.io.EOFException: while trying to read 3744913 bytes
        at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:270)
        at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:357)
        at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:378)
        at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:534)
        at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:417)
        at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:122)
2011-05-13 23:13:28,038 INFO 
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: 
/192.168.28.214:50050, dest: /192.168.28.214:32910, bytes: 0, op: HDFS_READ, 
cliID: DFSClient_attempt_201105131125_0025_m_001443_0, offset: 197632, srvID: 
DS-443352839-145.100.2.183-50050-1291128673616, blockid: 
blk_-9163184839986480695_4112368, duration: 4323000
2011-05-13 23:13:28,038 INFO 
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: 
/192.168.28.214:50050, dest: /192.168.28.214:35138, bytes: 0, op: HDFS_READ, 
cliID: DFSClient_attempt_201105131125_0025_m_001440_0, offset: 197120, srvID: 
DS-443352839-145.100.2.183-50050-1291128673616, blockid: 
blk_-9163184839986480695_4112368, duration: 5573000
2011-05-13 23:13:28,159 INFO 
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: 
/192.168.28.214:50050, dest: /192.168.28.212:38574, bytes: 0, op: HDFS_READ, 
cliID: DFSClient_attempt_201105131125_0025_m_001444_0, offset: 197632, srvID: 
DS-443352839-145.100.2.183-50050-1291128673616, blockid: 
blk_-9163184839986480695_4112368, duration: 16939000
2011-05-13 23:13:28,209 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Received block blk_-9123390874940601805_2898225 src: /192.168.28.210:44227 
dest: /192.168.28.214:50050 of size 300441
2011-05-13 23:13:28,217 INFO 
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: 
/192.168.28.214:50050, dest: /192.168.28.213:42364, bytes: 0, op: HDFS_READ, 
cliID: DFSClient_attempt_201105131125_0025_m_001451_0, offset: 198656, srvID: 
DS-443352839-145.100.2.183-50050-1291128673616, blockid: 
blk_-9163184839986480695_4112368, duration: 5291000
2011-05-13 23:13:28,252 INFO 
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: 
/192.168.28.214:50050, dest: /192.168.28.214:32930, bytes: 0, op: HDFS_READ, 
cliID: DFSClient_attempt_201105131125_0025_m_001436_0, offset: 0, srvID: 
DS-443352839-145.100.2.183-50050-1291128673616, blockid: 
blk_-1800696633107072247_4099834, duration: 5099000
2011-05-13 23:13:28,256 INFO 
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: 
/192.168.28.214:50050, dest: /192.168.28.213:42363, bytes: 0, op: HDFS_READ, 
cliID: DFSClient_attempt_201105131125_0025_m_001458_0, offset: 199680, srvID: 
DS-443352839-145.100.2.183-50050-1291128673616, blockid: 
blk_-9163184839986480695_4112368, duration: 4945000
2011-05-13 23:13:28,257 INFO 
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: 
/192.168.28.214:50050, dest: /192.168.28.214:35137, bytes: 0, op: HDFS_READ, 
cliID: DFSClient_attempt_201105131125_0025_m_001436_0, offset: 196608, srvID: 
DS-443352839-145.100.2.183-50050-1291128673616, blockid: 
blk_-9163184839986480695_4112368, duration: 4159000
2011-05-13 23:13:28,258 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Exception in receiveBlock for block blk_-9140444589483291821_3585975 
java.io.EOFException: while trying to read 100 bytes
2011-05-13 23:13:28,258 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
Block blk_-9140444589483291821_3585975 unfinalized and removed. 
2011-05-13 23:13:28,258 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
writeBlock blk_-9140444589483291821_3585975 received exception 
java.io.EOFException: while trying to read 100 bytes
2011-05-13 23:13:28,259 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: 
DatanodeRegistration(192.168.28.214:50050, 
storageID=DS-443352839-145.100.2.183-50050-1291128673616, infoPort=50075, 
ipcPort=50020):DataXceiver
java.io.EOFException: while trying to read 100 bytes
        at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:270)
        at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:357)
        at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:378)
        at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:534)
        at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:417)
        at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:122)
2011-05-13 23:13:28,264 INFO 
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: 
/192.168.28.214:50050, dest: /192.168.28.212:38553, bytes: 0, op: HDFS_READ, 
cliID: DFSClient_attempt_201105131125_0025_m_001441_0, offset: 0, srvID: 
DS-443352839-145.100.2.183-50050-1291128673616, blockid: 
blk_-5819719631677148140_4098274, duration: 5625000
2011-05-13 23:13:28,264 INFO 
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: 
/192.168.28.214:50050, dest: /192.168.28.212:38535, bytes: 0, op: HDFS_READ, 
cliID: DFSClient_attempt_201105131125_0025_m_001438_0, offset: 196608, srvID: 
DS-443352839-145.100.2.183-50050-1291128673616, blockid: 
blk_-9163184839986480695_4112368, duration: 4473000
2011-05-13 23:13:28,265 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
DatanodeRegistration(192.168.28.214:50050, 
storageID=DS-443352839-145.100.2.183-50050-1291128673616, infoPort=50075, 
ipcPort=50020): Exception writing block blk_-9150014886921014525_2267869 to 
mirror 192.168.28.213:50050
java.io.IOException: The stream is closed
        at 
org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:108)
        at 
java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
        at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
        at java.io.DataOutputStream.flush(DataOutputStream.java:106)
        at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:540)
        at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:417)
        at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:122)

2011-05-13 23:13:28,265 INFO 
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: 
/192.168.28.214:50050, dest: /192.168.28.213:45484, bytes: 0, op: HDFS_READ, 
cliID: DFSClient_attempt_201105131125_0025_m_001432_0, offset: 0, srvID: 
DS-443352839-145.100.2.183-50050-1291128673616, blockid: 
blk_405051931214094755_4098504, duration: 5597000
2011-05-13 23:13:28,273 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Received block blk_-9150014886921014525_2267869 src: /192.168.28.211:49208 
dest: /192.168.28.214:50050 of size 3033173
2011-05-13 23:13:28,313 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Received block blk_-9144765354308563975_3310572 src: /192.168.28.211:51592 
dest: /192.168.28.214:50050 of size 242383

________________________________________
From: Segel, Mike [mse...@navteq.com]
Sent: Friday, May 13, 2011 2:36 PM
To: general@hadoop.apache.org
Cc: <cdh-u...@cloudera.org>; <general@hadoop.apache.org>
Subject: Re: Stability issue - dead DN's

Bonded will work, but you may not see the performance you would expect. If you 
need >1 GbE, go 10 GbE: less headache and even more headroom.

Multiple interfaces won't work. Or I should say didn't work in past releases.
If you think about it, clients have to connect to each node. So having two 
interfaces and trying to manage them makes no sense.

Add to this trying to manage this in DNS... Why make more work for yourself?
Going from memory... it looked like your rDNS had to match your hostnames, so 
your internal interfaces had to match hostnames and you ended up with an 
inverted network.
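
In other words, forward and reverse lookups have to agree, something like this 
(hypothetical hostname, one of your addresses):

  $ host dn1.example.com
  dn1.example.com has address 192.168.28.214
  $ host 192.168.28.214
  214.28.168.192.in-addr.arpa domain name pointer dn1.example.com.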

If you draw out your network topology you end up with a ladder.
You would be better off (IMHO) creating a subnet where only your edge servers 
are dual-NIC'd.
But then if your cluster is for development... Now your PCs can't be used as 
clients...

Does this make sense?


Sent from a remote device. Please excuse any typos...

Mike Segel

On May 13, 2011, at 4:57 AM, "Evert Lammerts" <evert.lamme...@sara.nl> wrote:

> Hi Mike,
>
>> You really really don't want to do this.
>> Long story short... It won't work.
>
> Can you elaborate? Are you talking about the bonded interfaces or about 
> having a separated network for interconnects and external network? What can 
> go wrong there?
>
>>
>> Just a suggestion... You don't want anyone on your cluster itself. They
>> should interact with edge nodes, which are 'Hadoop aware'. Then your
>> cluster has a single network to worry about.
>
> That's our current setup. We have a single headnode that is used as a SPOE 
> (single point of entry). However, I'd like to change that on our future 
> production system. We want to implement Kerberos for authentication and let 
> users interact with the cluster from their own machines. This would enable 
> them to submit their jobs from their local IDE. As I understand it, the only 
> way to do this is by opening up the Hadoop ports to the world: if people 
> interact with HDFS they need to be able to reach all nodes, right? What 
> would be the argument against this?
>
> Cheers,
> Evert
>
>>
>>
>> Sent from a remote device. Please excuse any typos...
>>
>> Mike Segel
>>
>> On May 11, 2011, at 11:45 AM, Allen Wittenauer <a...@apache.org> wrote:
>>
>>>
>>>>> * a 2x1GE bonded network interface for interconnects
>>>>> * a 2x1GE bonded network interface for external access
>>>
>>>   Multiple NICs on a box can sometimes cause big performance
>> problems with Hadoop.  So watch your traffic carefully.
>>>


