Hi Mike,

Thanks for trying to help out.

I had a talk with our networking guys this afternoon. According to them (and this is way out of my area of expertise, so excuse any mistakes), multiple interfaces shouldn't be a problem. We could set up a nameserver that resolves hostnames to addresses in our private space when the request comes from one of the nodes, and route this traffic over a single interface. Any other request is resolved to an address in the public space, which is bound to another interface. In our current setup we're not even resolving hostnames in our private address space through a nameserver - we do it with an ugly hack in /etc/hosts. And it seems to work all right.

Having said that, our problems are still not completely gone even after adjusting the maximum allowed RAM for tasks - although things are a lot better. While writing this mail, three out of five DNs were marked as dead. There is still some swapping going on, but the cores are not spending any time in iowait, so this shouldn't be the cause of anything. See below a trace from a dead DN - any thoughts are appreciated!
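To sketch the /etc/hosts hack I mean (the hostnames below are invented for illustration - only the 192.168.28.x addresses appear in our logs): each node pins its peers to private-space addresses, so cluster traffic stays on the internal interface, while the outside world resolves the same names to public addresses through the regular nameserver.

```
# /etc/hosts on each worker node - the "ugly hack": pin the cluster peers
# to their private-space addresses so HDFS/MapReduce traffic stays on the
# internal interface. Hostnames are hypothetical; the addresses are the
# five DNs visible in the trace below.
192.168.28.210   node01
192.168.28.211   node02
192.168.28.212   node03
192.168.28.213   node04
192.168.28.214   node05
```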
Cheers,
Evert

2011-05-13 23:13:27,716 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Received block blk_-9131821326787012529_2915672 src: /192.168.28.211:60136 dest: /192.168.28.214:50050 of size 382425
2011-05-13 23:13:27,915 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock for block blk_-9132067116195286882_130888 java.io.EOFException: while trying to read 3744913 bytes
2011-05-13 23:13:27,925 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.28.214:50050, dest: /192.168.28.214:35139, bytes: 0, op: HDFS_READ, cliID: DFSClient_attempt_201105131125_0025_m_001437_0, offset: 196608, srvID: DS-443352839-145.100.2.183-50050-1291128673616, blockid: blk_-9163184839986480695_4112368, duration: 6254000
2011-05-13 23:13:28,032 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Received block blk_-9149862728087355005_3793421 src: /192.168.28.210:41197 dest: /192.168.28.214:50050 of size 245767
2011-05-13 23:13:28,033 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Block blk_-9132067116195286882_130888 unfinalized and removed.
2011-05-13 23:13:28,033 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_-9132067116195286882_130888 received exception java.io.EOFException: while trying to read 3744913 bytes
2011-05-13 23:13:28,033 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.28.214:50050, storageID=DS-443352839-145.100.2.183-50050-1291128673616, infoPort=50075, ipcPort=50020):DataXceiver
java.io.EOFException: while trying to read 3744913 bytes
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:270)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:357)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:378)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:534)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:417)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:122)
2011-05-13 23:13:28,038 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.28.214:50050, dest: /192.168.28.214:32910, bytes: 0, op: HDFS_READ, cliID: DFSClient_attempt_201105131125_0025_m_001443_0, offset: 197632, srvID: DS-443352839-145.100.2.183-50050-1291128673616, blockid: blk_-9163184839986480695_4112368, duration: 4323000
2011-05-13 23:13:28,038 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.28.214:50050, dest: /192.168.28.214:35138, bytes: 0, op: HDFS_READ, cliID: DFSClient_attempt_201105131125_0025_m_001440_0, offset: 197120, srvID: DS-443352839-145.100.2.183-50050-1291128673616, blockid: blk_-9163184839986480695_4112368, duration: 5573000
2011-05-13 23:13:28,159 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.28.214:50050, dest: /192.168.28.212:38574, bytes: 0, op: HDFS_READ, cliID: DFSClient_attempt_201105131125_0025_m_001444_0, offset: 197632, srvID: DS-443352839-145.100.2.183-50050-1291128673616, blockid: blk_-9163184839986480695_4112368, duration: 16939000
2011-05-13 23:13:28,209 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Received block blk_-9123390874940601805_2898225 src: /192.168.28.210:44227 dest: /192.168.28.214:50050 of size 300441
2011-05-13 23:13:28,217 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.28.214:50050, dest: /192.168.28.213:42364, bytes: 0, op: HDFS_READ, cliID: DFSClient_attempt_201105131125_0025_m_001451_0, offset: 198656, srvID: DS-443352839-145.100.2.183-50050-1291128673616, blockid: blk_-9163184839986480695_4112368, duration: 5291000
2011-05-13 23:13:28,252 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.28.214:50050, dest: /192.168.28.214:32930, bytes: 0, op: HDFS_READ, cliID: DFSClient_attempt_201105131125_0025_m_001436_0, offset: 0, srvID: DS-443352839-145.100.2.183-50050-1291128673616, blockid: blk_-1800696633107072247_4099834, duration: 5099000
2011-05-13 23:13:28,256 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.28.214:50050, dest: /192.168.28.213:42363, bytes: 0, op: HDFS_READ, cliID: DFSClient_attempt_201105131125_0025_m_001458_0, offset: 199680, srvID: DS-443352839-145.100.2.183-50050-1291128673616, blockid: blk_-9163184839986480695_4112368, duration: 4945000
2011-05-13 23:13:28,257 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.28.214:50050, dest: /192.168.28.214:35137, bytes: 0, op: HDFS_READ, cliID: DFSClient_attempt_201105131125_0025_m_001436_0, offset: 196608, srvID: DS-443352839-145.100.2.183-50050-1291128673616, blockid: blk_-9163184839986480695_4112368, duration: 4159000
2011-05-13 23:13:28,258 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock for block blk_-9140444589483291821_3585975 java.io.EOFException: while trying to read 100 bytes
2011-05-13 23:13:28,258 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Block blk_-9140444589483291821_3585975 unfinalized and removed.
2011-05-13 23:13:28,258 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_-9140444589483291821_3585975 received exception java.io.EOFException: while trying to read 100 bytes
2011-05-13 23:13:28,259 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.28.214:50050, storageID=DS-443352839-145.100.2.183-50050-1291128673616, infoPort=50075, ipcPort=50020):DataXceiver
java.io.EOFException: while trying to read 100 bytes
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:270)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:357)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:378)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:534)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:417)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:122)
2011-05-13 23:13:28,264 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.28.214:50050, dest: /192.168.28.212:38553, bytes: 0, op: HDFS_READ, cliID: DFSClient_attempt_201105131125_0025_m_001441_0, offset: 0, srvID: DS-443352839-145.100.2.183-50050-1291128673616, blockid: blk_-5819719631677148140_4098274, duration: 5625000
2011-05-13 23:13:28,264 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.28.214:50050, dest: /192.168.28.212:38535, bytes: 0, op: HDFS_READ, cliID: DFSClient_attempt_201105131125_0025_m_001438_0, offset: 196608, srvID: DS-443352839-145.100.2.183-50050-1291128673616, blockid: blk_-9163184839986480695_4112368, duration: 4473000
2011-05-13 23:13:28,265 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.28.214:50050, storageID=DS-443352839-145.100.2.183-50050-1291128673616, infoPort=50075, ipcPort=50020): Exception writing block blk_-9150014886921014525_2267869 to mirror 192.168.28.213:50050
java.io.IOException: The stream is closed
        at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:108)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
        at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
        at java.io.DataOutputStream.flush(DataOutputStream.java:106)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:540)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:417)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:122)
2011-05-13 23:13:28,265 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.28.214:50050, dest: /192.168.28.213:45484, bytes: 0, op: HDFS_READ, cliID: DFSClient_attempt_201105131125_0025_m_001432_0, offset: 0, srvID: DS-443352839-145.100.2.183-50050-1291128673616, blockid: blk_405051931214094755_4098504, duration: 5597000
2011-05-13 23:13:28,273 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Received block blk_-9150014886921014525_2267869 src: /192.168.28.211:49208 dest: /192.168.28.214:50050 of size 3033173
2011-05-13 23:13:28,313 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Received block blk_-9144765354308563975_3310572 src: /192.168.28.211:51592 dest: /192.168.28.214:50050 of size 242383

________________________________________
From: Segel, Mike [mse...@navteq.com]
Sent: Friday, May 13, 2011 2:36 PM
To: general@hadoop.apache.org
Cc: <cdh-u...@cloudera.org>; <general@hadoop.apache.org>
Subject: Re: Stability issue - dead DN's

Bonded will work, but you may not see the performance you would expect. If you need >1 GbE, go 10GbE - less headache, and it has even more headroom.

Multiple interfaces won't work. Or I should say, they didn't work in past releases.
If you think about it, clients have to connect to each node, so having two interfaces and trying to manage them makes no sense. Add to this trying to manage it in DNS ... why make more work for yourself?

Going from memory... it looked like your rDNS had to match your hostnames, so your internal interfaces had to match hostnames, and you ended up with an inverted network. If you draw out your network topology you end up with a ladder.

You would be better off (IMHO) to create a subnet where only your edge servers are dual NIC'd. But then, if your cluster is for development... now your PCs can't be used as clients... Does this make sense?

Sent from a remote device. Please excuse any typos...

Mike Segel

On May 13, 2011, at 4:57 AM, "Evert Lammerts" <evert.lamme...@sara.nl> wrote:

> Hi Mike,
>
>> You really really don't want to do this.
>> Long story short... It won't work.
>
> Can you elaborate? Are you talking about the bonded interfaces or about having a separate network for interconnects and an external network? What can go wrong there?
>
>> Just a suggestion... You don't want anyone on your cluster itself. They
>> should interact with edge nodes, which are 'Hadoop aware'. Then your
>> cluster has a single network to worry about.
>
> That's our current setup. We have a single headnode that is used as a SPOE. However, I'd like to change that on our future production system. We want to implement Kerberos for authentication and let users interact with the cluster from their own machines. This would enable them to submit their jobs from a local IDE. My understanding is that the only way to do this is by opening up the Hadoop ports to the world: if people interact with HDFS they need to be able to interact with all nodes, right? What would be the argument against this?
>
> Cheers,
> Evert
>
>> Sent from a remote device. Please excuse any typos...
>>
>> Mike Segel
>>
>> On May 11, 2011, at 11:45 AM, Allen Wittenauer <a...@apache.org> wrote:
>>
>>>>> * a 2x1GE bonded network interface for interconnects
>>>>> * a 2x1GE bonded network interface for external access
>>>
>>> Multiple NICs on a box can sometimes cause big performance
>>> problems with Hadoop. So watch your traffic carefully.
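On Mike's point that reverse DNS has to line up with the hostnames: a quick way to sanity-check a cluster for this is to compare forward and reverse lookups for every node. Below is a small sketch (hostnames are hypothetical; the resolver mappings can be injected so it can be tried offline, and it falls back to live `socket` lookups when they are not supplied).

```python
import socket

def check_rdns(hostnames, forward=None, reverse=None):
    """Return {hostname: (ip, rdns_name)} for every host whose reverse
    DNS does not map its IP back to the same hostname.

    forward: optional dict hostname -> IP; reverse: optional dict
    IP -> hostname. When omitted, live DNS lookups are used instead.
    """
    if forward is None:
        forward = {h: socket.gethostbyname(h) for h in hostnames}
    if reverse is None:
        reverse = {ip: socket.gethostbyaddr(ip)[0] for ip in forward.values()}

    mismatches = {}
    for host in hostnames:
        ip = forward[host]
        back = reverse.get(ip)       # what the PTR record says
        if back != host:
            mismatches[host] = (ip, back)
    return mismatches

# Offline example with made-up names: node02's PTR record points at a
# different (external) name, so it gets flagged.
fwd = {"node01": "192.168.28.210", "node02": "192.168.28.211"}
rev = {"192.168.28.210": "node01", "192.168.28.211": "node02-ext"}
print(check_rdns(["node01", "node02"], fwd, rev))
# → {'node02': ('192.168.28.211', 'node02-ext')}
```

Run against the real node list (without the injected dicts) on a multi-homed box, this shows immediately which interface's address the resolver hands back and whether the PTR records agree with it.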