Hi Jeff,

That seems like a reasonable config, but the error message you pasted shows the
datanode still enforcing a limit of 2047, not 4096, so the new value hasn't
taken effect. dfs.datanode.max.xcievers is only read at startup, so the
datanodes need a restart after the change.
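For what it's worth, you can confirm which limit a running datanode is actually
enforcing by grepping the most recent xceiver error out of its log. The log
path below is just a guess at your layout, so adjust it for your install:

  # Adjust for your HADOOP_LOG_DIR; the path here is an assumption
  grep "exceeds the limit of concurrent xcievers" \
      /var/log/hadoop/*datanode*.log | tail -n 1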
Also, in my experience SocketTimeoutExceptions are usually due to swapping.
Verify that your machines aren't swapping when you're under load.
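Nothing Hadoop-specific is needed for that check; while the cluster is under
load, watch the si/so (swap in/out) columns, where anything consistently
nonzero means the box is swapping:

  # Sample memory and swap activity every 5 seconds
  vmstat 5
  # Or check swap usage totals directly
  free -m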
BTW, since this is HBase-related, it may be better to move this to the hbase
user list.

-Todd

On Fri, Jun 4, 2010 at 9:37 AM, Jeff Whiting <je...@qualtrics.com> wrote:

> I've tried to follow it the best I can. I already increased the ulimit to
> 32768; a quick way to verify the new limit actually took effect is sketched
> below the config. This is what I now have in my hdfs-site.xml. Am I missing
> anything?
>
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>
> <!-- Put site-specific property overrides in this file. -->
>
> <configuration>
>   <property>
>     <name>dfs.data.dir</name>
>     <value>/media/sdb,/media/sdc,/media/sdd</value>
>   </property>
>
>   <property>
>     <name>dfs.replication</name>
>     <value>3</value>
>   </property>
>
>   <property>
>     <name>dfs.datanode.max.xcievers</name>
>     <value>4096</value>
>   </property>
>
>   <property>
>     <name>dfs.datanode.handler.count</name>
>     <value>10</value>
>   </property>
> </configuration>
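> To verify the higher ulimit actually reached the datanode process (it has to
> be raised for the user the daemon runs as), something like the following
> should work; the pgrep pattern is just illustrative:
>
>   # Look up the datanode pid and inspect its effective open-file limit
>   DN_PID=$(pgrep -f org.apache.hadoop.hdfs.server.datanode.DataNode)
>   grep "Max open files" /proc/$DN_PID/limits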
> Todd Lipcon wrote:
>
> Hi Jeff,
>
> Have you followed the HDFS configuration guide from the HBase wiki? You
> need to bump up the transceiver count and probably ulimit as well. Looks
> like you already tuned it to 2048, but that isn't high enough if you're
> still getting the "exceeds the limit" message.
>
> The EOFs and Connection Reset messages appear when DFS clients disconnect
> prematurely from a client stream (probably due to xceiver errors on other
> streams).
>
> -Todd
>
> On Fri, Jun 4, 2010 at 8:56 AM, jeff whiting <je...@qualtrics.com> wrote:
>
>> I had my HRegionServers go down due to an hdfs exception. In the datanode
>> logs I'm seeing a lot of different and varied exceptions. I've increased
>> the data xceiver count now, but these other ones don't make a lot of
>> sense. Among them are:
>>
>> 2010-06-04 07:41:56,917 ERROR datanode.DataNode (DataXceiver.java:run(131))
>> - DatanodeRegistration(192.168.1.184:50010,
>> storageID=DS-1601700079-192.168.1.184-50010-1274208308658, infoPort=50075,
>> ipcPort=50020):DataXceiver
>> java.io.EOFException
>>     at java.io.DataInputStream.readByte(DataInputStream.java:250)
>>     at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:298)
>>     at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:319)
>>     at org.apache.hadoop.io.Text.readString(Text.java:400)
>>     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:313)
>>     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:103)
>>     at java.lang.Thread.run(Thread.java:619)
>>
>> 2010-06-04 08:49:56,389 ERROR datanode.DataNode (DataXceiver.java:run(131))
>> - DatanodeRegistration(192.168.1.184:50010,
>> storageID=DS-1601700079-192.168.1.184-50010-1274208308658, infoPort=50075,
>> ipcPort=50020):DataXceiver
>> java.io.IOException: Connection reset by peer
>>     at sun.nio.ch.FileDispatcher.read0(Native Method)
>>     at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21)
>>     at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:233)
>>     at sun.nio.ch.IOUtil.read(IOUtil.java:206)
>>     at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236)
>>     at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55)
>>     at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142)
>>     at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
>>
>> 2010-06-04 05:36:54,840 ERROR datanode.DataNode (DataXceiver.java:run(131))
>> - DatanodeRegistration(192.168.1.184:50010,
>> storageID=DS-1601700079-192.168.1.184-50010-1274208308658, infoPort=50075,
>> ipcPort=50020):DataXceiver
>> java.io.IOException: xceiverCount 2049 exceeds the limit of concurrent
>> xcievers 2047
>>     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:88)
>>     at java.lang.Thread.run(Thread.java:619)
>>
>> 2010-06-04 05:36:48,848 ERROR datanode.DataNode (DataXceiver.java:run(131))
>> - DatanodeRegistration(192.168.1.184:50010,
>> storageID=DS-1601700079-192.168.1.184-50010-1274208308658, infoPort=50075,
>> ipcPort=50020):DataXceiver
>> java.net.SocketTimeoutException: 480000 millis timeout while waiting for
>> channel to be ready for write. ch :
>> java.nio.channels.SocketChannel[connected local=/192.168.1.184:50010
>> remote=/192.168.1.184:55349]
>>     at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
>>     at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
>>     at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
>>     at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:313)
>>     at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:400)
>>     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:180)
>>     at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:95)
>>     at java.lang.Thread.run(Thread.java:619)
>>
>> The EOFException is the most common one I get. I'm also unsure how I
>> would get a "connection reset by peer" when I'm connecting locally. Why
>> is the file prematurely ending? Any idea of what is going on?
>>
>> Thanks,
>> ~Jeff
>>
>> --
>> Jeff Whiting
>> Qualtrics Senior Software Engineer
>> je...@qualtrics.com
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>
> --
> Jeff Whiting
> Qualtrics Senior Software Engineer
> je...@qualtrics.com

--
Todd Lipcon
Software Engineer, Cloudera