Hey guys, thanks for the help, but I am still stuck. I tried the GC change you suggested ("instead of CMSIncrementalMode try UseParNewGC").
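For context, a collector switch like that would typically go in hbase-env.sh. A minimal sketch, assuming CMS stays as the old-generation collector; the file path, log path, and occupancy fraction below are illustrative, not our actual production values:

```shell
# hbase-env.sh -- illustrative sketch, not an exact production config.
# Drop incremental CMS mode and run the parallel young-generation
# collector (ParNew) in front of CMS, with GC logging enabled so long
# pauses can be confirmed in the log.
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS \
  -XX:+UseConcMarkSweepGC \
  -XX:+UseParNewGC \
  -XX:-CMSIncrementalMode \
  -XX:CMSInitiatingOccupancyFraction=70 \
  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
  -Xloggc:/var/log/hbase/gc-regionserver.log"
```

The GC logging flags are there so the "long pauses in GC log" symptom mentioned later in this thread can actually be measured rather than inferred from ZooKeeper session timeouts.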
I also checked for swap, which vmstat always reports as zero (analyzing top is not an option). The load average never gets higher than 10.0 on a 16-CPU machine, and it is usually around 1.5. Finally, I tried "-XX:MaxDirectMemorySize=2G" on the datanode, but nothing changed. The datanode still logs a lot of the following errors, and the RSs keep going down 3 times a day after a GC timeout:

2012-07-16 10:13:13,362 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(172.17.2.22:50010, storageID=DS-554036718-127.0.0.1-50010-1318903052632, infoPort=50075, ipcPort=50020):DataXceiver
java.io.EOFException: while trying to read 65557 bytes
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:290)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:334)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:398)
        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:577)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:494)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:183)
-----------------------------------------------------------------------
2012-07-16 10:14:25,583 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(172.17.2.22:50010, storageID=DS-554036718-127.0.0.1-50010-1318903052632, infoPort=50075, ipcPort=50020):DataXceiver
java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write.
ch : java.nio.channels.SocketChannel[connected local=/172.17.2.22:50010 remote=/172.17.2.22:49590]
        at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
        at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
        at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:397)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:493)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:279)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:175)

I have not tried the flag on the RS yet, but I really want to solve the DN problems above!!

Guys, do you have any idea?

Thanks,
Pablo

-----Original Message-----
From: Laxman [mailto:lakshman...@huawei.com]
Sent: Thursday, July 12, 2012 01:22
To: Pablo Musa; user@hbase.apache.org
Subject: RE: Hmaster and HRegionServer disappearance

reason to ask

> > 1) Fix the direct memory usage to a fixed value -
> > XX:MaxDirectMemorySize=1G
>
> This flag should be on the RS or the DN?

We need to apply it to both, but the limit can be increased based on your load (maybe 2G). We can also apply it to all processes that show the following symptoms:

1) Allocated heap is a few GB (4 to 8)
2) VIRT/RES occupies double the heap (like 15GB) or even more
3) Long pauses in the GC log (even though the allocated heap is only <= 8GB)
4) Your application makes a lot of NIO/RMI calls (Ex: DataNode, RegionServer)

In our cluster we apply it to all server processes (NN, DN, HM, RS, JT, TT, ZooKeeper). The long pauses disappeared after we set this flag (esp. for DN and RS).

--
Regards,
Laxman
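For anyone following the thread, applying Laxman's flag to both daemons would look roughly like this. A hedged sketch: the env files are the standard Hadoop/HBase ones, and the 2G value simply mirrors the limit discussed above, not a verified tuning:

```shell
# hadoop-env.sh -- cap the DataNode's NIO direct-buffer usage.
# 2G follows the suggestion in this thread; tune for your own load.
export HADOOP_DATANODE_OPTS="$HADOOP_DATANODE_OPTS -XX:MaxDirectMemorySize=2G"

# hbase-env.sh -- same cap for the RegionServer, which is also NIO-heavy.
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -XX:MaxDirectMemorySize=2G"
```

Without this flag, HotSpot JVMs of that era default the direct-memory ceiling to roughly the maximum heap size, which is consistent with the "VIRT/RES occupies double the heap" symptom in Laxman's list.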