Hi,

Cluster details: HBase 0.90.2, 10 machines, 1 Gb/s switch.
Use case: an M/R job that inserts about 10 million rows into HBase in the reducer, followed by an M/R job that works with HDFS files. When the maps of the first job finish, the maps of the second job start and a region server crashes. Please note that when running the two jobs separately, they both finish successfully. From our monitoring we see that when the two jobs run together, the network load reaches our maximum bandwidth (1 Gb/s).

In the region server log we see these exceptions:

a.

2011-08-14 18:37:36,263 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server Responder, call multi(org.apache.hadoop.hbase.client.MultiAction@491fb2f4) from 10.11.87.73:33737: output error
2011-08-14 18:37:36,264 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server handler 24 on 8041 caught: java.nio.channels.ClosedChannelException
        at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
        at org.apache.hadoop.hbase.ipc.HBaseServer.channelIO(HBaseServer.java:1387)
        at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1339)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:727)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:792)
        at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1083)

b.

2011-08-14 18:41:56,225 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_-8181634225601608891_579246 java.io.EOFException
        at java.io.DataInputStream.readFully(DataInputStream.java:180)
        at java.io.DataInputStream.readLong(DataInputStream.java:399)
        at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$PipelineAck.readFields(DataTransferProtocol.java:122)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2548)

c.
2011-08-14 18:42:02,960 WARN org.apache.hadoop.hdfs.DFSClient: Failed recovery attempt #0 from primary datanode 10.11.87.72:50010
org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.ipc.RemoteException: java.io.IOException: blk_-8181634225601608891_579246 is already commited, storedBlock == null.
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.nextGenerationStampForBlock(FSNamesystem.java:4877)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.nextGenerationStamp(NameNode.java:501)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:961)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:957)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:955)

        at org.apache.hadoop.ipc.Client.call(Client.java:740)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
        at $Proxy4.nextGenerationStamp(Unknown Source)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.syncBlock(DataNode.java:1577)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1551)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1617)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:961)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:957)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:955)

        at org.apache.hadoop.ipc.Client.call(Client.java:740)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
        at $Proxy9.recoverBlock(Unknown Source)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2706)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1500(DFSClient.java:2173)
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2372)

A few questions:

1. Can we configure Hadoop/HBase not to consume all network resources (e.g., specify an upper limit for map/reduce network load)?
2. Should we increase the timeout for open connections?
3. Can we assign different IPs for HDFS data transfer and for the ZooKeeper quorum protocol?

Thanks,
Lior
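P.S. To make question 2 concrete: we are considering raising the HDFS client/DataNode socket timeouts along these lines in hdfs-site.xml. The property names are the standard ones from our Hadoop 0.20-era hdfs-default.xml; the values are our own guesses and have not been tested:

```xml
<!-- hdfs-site.xml (sketch): proposed socket timeout increases.
     Values below are guesses on our part, not recommendations. -->
<property>
  <name>dfs.socket.timeout</name>
  <!-- read timeout in ms; default is 60000 (60 s) -->
  <value>180000</value>
</property>
<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <!-- write timeout in ms; default is 480000 (8 min) -->
  <value>960000</value>
</property>
```

Would bumping these merely hide the saturation problem, or is it a reasonable mitigation while both jobs compete for the 1 Gb/s link?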