I read somewhere that 2.7 has a lot of issues, so you should wait for 2.7.1, where most of them are being addressed.
On Mon, Apr 27, 2015 at 2:14 PM, 조주일 <tjst...@kgrid.co.kr> wrote:

I think the cause of the heartbeat failures is that the nodes hang.
I found bug reports associated with this problem:

https://issues.apache.org/jira/browse/HDFS-7489
https://issues.apache.org/jira/browse/HDFS-7496
https://issues.apache.org/jira/browse/HDFS-7531
https://issues.apache.org/jira/browse/HDFS-8051

These have been fixed in 2.7.

I have no experience applying patches, and because the stability of 2.7 has not been confirmed, I cannot upgrade to it.

What do you recommend?

If I do apply a patch, how should I do it? Can I patch without service downtime?

-----Original Message-----
From: "Drake민영근" <drake....@nexr.com>
To: "user" <user@hadoop.apache.org>; "조주일" <tjst...@kgrid.co.kr>
Sent: 2015-04-24 (Fri) 17:41:59
Subject: Re: rolling upgrade(2.4.1 to 2.6.0) problem

Hi,

I think you are limited by "max user processes"; see
https://plumbr.eu/outofmemoryerror/unable-to-create-new-native-thread
In your case, the user cannot create more than 10240 processes. In our environment, the limit is more like 65000.

I think it's worth a try. And if the HDFS datanode daemon's user is not root, put the limits file in /etc/security/limits.d.

Thanks.

Drake 민영근 Ph.D
kt NexR

On Fri, Apr 24, 2015 at 5:15 PM, 조주일 <tjst...@kgrid.co.kr> wrote:

ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 62580
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 102400
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 10240
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

------------------------------------------------------

The Hadoop cluster was operating normally on version 2.4.1; it has problems on version 2.6.

For example, Slow BlockReceiver logs are often seen:
"org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write data to disk cost"

When a datanode fails and under-replicated blocks occur, the heartbeat checks of many other nodes fail as well. So I stop all nodes and start them all again; the cluster is then normalized.

In this regard, is there a difference between Hadoop versions 2.4 and 2.6?

-----Original Message-----
From: "Drake민영근" <drake....@nexr.com>
To: "user" <user@hadoop.apache.org>; "조주일" <tjst...@kgrid.co.kr>
Sent: 2015-04-24 (Fri) 16:58:46
Subject: Re: rolling upgrade(2.4.1 to 2.6.0) problem

Hi,

How about the ulimit settings of the user running the HDFS datanode?
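If that limit turns out to be low, it can be raised in /etc/security/limits.d. A minimal sketch, assuming the datanode runs as a user named "hdfs" (substitute your actual daemon user; the file name is hypothetical):

    # /etc/security/limits.d/90-hdfs.conf
    # Raise the process and open-file limits for the HDFS daemon user.
    hdfs  soft  nproc   65536
    hdfs  hard  nproc   65536
    hdfs  soft  nofile  102400
    hdfs  hard  nofile  102400

PAM applies these limits when a session starts, so the datanode must be restarted from a fresh login session to pick them up.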
Drake 민영근 Ph.D
kt NexR

On Wed, Apr 22, 2015 at 6:25 PM, 조주일 <tjst...@kgrid.co.kr> wrote:

I allocated 5 GB. I don't think OOM is the fundamental cause.

-----Original Message-----
From: "Han-Cheol Cho" <hancheol....@nhn-playart.com>
To: <user@hadoop.apache.org>
Sent: 2015-04-22 (Wed) 15:32:35
Subject: RE: rolling upgrade(2.4.1 to 2.6.0) problem

Hi,

The first warning shows an out-of-memory error of the JVM. Did you give the DataNode daemons enough max heap memory? DN daemons use a max heap size of 1 GB by default, so if your DN requires more than that, it will be in trouble.

You can check the memory consumption of your DN daemons (e.g., with the top command) and the max heap allocated to them by the -Xmx option (e.g., jps -lmv). If the max heap size is too small, you can use the HADOOP_DATANODE_OPTS variable (e.g., HADOOP_DATANODE_OPTS="-Xmx4g") to override it.

Best wishes,
Han-Cheol

-----Original Message-----
From: "조주일" <tjst...@kgrid.co.kr>
To: <user@hadoop.apache.org>
Sent: 2015-04-22 (Wed) 14:54:16
Subject: rolling upgrade(2.4.1 to 2.6.0) problem

My cluster:
Hadoop 2.4.1
Capacity: 1.24 PB
Used: 1.1 PB
16 datanodes
Node capacities of 65 TB, 96 TB, 80 TB, etc.

I proceeded with a rolling upgrade from 2.4.1 to 2.6.0. Upgrading one datanode takes about 40 minutes, and under-replicated blocks occur while the upgrade is in progress.

10 nodes completed the upgrade to 2.6.0. A problem appeared at some point during the rolling upgrade of the remaining nodes: heartbeats from many nodes (2.6.0 only) failed.

I changed the following properties, but that did not fix the problem:

dfs.datanode.handler.count = 100 ---> 300, 400, 500
dfs.datanode.max.transfer.threads = 4096 ---> 8000, 10000

My theory:
1. Something causes a delay in processing threads; I think it may be block replication between the different versions.
2. Because of that, many more handlers and xceivers become necessary.
3. That leads to an out-of-memory error, or a problem arises on the datanode.
4. Heartbeats fail, and the datanode dies.

I found the datanode error logs below, but I cannot determine the cause from them. I still think the cause is block replication between the different versions.

Someone, please help me!!

DATANODE LOG
--------------------------------------------------------------------------
### I saw a few thousand CLOSE_WAIT connections on the datanode.

org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write packet to mirror took 1207ms (threshold=300ms)

2015-04-21 22:46:01,772 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DataNode is out of memory. Will retry in 30 seconds.
java.lang.OutOfMemoryError: unable to create new native thread
        at java.lang.Thread.start0(Native Method)
        at java.lang.Thread.start(Thread.java:640)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiverServer.run(DataXceiverServer.java:145)
        at java.lang.Thread.run(Thread.java:662)
2015-04-21 22:49:45,378 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: datanode-192.168.1.207:40010:DataXceiverServer:java.io.IOException: Xceiver count 8193 exceeds the limit of concurrent xcievers: 8192
        at org.apache.hadoop.hdfs.server.datanode.DataXceiverServer.run(DataXceiverServer.java:140)
        at java.lang.Thread.run(Thread.java:662)
2015-04-22 01:01:25,632 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: datanode-192.168.1.207:40010:DataXceiverServer:java.io.IOException: Xceiver count 8193 exceeds the limit of concurrent xcievers: 8192
        at org.apache.hadoop.hdfs.server.datanode.DataXceiverServer.run(DataXceiverServer.java:140)
        at java.lang.Thread.run(Thread.java:662)
2015-04-22 03:49:44,125 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: datanode-192.168.1.204:40010:DataXceiver error processing READ_BLOCK operation src: /192.168.2.174:45606 dst: /192.168.1.204:40010
java.io.IOException: cannot find BPOfferService for bpid=BP-1770955034-0.0.0.0-1401163460236
        at org.apache.hadoop.hdfs.server.datanode.DataNode.getDNRegistrationForBP(DataNode.java:1387)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:470)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:116)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:235)
        at java.lang.Thread.run(Thread.java:662)
2015-04-22 05:30:28,947 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.1.203, datanodeUuid=654f22ef-84b3-4ecb-a959-2ea46d817c19, infoPort=40075, ipcPort=40020, storageInfo=lv=-56;cid=CID-CLUSTER;nsid=239138164;c=1404883838982):Failed to transfer BP-1770955034-0.0.0.0-1401163460236:blk_1075354042_1613403 to 192.168.2.156:40010 got
java.net.SocketException: Original Exception : java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileChannelImpl.transferTo0(Native Method)
        at sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:405)
        at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:506)
        at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:223)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:559)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:728)
        at org.apache.hadoop.hdfs.server.datanode.DataNode$DataTransfer.run(DataNode.java:2017)
        at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: Connection reset by peer
        ... 8 more

--
Nitin Pawar
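For reference, a sketch of how the settings discussed in this thread are applied, using the values the posters mention (illustrative only, not tuned recommendations).

In hadoop-env.sh, per Han-Cheol's heap suggestion:

    # Raise the DataNode max heap; 4g is the example value from this thread.
    export HADOOP_DATANODE_OPTS="-Xmx4g"

In hdfs-site.xml, the two properties 조주일 experimented with:

    <property>
      <name>dfs.datanode.handler.count</name>
      <value>300</value> <!-- the poster tried 300, 400, and 500 -->
    </property>
    <property>
      <name>dfs.datanode.max.transfer.threads</name>
      <value>8192</value> <!-- default is 4096; the poster tried 8000 and 10000 -->
    </property>

Note that raising dfs.datanode.max.transfer.threads only moves the "Xceiver count exceeds the limit" ceiling; if threads cannot be created because of the user's nproc limit, the OutOfMemoryError above will recur regardless of this value.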