Re: rolling upgrade(2.4.1 to 2.6.0) problem
I think the cause of the heartbeat failures is that the nodes hang. I found bug reports associated with this problem:

https://issues.apache.org/jira/browse/HDFS-7489
https://issues.apache.org/jira/browse/HDFS-7496
https://issues.apache.org/jira/browse/HDFS-7531
https://issues.apache.org/jira/browse/HDFS-8051

These have been fixed in 2.7. I have no experience applying patches, and because the stability of 2.7 has not been confirmed, I cannot upgrade to 2.7. What do you recommend? If I do apply a patch, how should I do it? Can I patch without service downtime?

-----Original Message-----
From: Drake민영근 <drake@nexr.com>
To: user <user@hadoop.apache.org>; 조주일 <tjst...@kgrid.co.kr>;
Cc:
Sent: 2015-04-24 (Fri) 17:41:59
Subject: Re: rolling upgrade(2.4.1 to 2.6.0) problem

Hi,

I think you are limited by max user processes. See this: https://plumbr.eu/outofmemoryerror/unable-to-create-new-native-thread

In your case, the user cannot create more than 10240 processes. In our environment, the limit is more like 65000. I think it's worth a try. And, if the hdfs datanode daemon's user is not root, set the limit file in /etc/security/limits.d.

Thanks.

Drake 민영근 Ph.D
kt NexR
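If you follow Drake's suggestion, a minimal sketch of raising the nproc limit for the datanode user, assuming it runs as a user named hdfs (the file name and the 65536 value are just examples):

    $ cat /etc/security/limits.d/99-hdfs-nproc.conf
    hdfs    soft    nproc    65536
    hdfs    hard    nproc    65536
    $ # log in again as the hdfs user (or restart the datanode from a fresh session) and verify:
    $ ulimit -u
    65536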
Re: rolling upgrade(2.4.1 to 2.6.0) problem
ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 62580
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 102400
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 10240
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

--

The Hadoop cluster was operating normally on version 2.4.1; the problems appeared with version 2.6. For example, Slow BlockReceiver logs are seen often:

org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write data to disk cost

If a datanode fails and under-replicated blocks occur, the heartbeat checks of many other nodes fail as well. So I stop all nodes and then start all nodes, and the cluster is then normalized. In this regard, is there a difference between Hadoop versions 2.4 and 2.6?

-----Original Message-----
From: Drake민영근 <drake@nexr.com>
To: user <user@hadoop.apache.org>; 조주일 <tjst...@kgrid.co.kr>;
Cc:
Sent: 2015-04-24 (Fri) 16:58:46
Subject: Re: rolling upgrade(2.4.1 to 2.6.0) problem

Hi,

How about the ulimit setting of the user for the hdfs datanode?

Drake 민영근 Ph.D
kt NexR
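One hedged way to check whether the datanode is actually hitting that per-user limit (assumes jps, ps and a Linux host; run it as the user that owns the DataNode process):

    $ DN_PID=$(jps | awk '/DataNode/ {print $1}')
    $ ps -o nlwp= -p "$DN_PID"   # current thread count of the DataNode JVM
    $ ulimit -u                  # per-user process/thread limit (10240 above)
    # threads count against "max user processes", so if the first number approaches the
    # second, "unable to create new native thread" errors are expected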
Downgrade without downtime is not possible?
http://hadoop.apache.org/docs/r2.6.0/hadoop-project-dist/hadoop-hdfs/HdfsRollingUpgrade.html

Rolling upgrade: 2.4.1 to 2.6.0
Downgrade without downtime: 2.6.0 to 2.4.1

Is a downgrade without downtime not possible? The datanodes log the following:

org.apache.hadoop.hdfs.server.datanode.DataNode: Reported NameNode version '2.6.0' does not match DataNode version '2.4.1' but is within acceptable limits. Note: This is normal during a rolling upgrade.
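The page linked above describes a downgrade path for rolling upgrades that have not been finalized yet; a hedged sketch of the per-datanode step (the host and IPC port below are placeholders taken from the logs in this thread):

    $ hdfs dfsadmin -rollingUpgrade query        # downgrade is only an option before finalization
    $ hdfs dfsadmin -shutdownDatanode datanode-192.168.1.207:40020 upgrade
    $ hdfs dfsadmin -getDatanodeInfo datanode-192.168.1.207:40020   # repeat until the DN no longer answers
    # restart that datanode on the 2.4.1 software, then continue node by node;
    # see the NameNode section of the linked document for the NN-side downgrade steps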
RE: rolling upgrade(2.4.1 to 2.6.0) problem
I allocated 5G. I think OOM is essentially not the cause.

-----Original Message-----
From: Han-Cheol Cho <hancheol@nhn-playart.com>
To: <user@hadoop.apache.org>;
Cc:
Sent: 2015-04-22 (Wed) 15:32:35
Subject: RE: rolling upgrade(2.4.1 to 2.6.0) problem

Hi,

The first warning shows an out-of-memory error of the JVM. Did you give enough max heap memory to the DataNode daemons? DN daemons use a max heap size of 1GB by default, so if your DN requires more than that, it will be in trouble.

You can check the memory consumption of your DN daemons (e.g., with the top command) and the heap allocated to them by the -Xmx option (e.g., jps -lmv). If the max heap size is too small, you can use the HADOOP_DATANODE_OPTS variable (e.g., HADOOP_DATANODE_OPTS=-Xmx4g) to override it.

Best wishes,
Han-Cheol
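A minimal sketch of Han-Cheol's override, assuming the stock hadoop-env.sh mechanism (the 4g figure is only an illustration; size it to what top / jps -lmv show your DN actually needs):

    # in $HADOOP_CONF_DIR/hadoop-env.sh on each datanode
    export HADOOP_DATANODE_OPTS="-Xmx4g $HADOOP_DATANODE_OPTS"
    # restart the datanode, then confirm the effective -Xmx:
    $ jps -lmv | grep -i datanode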
rolling upgrade(2.4.1 to 2.6.0) problem
My cluster is:
hadoop 2.4.1
Capacity: 1.24PB (1.1PB used)
16 datanodes; each node has a capacity of 65TB, 96TB, 80TB, etc.

I had to proceed with a rolling upgrade from 2.4.1 to 2.6.0. Upgrading one datanode takes about 40 minutes, and under-replicated blocks occur while the upgrade is in progress. 10 nodes completed the upgrade to 2.6.0, then a problem occurred at some point during the rolling upgrade of the remaining nodes: heartbeats of many nodes (2.6.0 nodes only) failed.

I changed the following attributes, but it did not fix the problem:
dfs.datanode.handler.count = 100 -> 300, 400, 500
dfs.datanode.max.transfer.threads = 4096 -> 8000, 1

What I think is happening:
1. Something causes a delay in processing threads. I think it may be because of block replication between the different versions.
2. Because of that, many more handlers and xceivers become necessary.
3. Then an out-of-memory error occurs, or the problem arises on a datanode.
4. Heartbeats fail, and the datanode dies.

I found the datanode error logs below; however, I cannot determine the cause from them. I think it is caused by block replication between the different versions. Can someone help me?!

DATANODE LOG
------------
### I observed a few thousand CLOSE_WAIT connections on the datanode.

org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write packet to mirror took 1207ms (threshold=300ms)

2015-04-21 22:46:01,772 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DataNode is out of memory. Will retry in 30 seconds.
java.lang.OutOfMemoryError: unable to create new native thread
        at java.lang.Thread.start0(Native Method)
        at java.lang.Thread.start(Thread.java:640)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiverServer.run(DataXceiverServer.java:145)
        at java.lang.Thread.run(Thread.java:662)

2015-04-21 22:49:45,378 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: datanode-192.168.1.207:40010:DataXceiverServer:java.io.IOException: Xceiver count 8193 exceeds the limit of concurrent xcievers: 8192
        at org.apache.hadoop.hdfs.server.datanode.DataXceiverServer.run(DataXceiverServer.java:140)
        at java.lang.Thread.run(Thread.java:662)

2015-04-22 01:01:25,632 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: datanode-192.168.1.207:40010:DataXceiverServer:java.io.IOException: Xceiver count 8193 exceeds the limit of concurrent xcievers: 8192
        at org.apache.hadoop.hdfs.server.datanode.DataXceiverServer.run(DataXceiverServer.java:140)
        at java.lang.Thread.run(Thread.java:662)

2015-04-22 03:49:44,125 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: datanode-192.168.1.204:40010:DataXceiver error processing READ_BLOCK operation src: /192.168.2.174:45606 dst: /192.168.1.204:40010
java.io.IOException: cannot find BPOfferService for bpid=BP-1770955034-0.0.0.0-1401163460236
        at org.apache.hadoop.hdfs.server.datanode.DataNode.getDNRegistrationForBP(DataNode.java:1387)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:470)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:116)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:235)
        at java.lang.Thread.run(Thread.java:662)

2015-04-22 05:30:28,947 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.1.203, datanodeUuid=654f22ef-84b3-4ecb-a959-2ea46d817c19, infoPort=40075, ipcPort=40020, storageInfo=lv=-56;cid=CID-CLUSTER;nsid=239138164;c=1404883838982):Failed to transfer BP-1770955034-0.0.0.0-1401163460236:blk_1075354042_1613403 to 192.168.2.156:40010 got
java.net.SocketException: Original Exception : java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileChannelImpl.transferTo0(Native Method)
        at sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:405)
        at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:506)
        at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:223)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:559)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:728)
        at org.apache.hadoop.hdfs.server.datanode.DataNode$DataTransfer.run(DataNode.java:2017)
        at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: Connection reset by peer
        ... 8 more
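Two hedged checks that might help narrow this down (assumes netstat and the hdfs client are available on the datanode host; 40010 is the data-transfer port from the logs above):

    # how many sockets the datanode currently holds in CLOSE_WAIT on the data port
    $ netstat -tan | grep ':40010' | grep -c CLOSE_WAIT
    # confirm the raised values are what the datanode actually sees
    $ hdfs getconf -confKey dfs.datanode.handler.count
    $ hdfs getconf -confKey dfs.datanode.max.transfer.threads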
How quickly can I increase the number of replicas?
My cluster runs HDFS 2.2 stable (2 HA namenodes, 10 datanodes). I ran the command bin/hdfs dfs -setrep -R 2 / (replication factor 1 to 2). I found that HDFS is actually replicating the under-replicated blocks, but it works very slowly: HDFS replicates about 1 block per second. I have about 40 under-replicated blocks, so it will take about 4 more days. Is there any way to speed it up?
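A hedged way to watch the progress instead of estimating by hand (run on or against the active NameNode):

    # under-replicated block count as the NameNode sees it
    $ hdfs dfsadmin -report | grep -i "under replicated"
    # or the fsck summary (can be heavy on a large namespace)
    $ hdfs fsck / | grep -i "under-replicated"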
how to speed up replication
Hi. My cluster has 2 HA namenodes and 8 datanodes. 506,803 under-replicated blocks have occurred. About 1,000 blocks are replicated every 10 minutes, generating about 600 megabytes of traffic per server, so it will take much longer until this completes. How can I increase the replication rate?
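The settings below are the usual knobs for re-replication throughput; this is only a sketch under the assumption that your release supports them (check hdfs-default.xml for your exact version; they are NameNode-side settings that go into hdfs-site.xml on the NameNodes):

    # show the values currently in effect (getconf reports the key as missing if your
    # release does not know it)
    $ hdfs getconf -confKey dfs.namenode.replication.work.multiplier.per.iteration
    $ hdfs getconf -confKey dfs.namenode.replication.max-streams
    # raising them (e.g., 2 -> 10) lets the NameNode schedule more block replications
    # per heartbeat interval and per datanode; consider reverting after recovery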
hdfs 2.2 federation
ns01 = HA (namenode194: active, namenode195: standby)
ns02 = HA (namenode196: active, namenode197: standby)

I ran the following command (-safemode enter) on the namenode195 server:

$HADOOP_PREFIX/bin/hdfs dfsadmin -safemode enter

I was expecting safe mode to change on both namenode194 and namenode195; however, only namenode195 changed. What's the problem?
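A hedged sketch of addressing each NameNode explicitly (the host names are yours; the 8020 RPC port is an assumption, substitute your dfs.namenode.rpc-address values):

    # dfsadmin talks to a single NameNode, so put both NNs of ns01 into safe mode one by one
    $ hdfs dfsadmin -fs hdfs://namenode194:8020 -safemode enter
    $ hdfs dfsadmin -fs hdfs://namenode195:8020 -safemode enter
    # and confirm:
    $ hdfs dfsadmin -fs hdfs://namenode194:8020 -safemode get
    $ hdfs dfsadmin -fs hdfs://namenode195:8020 -safemode get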
namenode log Inconsistent size
This occurs when uploading. In what situations are these logs generated? Is it a dangerous problem?

* hadoop version: 1.1.2

* namenode log
2014-04-17 09:30:34,280 INFO namenode.FSNamesystem (FSNamesystem.java:commitBlockSynchronization(2374)) - commitBlockSynchronization(lastblock=blk_-8030112303433878733_9684823, newgenerationstamp=9684843, newlength=134079488, newtargets=[10.1.1.83:40010], closeFile=false, deleteBlock=false)
2014-04-17 09:30:34,281 INFO namenode.FSNamesystem (FSNamesystem.java:commitBlockSynchronization(2449)) - commitBlockSynchronization(blk_-8030112303433878733_9684843) successful
2014-04-17 09:30:34,293 WARN namenode.FSNamesystem (FSNamesystem.java:addStoredBlock(3772)) - Inconsistent size for block blk_-8030112303433878733_9684843 reported from 10.1.1.83:40010 current size is 134079488 reported size is 134217728

* upload client application log
2014-04-17 05:25:52,264 Error Recovery for block blk_3429074513675384087_9643926 bad datanode[1] 10.1.1.89:40010
2014-04-17 05:25:52,265 Error Recovery for block blk_3429074513675384087_9643926 in pipeline 10.1.1.88:40010, 10.1.1.89:40010: bad datanode 10.1.1.89:40010
2014-04-17 08:00:25,687 DFSOutputStream ResponseProcessor exception for block blk_5524267667730620785_9667404 java.io.IOException: Bad response 1 for block blk_5524267667730620785_9667404 from datanode 10.1.1.89:40010
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:3137)
2014-04-17 08:00:25,693 Error Recovery for block blk_5524267667730620785_9667404 bad datanode[1] 10.1.1.89:40010
2014-04-17 08:00:25,693 Error Recovery for block blk_5524267667730620785_9667404 in pipeline 10.1.1.88:40010, 10.1.1.89:40010: bad datanode 10.1.1.89:40010
2014-04-17 09:39:31,592 DFSOutputStream ResponseProcessor exception for block blk_4983705690115250466_9687203 java.io.IOException: Bad response 1 for block blk_4983705690115250466_9687203 from datanode 10.1.1.89:40010
        at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:3137)
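One hedged way to check whether the affected file actually ended up corrupt or short (the path is a placeholder; point fsck at the directory you were uploading into):

    # hadoop 1.x syntax: list files, their blocks and replica locations, so you can find
    # the file owning blk_-8030112303433878733 and verify its final length
    $ hadoop fsck /path/being/uploaded -files -blocks -locations
    # a healthy run ends with: The filesystem under path '...' is HEALTHY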