I read somewhere that 2.7 has a lot of issues, so you should wait for 2.7.1, where most of them are being addressed.
On Mon, Apr 27, 2015 at 2:14 PM, 조주일 <tjst...@kgrid.co.kr> wrote:

I think the cause of the heartbeat failures is that the nodes hang.
I found bug reports associated with this problem:

https://issues.apache.org/jira/browse/HDFS-7489
https://issues.apache.org/jira/browse/HDFS-7496
https://issues.apache.org/jira/browse/HDFS-7531
https://issues.apache.org/jira/browse/HDFS-8051

These have been fixed in 2.7.

I have no experience applying patches, and because the stability of 2.7 has not been confirmed, I cannot upgrade to it.

What do you recommend?

If I do apply a patch, how should I do it? Can I patch without service downtime?

-----Original Message-----
From: "Drake민영근" <drake....@nexr.com>
To: "user" <user@hadoop.apache.org>; "조주일" <tjst...@kgrid.co.kr>
Sent: 2015-04-24 (Fri) 17:41:59
Subject: Re: rolling upgrade(2.4.1 to 2.6.0) problem

Hi,

I think you are limited by "max user processes"; see
https://plumbr.eu/outofmemoryerror/unable-to-create-new-native-thread
In your case, the user cannot create more than 10240 processes. In our environment, the limit is more like 65000.

I think it's worth a try. And if the HDFS datanode daemon's user is not root, put the limits file in /etc/security/limits.d.

Thanks.

Drake 민영근 Ph.D
kt NexR

On Fri, Apr 24, 2015 at 5:15 PM, 조주일 <tjst...@kgrid.co.kr> wrote:

ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 62580
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 102400
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 10240
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

------------------------------------------------------

The Hadoop cluster was operating normally on version 2.4.1; it has problems on version 2.6.

For example, Slow BlockReceiver logs are often seen:
"org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write data to disk cost"

When a datanode fails and under-replicated blocks occur, the heartbeat checks of many other nodes fail as well. So I stop all nodes and start them all again; the cluster is then normalized.

In this regard, is there a difference between Hadoop versions 2.4 and 2.6?

-----Original Message-----
From: "Drake민영근" <drake....@nexr.com>
To: "user" <user@hadoop.apache.org>; "조주일" <tjst...@kgrid.co.kr>
Sent: 2015-04-24 (Fri) 16:58:46
Subject: Re: rolling upgrade(2.4.1 to 2.6.0) problem

Hi,

How about the ulimit settings of the user running the HDFS datanode?
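If that limit turns out to be low, it can be raised in /etc/security/limits.d. A minimal sketch, assuming the datanode runs as a user named "hdfs" (substitute your actual daemon user; the file name is hypothetical):

    # /etc/security/limits.d/90-hdfs.conf
    # Raise the process and open-file limits for the HDFS daemon user.
    hdfs  soft  nproc   65536
    hdfs  hard  nproc   65536
    hdfs  soft  nofile  102400
    hdfs  hard  nofile  102400

PAM applies these limits when a session starts, so the datanode must be restarted from a fresh login session to pick them up.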
Drake 민영근 Ph.D
kt NexR

On Wed, Apr 22, 2015 at 6:25 PM, 조주일 <tjst...@kgrid.co.kr> wrote:

I allocated 5 GB. I don't think OOM is the fundamental cause.

-----Original Message-----
From: "Han-Cheol Cho" <hancheol....@nhn-playart.com>
To: <user@hadoop.apache.org>
Sent: 2015-04-22 (Wed) 15:32:35
Subject: RE: rolling upgrade(2.4.1 to 2.6.0) problem

Hi,

The first warning shows an out-of-memory error of the JVM. Did you give the DataNode daemons enough max heap memory? DN daemons use a max heap size of 1 GB by default, so if your DN requires more than that, it will be in trouble.

You can check the memory consumption of your DN daemons (e.g., with the top command) and the max heap allocated to them by the -Xmx option (e.g., jps -lmv). If the max heap size is too small, you can use the HADOOP_DATANODE_OPTS variable (e.g., HADOOP_DATANODE_OPTS="-Xmx4g") to override it.

Best wishes,
Han-Cheol

-----Original Message-----
From: "조주일" <tjst...@kgrid.co.kr>
To: <user@hadoop.apache.org>
Sent: 2015-04-22 (Wed) 14:54:16
Subject: rolling upgrade(2.4.1 to 2.6.0) problem

My cluster:
Hadoop 2.4.1
Capacity: 1.24 PB
Used: 1.1 PB
16 datanodes
Node capacities of 65 TB, 96 TB, 80 TB, etc.

I proceeded with a rolling upgrade from 2.4.1 to 2.6.0. Upgrading one datanode takes about 40 minutes, and under-replicated blocks occur while the upgrade is in progress.

10 nodes completed the upgrade to 2.6.0. A problem appeared at some point during the rolling upgrade of the remaining nodes: heartbeats from many nodes (2.6.0 only) failed.

I changed the following properties, but that did not fix the problem:

dfs.datanode.handler.count = 100 ---> 300, 400, 500
dfs.datanode.max.transfer.threads = 4096 ---> 8000, 10000

My theory:
1. Something causes a delay in processing threads; I think it may be block replication between the different versions.
2. Because of that, many more handlers and xceivers become necessary.
3. That leads to an out-of-memory error, or a problem arises on the datanode.
4. Heartbeats fail, and the datanode dies.

I found the datanode error logs below, but I cannot determine the cause from them. I still think the cause is block replication between the different versions.

Someone, please help me!!

DATANODE LOG
--------------------------------------------------------------------------
### I saw a few thousand CLOSE_WAIT connections on the datanode.

org.apache.hadoop.hdfs.server.datanode.DataNode: Slow BlockReceiver write packet to mirror took 1207ms (threshold=300ms)

2015-04-21 22:46:01,772 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DataNode is out of memory. Will retry in 30 seconds.
java.lang.OutOfMemoryError: unable to create new native thread
        at java.lang.Thread.start0(Native Method)
        at java.lang.Thread.start(Thread.java:640)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiverServer.run(DataXceiverServer.java:145)
        at java.lang.Thread.run(Thread.java:662)
2015-04-21 22:49:45,378 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: datanode-192.168.1.207:40010:DataXceiverServer:java.io.IOException: Xceiver count 8193 exceeds the limit of concurrent xcievers: 8192
        at org.apache.hadoop.hdfs.server.datanode.DataXceiverServer.run(DataXceiverServer.java:140)
        at java.lang.Thread.run(Thread.java:662)
2015-04-22 01:01:25,632 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: datanode-192.168.1.207:40010:DataXceiverServer:java.io.IOException: Xceiver count 8193 exceeds the limit of concurrent xcievers: 8192
        at org.apache.hadoop.hdfs.server.datanode.DataXceiverServer.run(DataXceiverServer.java:140)
        at java.lang.Thread.run(Thread.java:662)
2015-04-22 03:49:44,125 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: datanode-192.168.1.204:40010:DataXceiver error processing READ_BLOCK operation src: /192.168.2.174:45606 dst: /192.168.1.204:40010
java.io.IOException: cannot find BPOfferService for bpid=BP-1770955034-0.0.0.0-1401163460236
        at org.apache.hadoop.hdfs.server.datanode.DataNode.getDNRegistrationForBP(DataNode.java:1387)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:470)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReadBlock(Receiver.java:116)
        at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:71)
        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:235)
        at java.lang.Thread.run(Thread.java:662)
2015-04-22 05:30:28,947 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(192.168.1.203, datanodeUuid=654f22ef-84b3-4ecb-a959-2ea46d817c19, infoPort=40075, ipcPort=40020, storageInfo=lv=-56;cid=CID-CLUSTER;nsid=239138164;c=1404883838982):Failed to transfer BP-1770955034-0.0.0.0-1401163460236:blk_1075354042_1613403 to 192.168.2.156:40010 got
java.net.SocketException: Original Exception : java.io.IOException: Connection reset by peer
        at sun.nio.ch.FileChannelImpl.transferTo0(Native Method)
        at sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:405)
        at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:506)
        at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:223)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendPacket(BlockSender.java:559)
        at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:728)
        at org.apache.hadoop.hdfs.server.datanode.DataNode$DataTransfer.run(DataNode.java:2017)
        at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.IOException: Connection reset by peer
        ... 8 more

--
Nitin Pawar
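For reference, a sketch of how the settings discussed in this thread are applied, using the values the posters mention (illustrative only, not tuned recommendations).

In hadoop-env.sh, per Han-Cheol's heap suggestion:

    # Raise the DataNode max heap; 4g is the example value from this thread.
    export HADOOP_DATANODE_OPTS="-Xmx4g"

In hdfs-site.xml, the two properties 조주일 experimented with:

    <property>
      <name>dfs.datanode.handler.count</name>
      <value>300</value> <!-- the poster tried 300, 400, and 500 -->
    </property>
    <property>
      <name>dfs.datanode.max.transfer.threads</name>
      <value>8192</value> <!-- default is 4096; the poster tried 8000 and 10000 -->
    </property>

Note that raising dfs.datanode.max.transfer.threads only moves the "Xceiver count exceeds the limit" ceiling; if threads cannot be created because of the user's nproc limit, the OutOfMemoryError above will recur regardless of this value.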