Is the uptime of the RS "normal"? No quick and global reboot that could lead to a region-reallocation storm?

On 3 September 2015 at 18:42, Akmal Abbasov <akmal.abba...@icloud.com> wrote:

> Hi Adrien,
> I’ve tried to run hdfs fsck and hbase hbck, and hdfs is healthy, also
> hbase is consistent.
> I’m using default value of the replication, so it is 3.
> There are some under replicated
> HBase master(node 10.10.8.55) is reading constantly from regionservers.
> Only today, it send >150.000 HDFS_READ requests to each regionserver so
> far, while the hbase cluster is almost idle.
> What could cause this kind of behaviour?
>
> p.s. each node in the cluster have 2 core, 4 gb ram, just in case.
>
> Thanks.
>
>
> On 03 Sep 2015, at 17:46, Adrien Mogenet <adrien.moge...@contentsquare.com>
> wrote:
>
> Is your HDFS healthy (fsck /)?
>
> Same for hbase hbck?
>
> What's your replication level?
>
> Can you see constant network use as well?
>
> Anything than might be triggered by the hbasemaster? (something like a
> virtually dead RS, due to ZK race-condition, etc.)
>
> Your 3-weeks-ago balancer shouldn't have any effect if you've ran a major
> compaction, successfully, yesterday.
>
> On 3 September 2015 at 16:32, Akmal Abbasov <akmal.abba...@icloud.com>
> wrote:
>
>> I’ve started HDFS balancer, but then stopped it immediately after knowing
>> that it is not a good idea.
>> but it was around 3 weeks ago, is it possible that it had an influence on
>> the cluster behaviour I’m having now?
>> Thanks.
>>
>> On 03 Sep 2015, at 14:23, Akmal Abbasov <akmal.abba...@icloud.com> wrote:
>>
>> Hi Ted,
>> No there is no short-circuit read configured.
>> The logs of datanode of the 10.10.8.55 are full of following messages
>> 2015-09-03 12:03:56,324 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
>> 10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 77, op: HDFS_READ,
>> cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID:
>> ee7d0634-89a3-4ada-a8ad-7848214397be, blockid:
>> BP-439084760-10.32.0.180-1387281790961:blk_1075349331_1612273, duration:
>> 276448307
>> 2015-09-03 12:03:56,494 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
>> 10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 538, op: HDFS_READ,
>> cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID:
>> ee7d0634-89a3-4ada-a8ad-7848214397be, blockid:
>> BP-439084760-10.32.0.180-1387281790961:blk_1075349334_1612276, duration:
>> 60550244
>> 2015-09-03 12:03:59,561 INFO
>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
>> 10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 455, op: HDFS_READ,
>> cliID: DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID:
>> ee7d0634-89a3-4ada-a8ad-7848214397be, blockid:
>> BP-439084760-10.32.0.180-1387281790961:blk_1075351814_1614757, duration:
>> 755613819
>> There are >100.000 of them just for today. The situation with other
>> regionservers are similar.
>> Node 10.10.8.53 is hbase-master node, and the process on the port is also
>> hbase-master.
>> So if there is no load on the cluster, why there are so much IO happening?
>> Any thoughts.
>> Thanks.
>>
>> On 02 Sep 2015, at 21:57, Ted Yu <yuzhih...@gmail.com> wrote:
>>
>> I assume you have enabled short-circuit read.
>>
>> Can you capture region server stack trace(s) and pastebin them ?
>>
>> Thanks
>>
>> On Wed, Sep 2, 2015 at 12:11 PM, Akmal Abbasov <akmal.abba...@icloud.com>
>> wrote:
>>
>>> Hi Ted,
>>> I’ve checked the time when addresses were changed, and this strange
>>> behaviour started weeks before it.
>>>
>>> yes, 10.10.8.55 is region server and 10.10.8.54 is a hbase master.
>>> any thoughts?
>>>
>>> Thanks
>>>
>>> On 02 Sep 2015, at 18:45, Ted Yu <yuzhih...@gmail.com> wrote:
>>>
>>> bq. change the ip addresses of the cluster nodes
>>>
>>> Did this happen recently ? If high iowait was observed after the change
>>> (you can look at ganglia graph), there is a chance that the change was
>>> related.
>>>
>>> BTW I assume 10.10.8.55 <http://10.10.8.55:50010/> is where your region
>>> server resides.
>>>
>>> Cheers
>>>
>>> On Wed, Sep 2, 2015 at 9:39 AM, Akmal Abbasov <akmal.abba...@icloud.com>
>>> wrote:
>>>
>>>> Hi Ted,
>>>> sorry forget to mention
>>>>
>>>> release of hbase / hadoop you're using
>>>>
>>>> hbase hbase-0.98.7-hadoop2, hadoop hadoop-2.5.1
>>>>
>>>> were region servers doing compaction ?
>>>>
>>>> I’ve run major compactions manually earlier today, but it seems that
>>>> they already completed, looking at the compactionQueueSize.
>>>>
>>>> have you checked region server logs ?
>>>>
>>>> The logs of datanode is full of this kind of messages
>>>> 2015-09-02 16:37:06,950 INFO
>>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /
>>>> 10.10.8.55:50010, dest: /10.10.8.54:32959, bytes: 19673, op:
>>>> HDFS_READ, cliID: DFSClient_NONMAPREDUCE_1225374853_1, offset: 0, srvID:
>>>> ee7d0634-89a3-4ada-a8ad-7848217327be, blockid:
>>>> BP-329084760-10.32.0.180-1387281790961:blk_1075277914_1540222, duration:
>>>> 7881815
>>>>
>>>> p.s. we had to change the ip addresses of the cluster nodes, is it
>>>> relevant?
>>>>
>>>> Thanks.
>>>>
>>>> On 02 Sep 2015, at 18:20, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>
>>>> Please provide some more information:
>>>>
>>>> release of hbase / hadoop you're using
>>>> were region servers doing compaction ?
>>>> have you checked region server logs ?
>>>>
>>>> Thanks
>>>>
>>>> On Wed, Sep 2, 2015 at 9:11 AM, Akmal Abbasov <akmal.abba...@icloud.com
>>>> > wrote:
>>>>
>>>>> Hi,
>>>>> I’m having strange behaviour in hbase cluster. It is almost idle, only
>>>>> <5 puts and gets.
>>>>> But the data in hdfs is increasing, and region servers have very high
>>>>> iowait(>100, in 2 core CPU).
>>>>> iotop shows that datanode process is reading and writing all the time.
>>>>> Any suggestions?
>>>>>
>>>>> Thanks.
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>
>
>
> --
>
> *Adrien Mogenet*
> Head of Backend/Infrastructure
> adrien.moge...@contentsquare.com
> (+33)6.59.16.64.22
> http://www.contentsquare.com
> 50, avenue Montaigne - 75008 Paris
>
>
>

--

*Adrien Mogenet*
Head of Backend/Infrastructure
adrien.moge...@contentsquare.com
(+33)6.59.16.64.22
http://www.contentsquare.com
50, avenue Montaigne - 75008 Paris