Re: High iowait in idle hbase cluster

Akmal Abbasov Thu, 03 Sep 2015 09:43:53 -0700

Hi Adrien,
I’ve run hdfs fsck and hbase hbck: HDFS is healthy and HBase is consistent.
I’m using the default replication factor, so it is 3.
There are some under replicated 
The HBase master (node 10.10.8.55) is constantly reading from the 
regionservers. Today alone it has sent more than 150,000 HDFS_READ requests to 
each regionserver so far, while the HBase cluster is almost idle.
What could cause this kind of behaviour?
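
For anyone wanting to reproduce these counts, here is a minimal sketch of how the HDFS_READ clienttrace entries can be tallied per destination and client. It assumes each entry sits on a single line in the datanode log, with the same field layout as the excerpts quoted in this thread. The function name summarize_reads and the log path passed on the command line are illustrative, not part of Hadoop.

```python
import re
import sys
from collections import Counter

# Field layout taken from the DataNode clienttrace excerpts quoted in this
# thread; other Hadoop versions may format these entries differently.
CLIENTTRACE = re.compile(
    r"clienttrace: src: /(?P<src>[\d.]+):\d+, dest: /(?P<dest>[\d.]+):\d+, "
    r"bytes: (?P<bytes>\d+), op: (?P<op>\w+), cliID: (?P<client>[^,]+),"
)


def summarize_reads(lines):
    """Map (dest_ip, client_id) to (request_count, total_bytes) for HDFS_READ."""
    counts = Counter()
    volume = Counter()
    for line in lines:
        match = CLIENTTRACE.search(line)
        if match is None or match.group("op") != "HDFS_READ":
            continue  # not a clienttrace read entry
        key = (match.group("dest"), match.group("client"))
        counts[key] += 1
        volume[key] += int(match.group("bytes"))
    return {key: (counts[key], volume[key]) for key in counts}


if __name__ == "__main__":
    # Hypothetical usage: python summarize_reads.py <path to a datanode log>
    if len(sys.argv) > 1:
        with open(sys.argv[1], errors="replace") as log:
            summary = summarize_reads(log)
        # Busiest (destination, client) pairs first.
        for (dest, client), (n, total) in sorted(
            summary.items(), key=lambda item: item[1][0], reverse=True
        ):
            print(f"{dest}\t{client}\t{n} reads\t{total} bytes")
```

Run against each datanode log, this should show whether the reads really come from a single client (the cliID) on the master node, and how many bytes they add up to.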


P.S. Each node in the cluster has 2 cores and 4 GB of RAM, just in case.

Thanks.


> On 03 Sep 2015, at 17:46, Adrien Mogenet <adrien.moge...@contentsquare.com> 
> wrote:
> 
> Is your HDFS healthy (fsck /)?
> 
> Same for hbase hbck?
> 
> What's your replication level?
> 
> Can you see constant network use as well?
> 
> Anything that might be triggered by the hbasemaster? (something like a 
> virtually dead RS, due to ZK race-condition, etc.)
> 
> Your 3-weeks-ago balancer shouldn't have any effect if you ran a major 
> compaction, successfully, yesterday.
> 
> On 3 September 2015 at 16:32, Akmal Abbasov <akmal.abba...@icloud.com> wrote:
> I started the HDFS balancer, but stopped it immediately after learning 
> that it is not a good idea.
> That was around 3 weeks ago. Is it possible that it influenced the 
> cluster behaviour I’m seeing now?
> Thanks.
> 
>> On 03 Sep 2015, at 14:23, Akmal Abbasov <akmal.abba...@icloud.com> wrote:
>> 
>> Hi Ted,
>> No, there is no short-circuit read configured.
>> The datanode logs on 10.10.8.55 are full of messages like the following:
>> 2015-09-03 12:03:56,324 INFO 
>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: 
>> /10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 77, op: HDFS_READ, cliID: 
>> DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID: 
>> ee7d0634-89a3-4ada-a8ad-7848214397be, blockid: 
>> BP-439084760-10.32.0.180-1387281790961:blk_1075349331_1612273, duration: 
>> 276448307
>> 2015-09-03 12:03:56,494 INFO 
>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: 
>> /10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 538, op: HDFS_READ, cliID: 
>> DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID: 
>> ee7d0634-89a3-4ada-a8ad-7848214397be, blockid: 
>> BP-439084760-10.32.0.180-1387281790961:blk_1075349334_1612276, duration: 
>> 60550244
>> 2015-09-03 12:03:59,561 INFO 
>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: 
>> /10.10.8.55:50010, dest: /10.10.8.53:58622, bytes: 455, op: HDFS_READ, cliID: 
>> DFSClient_NONMAPREDUCE_-483065515_1, offset: 0, srvID: 
>> ee7d0634-89a3-4ada-a8ad-7848214397be, blockid: 
>> BP-439084760-10.32.0.180-1387281790961:blk_1075351814_1614757, duration: 
>> 755613819
>> There are more than 100,000 of them for today alone. The situation on the 
>> other regionservers is similar.
>> Node 10.10.8.53 is an hbase-master node, and the process on that port is also 
>> hbase-master.
>> So if there is no load on the cluster, why is there so much IO happening?
>> Any thoughts?
>> Thanks.
>> 
>>> On 02 Sep 2015, at 21:57, Ted Yu <yuzhih...@gmail.com> wrote:
>>> 
>>> I assume you have enabled short-circuit read.
>>> 
>>> Can you capture region server stack trace(s) and pastebin them ?
>>> 
>>> Thanks
>>> 
>>> On Wed, Sep 2, 2015 at 12:11 PM, Akmal Abbasov <akmal.abba...@icloud.com> wrote:
>>> Hi Ted,
>>> I’ve checked when the addresses were changed, and this strange 
>>> behaviour started weeks before that.
>>> 
>>> Yes, 10.10.8.55 is a region server and 10.10.8.54 is an hbase master.
>>> Any thoughts?
>>> 
>>> Thanks
>>> 
>>>> On 02 Sep 2015, at 18:45, Ted Yu <yuzhih...@gmail.com> wrote:
>>>> 
>>>> bq. change the ip addresses of the cluster nodes
>>>> 
>>>> Did this happen recently ? If high iowait was observed after the change 
>>>> (you can look at ganglia graph), there is a chance that the change was 
>>>> related.
>>>> 
>>>> BTW I assume 10.10.8.55 is where your region server resides.
>>>> 
>>>> Cheers
>>>> 
>>>> On Wed, Sep 2, 2015 at 9:39 AM, Akmal Abbasov <akmal.abba...@icloud.com> wrote:
>>>> Hi Ted,
>>>> Sorry, I forgot to mention:
>>>> 
>>>>> release of hbase / hadoop you're using
>>>> 
>>>> HBase: hbase-0.98.7-hadoop2, Hadoop: hadoop-2.5.1
>>>> 
>>>>> were region servers doing compaction ?
>>>> 
>>>> I ran major compactions manually earlier today, but judging by 
>>>> compactionQueueSize, they seem to have completed already.
>>>> 
>>>>> have you checked region server logs ?
>>>> The datanode logs are full of messages like this:
>>>> 2015-09-02 16:37:06,950 INFO 
>>>> org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: 
>>>> /10.10.8.55:50010, dest: /10.10.8.54:32959, bytes: 19673, op: HDFS_READ, cliID: 
>>>> DFSClient_NONMAPREDUCE_1225374853_1, offset: 0, srvID: 
>>>> ee7d0634-89a3-4ada-a8ad-7848217327be, blockid: 
>>>> BP-329084760-10.32.0.180-1387281790961:blk_1075277914_1540222, duration: 
>>>> 7881815
>>>> 
>>>> P.S. We had to change the ip addresses of the cluster nodes. Is that 
>>>> relevant?
>>>> 
>>>> Thanks.
>>>> 
>>>>> On 02 Sep 2015, at 18:20, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>> 
>>>>> Please provide some more information:
>>>>> 
>>>>> release of hbase / hadoop you're using
>>>>> were region servers doing compaction ?
>>>>> have you checked region server logs ?
>>>>> 
>>>>> Thanks
>>>>> 
>>>>> On Wed, Sep 2, 2015 at 9:11 AM, Akmal Abbasov <akmal.abba...@icloud.com> wrote:
>>>>> Hi,
>>>>> I’m seeing strange behaviour in my HBase cluster. It is almost idle, with 
>>>>> only <5 puts and gets.
>>>>> But the data in HDFS keeps growing, and the region servers have very high 
>>>>> iowait (>100, on a 2-core CPU).
>>>>> iotop shows that the datanode process is reading and writing all the time.
>>>>> Any suggestions?
>>>>> 
>>>>> Thanks.
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
> 
> 
> 
> 
> -- 
> 
> Adrien Mogenet
> Head of Backend/Infrastructure
> adrien.moge...@contentsquare.com
> (+33)6.59.16.64.22
> http://www.contentsquare.com
> 50, avenue Montaigne - 75008 Paris
