Re: 1 of 20 TServers unresponsive/slow, all writes fail?

dlmarion Fri, 09 Sep 2016 08:06:49 -0700

We have seen this before: a tserver that is hosting metadata tablets has issues 
and starts causing problems within the cluster. You could try using the 
HostRegexTableLoadBalancer[1,2] to segregate your metadata tablets from the 
other tables. This doesn't fully eliminate the SPOF, but it should help to 
ensure that the tablet servers hosting the metadata tablets are not busy doing 
work for other tables.


To do this you would do the following in the shell, then restart the master: 

1) Set the 'master.tablet.balancer' property to the HostRegexTableLoadBalancer 
class name 
2) Set the property 
'table.custom.balancer.host.regex.accumulo.metadata=<regex>' 
3) Set other HostRegexTableLoadBalancer properties if desired 
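
As a rough sketch, steps 1 and 2 might look like the following in the Accumulo shell. This assumes the properties are set system-wide with `config -s`. The class name is taken from the source path in [2], and `<regex>` is left as a placeholder for a pattern matching the tablet servers you want to host the metadata tablets. Check the class Javadoc in [2] for the exact property names supported by your version before applying this.

```
config -s master.tablet.balancer=org.apache.accumulo.server.master.balancer.HostRegexTableLoadBalancer
config -s table.custom.balancer.host.regex.accumulo.metadata=<regex>
```

Running `config -f balancer` afterwards should list the matching properties so you can confirm the values before restarting the master.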

[1] https://issues.apache.org/jira/browse/ACCUMULO-4173 
[2] https://github.com/apache/accumulo/blob/rel/1.7.2/server/base/src/main/java/org/apache/accumulo/server/master/balancer/HostRegexTableLoadBalancer.java

----- Original Message -----

From: "Michael Moss" <[email protected]> 
To: [email protected] 
Cc: "Michael Moss" <[email protected]> 
Sent: Friday, September 9, 2016 10:44:44 AM 
Subject: Re: 1 of 20 TServers unresponsive/slow, all writes fail? 

1.7.2 (client still 1.6.2). 

I think it's an overall design issue, no? Is serving metadata a SPOF? 

On Fri, Sep 9, 2016 at 10:41 AM, Christopher < [email protected] > wrote: 



What version of Accumulo? That could narrow down the search for potential 
known issues. 

On Fri, Sep 9, 2016 at 10:36 AM Michael Moss < [email protected] > wrote: 

<blockquote>

Upon further internal discussion, it looks like the metadata/root tables are 
served from the tservers (not an HA master for example) and the one in question 
was serving it. It was unable to run MajC (compaction) for many hours leading 
up to the time where it couldn't service requests any longer, but it was still 
up, hosting tablets, just very slow or unable to respond. So all writes ended 
up timing out. 

If this condition is possible and there is a SPOF here, it'd be good to see 
what's on the roadmap to address it. 

On Fri, Sep 9, 2016 at 10:24 AM, < [email protected] > wrote: 

<blockquote>

What was happening on that 1 tserver? Was it in garbage collection? Was it 
having network or O/S issues? 


From: "Michael Moss (BLOOMBERG/ 731 LEX)" < [email protected] > 
To: [email protected] 
Sent: Friday, September 9, 2016 9:40:42 AM 
Subject: 1 of 20 TServers unresponsive/slow, all writes fail? 


Hi, 

We are starting to investigate an issue where 1 tserver was up, but became 
slow/unresponsive for several hours, yet all writes to our 20+ servers began to 
fail. We could see leading up to the failure that the writes were distributed 
among all of the tablet servers, so it wasn't a hotspot. Whenever we receive a 
MutationsRejectedException, we recreate the BatchWriter (ACCUMULO-2990). I'm 
digging into the TabletServerBatchWriter code, but any ideas what could cause 
this issue? Is there some sort of initialization or health checking that the 
client does where 1 server could impact all? 

Thanks. 

-Mike 

Caused by: org.apache.accumulo.core.client.TimedOutException: Servers timed out [ pnj-bvlt-r4n03.abc.com:31113 ]
    at org.apache.accumulo.core.client.impl.TabletServerBatchWriter$TimeoutTracker.wroteNothing(TabletServerBatchWriter.java:177) ~[stormjar.jar:1.0]
    at org.apache.accumulo.core.client.impl.TabletServerBatchWriter$TimeoutTracker.errorOccured(TabletServerBatchWriter.java:182) ~[stormjar.jar:1.0]
    at org.apache.accumulo.core.client.impl.TabletServerBatchWriter$MutationWriter.sendMutationsToTabletServer(TabletServerBatchWriter.java:933) ~[stormjar.jar:1.0]
    at

</blockquote>


</blockquote>
