We have seen this before: a tserver that is hosting metadata tablets has issues and starts causing problems within the cluster. You could try using the HostRegexTableLoadBalancer[1,2] to segregate your metadata tablets from the other tables. This doesn't fully eliminate the SPOF, but it should help to ensure that the tablet servers hosting the metadata tablets are not busy doing work for other tables.
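Since this balancer assigns tablets by matching tablet server host names against a regex (the property for that is in the steps that follow), it can be worth checking a candidate regex against your host list before applying it. Below is a minimal standalone sketch using only java.util.regex. The class name, host names, and regex are hypothetical. It assumes the regex has to match the whole host name; please confirm against the balancer source linked in [2] for your version.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Accumulo: previews which hosts a
// candidate metadata-pool regex would select.
public class MetadataPoolRegexCheck {

  // Returns the hosts whose entire name matches the regex.
  // Full-match semantics are an assumption; verify against the balancer source.
  public static List<String> hostsMatching(String regex, List<String> hosts) {
    Pattern pattern = Pattern.compile(regex);
    List<String> matched = new ArrayList<>();
    for (String host : hosts) {
      if (pattern.matcher(host).matches()) {
        matched.add(host);
      }
    }
    return matched;
  }

  public static void main(String[] args) {
    // Hypothetical tserver host names; substitute your own.
    List<String> tservers = Arrays.asList(
        "meta-01.example.com",
        "meta-02.example.com",
        "data-01.example.com",
        "data-02.example.com");

    // Candidate value for the <regex> placeholder in the host regex property.
    String regex = "meta-\\d+\\.example\\.com";

    System.out.println("metadata pool: " + hostsMatching(regex, tservers));
  }
}
```

If the printed list is empty or contains hosts you did not intend, adjust the regex before setting the property.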
To do this you would do the following in the shell, then restart the master:

1) Set the 'master.tablet.balancer' property to the HostRegexTableLoadBalancer class name
2) Set the property 'table.custom.balancer.host.regex.accumulo.metadata=<regex>'
3) Set other HostRegexTableLoadBalancer properties if desired

[1] https://issues.apache.org/jira/browse/ACCUMULO-4173
[2] https://github.com/apache/accumulo/blob/rel/1.7.2/server/base/src/main/java/org/apache/accumulo/server/master/balancer/HostRegexTableLoadBalancer.java

----- Original Message -----
From: "Michael Moss" <[email protected]>
To: [email protected]
Cc: "Michael Moss" <[email protected]>
Sent: Friday, September 9, 2016 10:44:44 AM
Subject: Re: 1 of 20 TServers unresponsive/slow, all writes fail?

1.7.2 (client still 1.6.2). I think it's an overall design issue, no? Serving metadata is a SPOF?

On Fri, Sep 9, 2016 at 10:41 AM, Christopher < [email protected] > wrote:

What version of Accumulo? Could narrow down the search for known issue potentials.

On Fri, Sep 9, 2016 at 10:36 AM Michael Moss < [email protected] > wrote:

<blockquote>
Upon further internal discussion, it looks like the metadata/root tables are served from the tservers (not an HA master for example) and the one in question was serving it. It was unable to run MajC (compaction) for many hours leading up to the time where it couldn't service requests any longer, but it was still up, hosting tablets, just very slow or unable to respond. So all writes ended up timing out. If this condition is possible and there is a SPOF here, it'd be good to see what's on the roadmap to address it.

On Fri, Sep 9, 2016 at 10:24 AM, < [email protected] > wrote:

<blockquote>
What was happening on that 1 tserver? Was it in garbage collection? Was it having network or O/S issues?

From: "Michael Moss (BLOOMBERG/ 731 LEX)" < [email protected] >
To: [email protected]
Sent: Friday, September 9, 2016 9:40:42 AM
Subject: 1 of 20 TServers unresponsive/slow, all writes fail?

Hi,

We are starting to investigate an issue where 1 tserver was up, but became slow/unresponsive for several hours, yet all writes to our 20+ servers began to fail. We could see leading up to the failure that the writes were distributed among all of the tablet servers, so it wasn't a hotspot. Whenever we receive a MutationsRejectedException, we recreate the BatchWriter (ACCUMULO-2990). I'm digging into the TabletServerBatchWriter code, but any ideas what could cause this issue? Is there some sort of initialization or healthchecking that the client does where 1 server could impact all?

Thanks.

-Mike

Caused by: org.apache.accumulo.core.client.TimedOutException: Servers timed out [ pnj-bvlt-r4n03.abc.com:31113 ]
    at org.apache.accumulo.core.client.impl.TabletServerBatchWriter$TimeoutTracker.wroteNothing(TabletServerBatchWriter.java:177) ~[stormjar.jar:1.0]
    at org.apache.accumulo.core.client.impl.TabletServerBatchWriter$TimeoutTracker.errorOccured(TabletServerBatchWriter.java:182) ~[stormjar.jar:1.0]
    at org.apache.accumulo.core.client.impl.TabletServerBatchWriter$MutationWriter.sendMutationsToTabletServer(TabletServerBatchWriter.java:933) ~[stormjar.jar:1.0]
    at
</blockquote>
</blockquote>
