Upon further internal discussion, it looks like the metadata/root tables are served from the tservers (not from an HA master, for example), and the tserver in question was serving it. That tserver was unable to run major compactions (MajC) for many hours leading up to the point where it could no longer service requests. It was still up and hosting tablets, but it was very slow or unable to respond, so all writes ended up timing out.
If this condition is possible and there is a SPOF here, it'd be good to see what's on the roadmap to address it.

On Fri, Sep 9, 2016 at 10:24 AM, <[email protected]> wrote:

> What was happening on that 1 tserver? Was it in garbage collection? Was it
> having network or O/S issues?
>
> ------------------------------
> *From: *"Michael Moss (BLOOMBERG/ 731 LEX)" <[email protected]>
> *To: *[email protected]
> *Sent: *Friday, September 9, 2016 9:40:42 AM
> *Subject: *1 of 20 TServers unresponsive/slow, all writes fail?
>
> Hi,
>
> We are starting to investigate an issue where 1 tserver was up, but became
> slow/unresponsive for several hours, yet all writes to our 20+ servers
> began to fail. We could see leading up to the failure that the writes were
> distributed among all of the tablet servers, so it wasn't a hotspot.
> Whenever we receive a MutationsRejectedException, we recreate the
> BatchWriter (ACCUMULO-2990). I'm digging into the TabletServerBatchWriter
> code, but any ideas what could cause this issue? Is there some sort of
> initialization or healthchecking that the client does where 1 server could
> impact all?
>
> Thanks.
>
> -Mike
>
> Caused by: org.apache.accumulo.core.client.TimedOutException: Servers timed out [pnj-bvlt-r4n03.abc.com:31113]
>   at org.apache.accumulo.core.client.impl.TabletServerBatchWriter$TimeoutTracker.wroteNothing(TabletServerBatchWriter.java:177) ~[stormjar.jar:1.0]
>   at org.apache.accumulo.core.client.impl.TabletServerBatchWriter$TimeoutTracker.errorOccured(TabletServerBatchWriter.java:182) ~[stormjar.jar:1.0]
>   at org.apache.accumulo.core.client.impl.TabletServerBatchWriter$MutationWriter.sendMutationsToTabletServer(TabletServerBatchWriter.java:933) ~[stormjar.jar:1.0]
>   at
