1.7.2 (client still 1.6.2). I think it's an overall design issue, no? Is serving metadata a SPOF?
On Fri, Sep 9, 2016 at 10:41 AM, Christopher <[email protected]> wrote:

> What version of Accumulo? Could narrow down the search for known issue
> potentials.
>
> On Fri, Sep 9, 2016 at 10:36 AM Michael Moss <[email protected]>
> wrote:
>
>> Upon further internal discussion, it looks like the metadata/root tables
>> are served from the tservers (not an HA master for example) and the one in
>> question was serving it. It was unable to run MajC (compaction) for many
>> hours leading up to the time where it couldn't service requests any longer,
>> but it was still up, hosting tablets, just very slow or unable to respond.
>> So all writes ended up timing out.
>>
>> If this condition is possible and there is a SPOF here, it'd be good to
>> see what's on the roadmap to address it.
>>
>> On Fri, Sep 9, 2016 at 10:24 AM, <[email protected]> wrote:
>>
>>> What was happening on that 1 tserver? Was it in garbage collection? Was
>>> it having network or O/S issues?
>>>
>>> ------------------------------
>>> *From: *"Michael Moss (BLOOMBERG/ 731 LEX)" <[email protected]>
>>> *To: *[email protected]
>>> *Sent: *Friday, September 9, 2016 9:40:42 AM
>>> *Subject: *1 of 20 TServers unresponsive/slow, all writes fail?
>>>
>>>
>>> Hi,
>>>
>>> We are starting to investigate an issue where 1 tserver was up, but
>>> became slow/unresponsive for several hours, yet all writes to our 20+
>>> servers began to fail. We could see leading up to the failure that the
>>> writes were distributed among all of the tablet servers, so it wasn't a
>>> hotspot. Whenever we receive a MutationsRejectedException, we recreate the
>>> BatchWriter (ACCUMULO-2990). I'm digging into the TabletServerBatchWriter
>>> code, but any ideas what could cause this issue? Is there some sort of
>>> initialization or healthchecking that the client does where 1 server could
>>> impact all?
>>>
>>> Thanks.
>>>
>>> -Mike
>>>
>>> Caused by: org.apache.accumulo.core.client.TimedOutException: Servers timed out [pnj-bvlt-r4n03.abc.com:31113]
>>>     at org.apache.accumulo.core.client.impl.TabletServerBatchWriter$TimeoutTracker.wroteNothing(TabletServerBatchWriter.java:177) ~[stormjar.jar:1.0]
>>>     at org.apache.accumulo.core.client.impl.TabletServerBatchWriter$TimeoutTracker.errorOccured(TabletServerBatchWriter.java:182) ~[stormjar.jar:1.0]
>>>     at org.apache.accumulo.core.client.impl.TabletServerBatchWriter$MutationWriter.sendMutationsToTabletServer(TabletServerBatchWriter.java:933) ~[stormjar.jar:1.0]
>>>     at
>>>
