[ https://issues.apache.org/jira/browse/ACCUMULO-2990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Corey J. Nolet updated ACCUMULO-2990:
-------------------------------------
    Fix Version/s:     (was: 1.6.2)
                       1.6.3

> BatchWriter never recovers from failure(s)
> ------------------------------------------
>
>                 Key: ACCUMULO-2990
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-2990
>             Project: Accumulo
>          Issue Type: Bug
>          Components: client
>    Affects Versions: 1.5.1, 1.6.0
>            Reporter: Josh Elser
>            Priority: Critical
>             Fix For: 1.5.3, 1.7.0, 1.6.3
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> In trying to understand what's happening in ACCUMULO-2964, I noticed that I
> had similar exceptions from two different threads. One of the threads
> started working again after the unexplained Thrift exceptions from a tserver
> restart, and the other continued to fail repeatedly for the lifetime of the
> test.
> I repeatedly saw this exception:
> {noformat}
> 2014-07-11 04:14:41,591 [replication.WorkMaker] WARN : Failed to write work mutations for replication, will retry
> org.apache.accumulo.core.client.MutationsRejectedException: # constraint violations : 0 security codes: {accumulo.metadata(ID:!0)=[DEFAULT_SECURITY_ERROR]} # server errors 0 # exceptions 0
>     at org.apache.accumulo.core.client.impl.TabletServerBatchWriter.checkForFailures(TabletServerBatchWriter.java:537)
>     at org.apache.accumulo.core.client.impl.TabletServerBatchWriter.addMutation(TabletServerBatchWriter.java:249)
>     at org.apache.accumulo.core.client.impl.BatchWriterImpl.addMutation(BatchWriterImpl.java:45)
>     at org.apache.accumulo.master.replication.WorkMaker.addWorkRecord(WorkMaker.java:184)
>     at org.apache.accumulo.master.replication.WorkMaker.run(WorkMaker.java:124)
>     at org.apache.accumulo.master.replication.ReplicationDriver.run(ReplicationDriver.java:91)
> {noformat}
> The part that struck me as odd was that the BatchWriter wasn't writing to the
> metadata table, but to the replication table.
> I looked into the TabletServerBatchWriter. It appears that once the client
> sees a MutationsRejectedException, that BatchWriter becomes useless because
> the internal member {{somethingFailed}} is never reset to {{false}} after
> the failure is reported. The same goes for {{serverSideErrors}},
> {{unknownErrors}}, and {{lastUnknownErrors}}.
> If this is the case, it is a bug: the BatchWriter should be resilient in
> this regard and not force the client to create a new instance.
> If that's infeasible, we should add exceptions to the BatchWriter that fail
> fast when a BatchWriter is used that would otherwise report the same failure
> over and over again.
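
The sticky-flag behaviour described above can be illustrated with a minimal sketch in plain Java, with no Accumulo dependency. This is not the TabletServerBatchWriter code. The names StickyFailureSketch, Writer, RejectedException, resetAfterReport, simulateBackgroundFailure and countRejections are hypothetical and exist only for illustration; only the field name somethingFailed is taken from the issue text. The sketch assumes a failure is recorded once in the background and then checked on every addMutation call.

```java
/**
 * Hypothetical, simplified model of a writer with a "sticky" failure flag.
 * It is NOT Accumulo code; it only mirrors the behaviour the issue describes.
 */
public class StickyFailureSketch {

    /** Stand-in for MutationsRejectedException (hypothetical type). */
    static class RejectedException extends Exception {
        RejectedException(String message) {
            super(message);
        }
    }

    /** Toy writer that checks a failure flag before accepting each mutation. */
    static class Writer {
        private final boolean resetAfterReport; // true = clear the flag once reported
        private boolean somethingFailed;        // field name taken from the issue
        private int accepted;

        Writer(boolean resetAfterReport) {
            this.resetAfterReport = resetAfterReport;
        }

        /** Simulates one failed background send, e.g. during a tserver restart. */
        void simulateBackgroundFailure() {
            somethingFailed = true;
        }

        void addMutation(String mutation) throws RejectedException {
            checkForFailures();
            accepted++;
        }

        private void checkForFailures() throws RejectedException {
            if (!somethingFailed) {
                return;
            }
            if (resetAfterReport) {
                // Resilient variant: report the failure once, then recover.
                somethingFailed = false;
            }
            // Without the reset above, every later call ends up here again.
            throw new RejectedException("a previous write failed");
        }

        int accepted() {
            return accepted;
        }
    }

    /** Tries the given number of writes and counts how many are rejected. */
    static int countRejections(Writer writer, int attempts) {
        int rejected = 0;
        for (int i = 0; i < attempts; i++) {
            try {
                writer.addMutation("mutation-" + i);
            } catch (RejectedException e) {
                rejected++;
            }
        }
        return rejected;
    }

    public static void main(String[] args) {
        Writer sticky = new Writer(false);
        sticky.simulateBackgroundFailure();
        System.out.println("sticky rejections: " + countRejections(sticky, 3));

        Writer resetting = new Writer(true);
        resetting.simulateBackgroundFailure();
        System.out.println("resetting rejections: " + countRejections(resetting, 3));
    }
}
```

With the flag left set, a single earlier failure rejects every later write, which matches the repeated WARN lines in the log. With the flag cleared after it is reported, only the first write after the failure is rejected. Until the writer itself behaves that way, a client would presumably have to discard the failed BatchWriter and create a new one, which is the situation the issue objects to.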