Sorry for not being clearer - the TTransportException was in the link. For some reason, Thrift seems to hit an unexpected end of stream at the transport level. I'll try to get a better stack trace tomorrow.
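
For capturing that stack trace, a minimal sketch of walking an exception's cause chain to see whether a transport-level exception is buried inside whatever the client reports. This is not code from the thread: the class name CauseChain and its two helper methods are hypothetical, and matching on the simple class name "TTransportException" is an assumption made so the sketch runs without Thrift on the classpath.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class CauseChain {
    /** Returns one "ClassName: message" entry per throwable in the chain, outermost first. */
    static List<String> describe(Throwable t) {
        List<String> out = new ArrayList<>();
        // Depth cap guards against cyclic cause chains.
        for (int depth = 0; t != null && depth < 32; depth++, t = t.getCause()) {
            out.add(t.getClass().getName() + ": " + t.getMessage());
        }
        return out;
    }

    /** True if any throwable in the chain has the given simple class name. */
    static boolean chainContains(Throwable t, String simpleName) {
        for (int depth = 0; t != null && depth < 32; depth++, t = t.getCause()) {
            if (t.getClass().getSimpleName().equals(simpleName)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Stand-in exceptions only; a real run would pass in the exception
        // caught from the failing write.
        Throwable e = new RuntimeException("write failed", new IOException("end of stream"));
        describe(e).forEach(System.out::println);
        System.out.println(chainContains(e, "TTransportException"));
    }
}
```

Logging the output of describe() for the exception caught around the failing write should show each wrapped layer, which may help tell a network-level failure apart from an error raised on the TabletServer side.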

On Mon, 15 Jul 2019 at 17:06, Josh Elser <[email protected]> wrote:
>
> You didn't mention anything about a TTransportException earlier.
>
> I don't remember the difference between Thrift exceptions (e.g.
> TTransportException, TApplicationException). I think one is supposed to
> be network-focused (e.g. socket timeout) and the other is
> application-focused (e.g. TabletServer got an error).
>
> If the Thrift exception is just wrapping an exception thrown by the
> TabletServer, it's likely just necessary wrapping (which should be
> unwrapped by the client impl, fwiw) to support serialization over the wire.
>
> On 7/12/19 4:38 PM, James Srinivasan wrote:
> > Thanks, I'll check it out. There weren't any obvious errors around
> > hardware issues.
> >
> > Is it likely that the TTransportException and commits held are related?
> >
> >
> > On Fri, 12 Jul 2019, 18:56 Josh Elser, <[email protected]> wrote:
> >
> >     "Commits are held" can be for a couple of different reasons, some from
> >     within Accumulo and some from outside.
> >
> >     In general, there is an expected ordering of mutations that a
> >     TabletServer has to apply. A "commit" here is the application of some
> >     mutations by a TabletServer to the memory map and the WAL.
> >
> >     This could be completely normal and you have some clients which are
> >     just writing "faster" than your TabletServers can keep up with. This
> >     could be indicative of slow flushes from memory maps to HDFS. This
> >     could be GC pressure causing slowness in the TServer.
> >
> >     I'd suggest to take a step back:
> >
> >     * Look at other messages in the DEBUG log for the tabletserver to
> >       see if Accumulo is telling you what it's waiting on (before and
> >       after you see the message about commits being held)
> >     * Check that you're using the Accumulo native memory maps
> >     * Sanity-check performance of HDFS
> >     * Get a thread dump from a TabletServer in this state.
> >
> >     If the problem truly only happens on two servers, it might indicate
> >     some bad hardware on that device (memory with errors, a disk that
> >     flips to r/o).
> >
> >     - Josh
> >
> >     On 7/12/19 10:57 AM, James Srinivasan wrote:
> >     > Hi all,
> >     >
> >     > We have a Kerberized Accumulo 1.7.0 (HDP3) cluster with 25 tservers.
> >     > Recently, a couple of clients were reporting errors writing data
> >     > (fat fingered from cluster, apologies for typos):
> >     >
> >     > org.apache.accumulo.core.client.impl.TabletServerBatchWriter.checkForFailures
> >     > ...
> >     > Caused by:
> >     > org.apache.accumulo.core.client.impl.TabletServerBatchWriter$MutationWriter.sendMutationsToTabletServer
> >     >
> >     > Digging into the logs on the problematic tservers, I think the
> >     > following was firing, but don't know why:
> >     >
> >     > https://github.com/apache/thrift/blob/0.9.1/lib/java/src/org/apache/thrift/transport/TIOStreamTransport.java#L132
> >     >
> >     > Also, the tserver logs report:
> >     >
> >     > Internal error processing closeUpdate....TException: Commits are held
> >     >
> >     > For now, I have stopped the two problematic tservers but any help
> >     > debugging would be much appreciated.
> >     >
> >     > James
