You didn't mention anything about a TTransportException earlier.

I don't remember the difference between Thrift exceptions (e.g. TTransportException, TApplicationException). I think one is supposed to be network-focused (e.g. socket timeout) and the other is application-focused (e.g. TabletServer got an error).

If the Thrift exception is just wrapping an exception thrown by the TabletServer, it's likely just the wrapping needed to support serialization over the wire (which should be unwrapped by the client impl, fwiw).
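
To see what is underneath the wrapping on the client side, you can walk the cause chain of whatever exception the writer hands back. This is only a rough sketch using plain JDK types: the class name RootCause, the helper rootCause, and the exceptions built in main are stand-ins I made up for illustration, not Accumulo or Thrift API. A server-side exception may also arrive only as a message string rather than as a Java cause, so this may not get you all the way to the original error.

```java
import java.io.IOException;

public class RootCause {
    /** Follows getCause() links until the innermost throwable is reached. */
    public static Throwable rootCause(Throwable t) {
        Throwable current = t;
        // Cap the walk so a malformed, cyclic cause chain cannot loop forever.
        for (int hops = 0; hops < 64; hops++) {
            Throwable next = current.getCause();
            if (next == null || next == current) {
                break;
            }
            current = next;
        }
        return current;
    }

    public static void main(String[] args) {
        // Hypothetical stand-ins: a real client would see Thrift and Accumulo types here.
        Exception serverSide = new IOException("hypothetical server-side failure");
        Exception wireWrapper = new RuntimeException("hypothetical transport wrapper", serverSide);
        Exception clientSide = new RuntimeException("write failed", wireWrapper);

        // Print the innermost exception rather than the outer wrapper.
        System.out.println(rootCause(clientSide));
    }
}
```

Logging the innermost cause next to the outer exception makes it easier to tell a network-level failure from an error raised by the server.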

On 7/12/19 4:38 PM, James Srinivasan wrote:
Thanks, I'll check it out. There weren't any obvious errors pointing to hardware issues.

Is it likely that the TTransportException and commits held are related?


On Fri, 12 Jul 2019, 18:56 Josh Elser wrote:

    "Commits are held" can be for a couple of different reasons, some from
    within Accumulo and some from outside.

    In general, there is an expected ordering of mutations that a
    TabletServer has to apply. A "commit" here is the application of some
    mutations by a TabletServer to the memory map and the WAL.

    This could be completely normal and you have some clients which are
    just writing "faster" than your TabletServers can keep up with. This
    could be indicative of slow flushes from memory maps to HDFS. This
    could be GC pressure causing slowness in the TServer.

    I'd suggest taking a step back:

    * Look at other messages in the DEBUG log for the tabletserver to
    see if Accumulo is telling you what it's waiting on (before and after
    you see the message about commits being held)
    * Check that you're using the Accumulo native memory maps
    * Sanity-check performance of HDFS
    * Get a thread dump from a TabletServer in this state.

    If the problem truly only happens on two servers, it might indicate
    some bad hardware on that device (memory with errors, a disk that
    flips to r/o).

    - Josh

    On 7/12/19 10:57 AM, James Srinivasan wrote:
     > Hi all,
     >
     > We have a Kerberized Accumulo 1.7.0 (HDP3) cluster with 25 tservers.
     > Recently, a couple of clients were reporting errors writing data
     > (fat-fingered from cluster, apologies for typos):
     >
     >
     > org.apache.accumulo.core.client.impl.TabletServerBatchWriter.checkForFailures
     > ...
     > Caused by:
     > org.apache.accumulo.core.client.impl.TabletServerBatchWriter$MutationWriter.sendMutationsToTabletServer
     >
     > Digging into the logs on the problematic tservers, I think the
     > following was firing, but don't know why:
     >
     >
     > https://github.com/apache/thrift/blob/0.9.1/lib/java/src/org/apache/thrift/transport/TIOStreamTransport.java#L132
     >
     > Also, the tserver logs report:
     >
     > Internal error processing closeUpdate....TException: Commits are held
     >
     > For now, I have stopped the two problematic tservers but any help
     > debugging would be much appreciated.
     >
     > James
     >
