Ivan Bella created ACCUMULO-4777:
------------------------------------

             Summary: Root tablet got spammed with 1.8 million log entries
                 Key: ACCUMULO-4777
                 URL: https://issues.apache.org/jira/browse/ACCUMULO-4777
             Project: Accumulo
          Issue Type: Bug
    Affects Versions: 1.8.1
            Reporter: Ivan Bella
            Priority: Critical


We had a tserver hosting accumulo.metadata tablets that somehow got 
into a loop in which it created over 22K empty WALs.  There were around 70 
metadata tablets, which resulted in around 1.8 million log entries being 
added to the accumulo.root table.  The only reason it stopped creating WALs 
is that it ran out of open file handles.  This took us many hours and cups 
of coffee to clean up.

The log contained the following messages in a tight loop:

log.TabletServerLogger INFO : Using next log hdfs://...
tserver.TabletServer INFO : Writing log marker for hdfs://...
tserver.TabletServer INFO : Marking hdfs://... closed
log.DfsLogger INFO : Slow sync cost ...
...

Unfortunately we did not have DEBUG turned on so we have no debug messages.

Tracking through the code there are three places where the 
TabletServerLogger.close method is called:
1) via resetLoggers in the TabletServerLogger, but nothing calls this method so 
this is ruled out
2) when the log gets too large or too old, but neither of those checks should 
have been hitting here.
3) in the retry loop (while (!success)) in the TabletServerLogger.write 
method.  In this case, when a write to the WAL fails, that WAL is closed and 
a new one is created.  The loop runs until the entry is written 
successfully, with no limit on the number of attempts.  A 
DfsLogger.LogClosedException seems the most likely trigger, most probably 
because a ClosedChannelException was thrown from the DfsLogger.write 
methods (around line 609 in DfsLogger).

So the root cause was most likely Hadoop related.  However, in Accumulo we 
probably should not be doing a tight retry loop around a Hadoop failure.  I 
recommend, at a minimum, some sort of exponential backoff, and perhaps a 
limit on the number of retries, after which the tserver fails with a 
critical error.
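
To make the suggestion concrete, here is a minimal sketch of a bounded retry 
with exponential backoff.  It is not Accumulo code: BackoffRetry, 
WriteAttempt, RetriesExhaustedException and all parameter names are 
hypothetical, and the WAL write is represented only by a callback that may 
throw an IOException.

```java
import java.io.IOException;

public class BackoffRetry {

    /** Hypothetical: signals that the retry budget was used up. */
    public static class RetriesExhaustedException extends Exception {
        public RetriesExhaustedException(String msg, Throwable cause) {
            super(msg, cause);
        }
    }

    /** Hypothetical stand-in for one attempt to write an entry to a WAL. */
    public interface WriteAttempt {
        void run() throws IOException;
    }

    /** Delay before the next try: base * 2^attempt, capped at maxMillis. */
    public static long delayMillis(int attempt, long baseMillis, long maxMillis) {
        long delay = baseMillis;
        for (int i = 0; i < attempt; i++) {
            if (delay > maxMillis / 2) {
                return maxMillis; // doubling would exceed the cap
            }
            delay *= 2;
        }
        return Math.min(delay, maxMillis);
    }

    /**
     * Runs the write until it succeeds or maxRetries retries have failed.
     * Returns the number of attempts that were made.
     */
    public static int writeWithBackoff(WriteAttempt write, int maxRetries,
            long baseMillis, long maxMillis)
            throws RetriesExhaustedException, InterruptedException {
        IOException last = null;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                write.run();
                return attempt + 1;
            } catch (IOException e) {
                last = e;
                if (attempt < maxRetries) {
                    // Sleep instead of immediately opening another log.
                    Thread.sleep(delayMillis(attempt, baseMillis, maxMillis));
                }
            }
        }
        // The caller decides what "critical failure" means, e.g. halting.
        throw new RetriesExhaustedException(
                "write failed after " + (maxRetries + 1) + " attempts", last);
    }
}
```

With a cap on the delay and on the number of retries, a persistent HDFS 
failure would produce a handful of new WALs and then a visible error, 
rather than an unbounded number of them.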



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
