tablet servers are losing zookeeper locks due to garbage collection even when 
there is lots of free memory
----------------------------------------------------------------------------------------------------------

                 Key: ACCUMULO-294
                 URL: https://issues.apache.org/jira/browse/ACCUMULO-294
             Project: Accumulo
          Issue Type: Bug
          Components: tserver
    Affects Versions: 1.3.5
         Environment: tablet servers on a large cluster are losing their locks
            Reporter: Eric Newton
            Assignee: Eric Newton
            Priority: Minor


Noticed that 5 tablet servers stopped on a large cluster.  Found that each 
server had lost its lock due to a ZooKeeper session timeout. The ZooKeeper 
timeout is set to 40 seconds. In every case, the lost lock was preceded by 
the ejection of blocks from the block cache and a garbage collection that 
recovered >4G of memory.  The tablet servers were running with 8G, and 
generally had 4G free.  Very little time was attributed to garbage 
collection, at least as printed in the debug log.  The in-memory map is 
small (256M) and uses the native implementation.
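
Because the debug log attributes little time to garbage collection, it may help to measure JVM stalls directly, whatever their cause. The following is a minimal sketch, not Accumulo code: the class name PauseDetector, the 100 ms sleep interval, and the 1 second warning threshold are all assumptions made for illustration. It sleeps for a short fixed interval and reports how far each sleep overshoots. Comparing the stall against the 40 second session timeout is a simplification, since a session can expire after a somewhat shorter stall depending on when the last heartbeat was sent.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper (not part of Accumulo): reports long JVM stalls. */
public class PauseDetector {

    /** How far a sleep overshot its requested duration, in milliseconds (never negative). */
    public static long overshootMillis(long requestedMillis, long startNanos, long endNanos) {
        long elapsedMillis = TimeUnit.NANOSECONDS.toMillis(endNanos - startNanos);
        return Math.max(0L, elapsedMillis - requestedMillis);
    }

    /** True if a stall is at least as long as the session timeout (a simplification). */
    public static boolean exceedsSessionTimeout(long stallMillis, long sessionTimeoutMillis) {
        return stallMillis >= sessionTimeoutMillis;
    }

    public static void main(String[] args) throws InterruptedException {
        final long sleepMillis = 100;             // assumed polling interval
        final long warnMillis = 1_000;            // assumed reporting threshold
        final long sessionTimeoutMillis = 40_000; // the 40 second timeout from the report
        int iterations = args.length > 0 ? Integer.parseInt(args[0]) : 50;

        for (int i = 0; i < iterations; i++) {
            long start = System.nanoTime();
            Thread.sleep(sleepMillis);
            long stall = overshootMillis(sleepMillis, start, System.nanoTime());
            if (stall >= warnMillis) {
                System.err.printf("JVM stalled for about %d ms%s%n", stall,
                        exceedsSessionTimeout(stall, sessionTimeoutMillis)
                                ? " (at least the session timeout)" : "");
            }
        }
    }
}
```

A loop like this would also pick up stalls from causes other than garbage collection, such as swapping, which could explain a lost lock when the logged collection times look small.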

JVM setting change, from
{noformat}
-XX:CMSInitiatingOccupancyFraction=75
{noformat}

to
{noformat}
-XX:CMSInitiatingOccupancyFraction=60
{noformat}
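
As a rough illustration of what moving the fraction from 75 to 60 means: the option sets the old-generation occupancy percentage at which a CMS cycle starts. The sketch below computes that threshold for an assumed old-generation size. The report gives only the 8G total, so treating all 8192M as old generation is an assumption that yields an upper bound, and the class and method names are hypothetical. Unless -XX:+UseCMSInitiatingOccupancyOnly is also set, the JVM treats the fraction as a starting hint rather than a fixed threshold.

```java
/** Hypothetical worked example: occupancy at which a CMS cycle would start. */
public class CmsThreshold {

    /** Threshold in MiB for a given old-generation size and initiating percentage. */
    public static long thresholdMiB(long oldGenMiB, int fractionPercent) {
        return oldGenMiB * fractionPercent / 100; // integer division, rounds down
    }

    public static void main(String[] args) {
        long oldGenMiB = 8192; // assumption: the whole 8G treated as old generation
        System.out.println("75%: " + thresholdMiB(oldGenMiB, 75) + " MiB");
        System.out.println("60%: " + thresholdMiB(oldGenMiB, 60) + " MiB");
    }
}
```

Under that assumption, the lower setting starts collection roughly 1.2G earlier, which leaves more headroom for the concurrent collector to finish before the heap fills.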

ZooKeeper has already been configured with this:
{noformat}
globalOutstandingLimit=10000
{noformat}

This helped enormously.  Each ZooKeeper server has between 500 and 1700 
clients.

