tablet servers are losing zookeeper locks due to garbage collection even when
there is lots of free memory
----------------------------------------------------------------------------------------------------------
Key: ACCUMULO-294
URL: https://issues.apache.org/jira/browse/ACCUMULO-294
Project: Accumulo
Issue Type: Bug
Components: tserver
Affects Versions: 1.3.5
Environment: tablet servers on a large cluster are losing their locks
Reporter: Eric Newton
Assignee: Eric Newton
Priority: Minor
Five tablet servers stopped on a large cluster. Each server had lost its lock
due to a ZooKeeper session timeout; the ZooKeeper timeout is set to 40
seconds. In every case, the lost lock was preceded by the ejection of blocks
from the block cache and by a garbage collection that recovered >4G of memory.
The tablet servers were running with 8G and generally had 4G free. Very little
time was attributed to garbage collection, at least as printed in the debug
log. The in-memory map is small (256M) and uses the native version.
Change to the tablet server JVM options, from:
{noformat}
-XX:CMSInitiatingOccupancyFraction=75
{noformat}
to
{noformat}
-XX:CMSInitiatingOccupancyFraction=60
{noformat}
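As a rough illustration of what lowering this setting means, the sketch below computes the old-generation occupancy at which a CMS cycle would begin for each value. The old-generation size of 6144 MiB is purely an assumption for the arithmetic; the report only states that the servers run with 8G, and the class and method names are hypothetical.

```java
public class CmsThreshold {
    // Occupancy (in MiB) of the old generation at which CMS would start a
    // concurrent collection, given the generation size and the percentage
    // passed to -XX:CMSInitiatingOccupancyFraction.
    static long thresholdMiB(long oldGenMiB, int occupancyPercent) {
        return oldGenMiB * occupancyPercent / 100;
    }

    public static void main(String[] args) {
        long assumedOldGenMiB = 6144; // assumption: not stated in the report

        long before = thresholdMiB(assumedOldGenMiB, 75);
        long after = thresholdMiB(assumedOldGenMiB, 60);

        System.out.println("75% -> " + before + " MiB");
        System.out.println("60% -> " + after + " MiB");
        System.out.println("CMS starts " + (before - after) + " MiB earlier");
    }
}
```

Under that assumed size, the collector would start roughly 900 MiB earlier, leaving more headroom before the old generation fills.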
ZooKeeper has already been configured with the following, which helped
enormously:
{noformat}
globalOutstandingLimit=10000
{noformat}
Each ZooKeeper server has between 500 and 1700 clients.
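Because the debug log attributes very little time to garbage collection, one way to check whether the process itself stalls for anything close to the 40-second timeout is to measure how far a short sleep overshoots. The following is only a minimal sketch of that idea, not code from Accumulo; the class name, the 1-second interval, and the 50% warning fraction are all assumptions.

```java
public class PauseDetector {
    // Sleeps for the requested time and returns how many milliseconds longer
    // than requested the sleep actually took. A large overshoot suggests the
    // whole JVM was stalled (by GC, swapping, or something else).
    static long overshootMillis(long sleepMillis) throws InterruptedException {
        long start = System.nanoTime();
        Thread.sleep(sleepMillis);
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000L;
        return Math.max(0L, elapsedMillis - sleepMillis);
    }

    // True when a pause has consumed at least the given fraction of the
    // session timeout.
    static boolean isDangerous(long pauseMillis, long timeoutMillis, double fraction) {
        return pauseMillis >= (long) (timeoutMillis * fraction);
    }

    public static void main(String[] args) throws InterruptedException {
        final long zkTimeoutMillis = 40_000L; // session timeout from the report
        final long intervalMillis = 1_000L;   // assumption: sampling interval
        final double warnFraction = 0.5;      // assumption: warn at half the timeout

        for (int i = 0; i < 5; i++) {
            long pause = overshootMillis(intervalMillis);
            if (isDangerous(pause, zkTimeoutMillis, warnFraction)) {
                System.out.println("stalled for about " + pause + " ms");
            }
        }
    }
}
```

A check like this measures wall-clock stalls regardless of their cause, so it would also catch pauses that never show up as garbage collection time in the log.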