tablet server runs out of memory performing a major compaction
--------------------------------------------------------------
Key: ACCUMULO-201
URL: https://issues.apache.org/jira/browse/ACCUMULO-201
Project: Accumulo
Issue Type: Bug
Components: tserver
Reporter: Eric Newton
Assignee: Eric Newton
An Accumulo user watched their cluster slowly shrink: one tablet server would
fail every 8-10 minutes.
We determined that a major compaction of a single tablet would cause the tablet
server to run out of memory. That tablet would then be assigned to a new server,
which would schedule its own major compaction and die as well.
# it was harder than it should have been to identify the tablet causing the
problem
# the tablet had a combination of several large existing files and a few
bulk-loaded files containing a few very large key/values
# the large key/values were between *10 and 100 megabytes each*; the tablet
server had a 1G memory limit
# the next key/value from each file sits in memory while the merge-sort runs
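The arithmetic behind the last two points can be sketched as follows. This is
only a rough estimate, not Accumulo code: the class and method names are
hypothetical, and the file count of 8 in the usage note is an assumption, since
the report only says "several" large files plus "a few" bulk-loaded ones.

```java
// Rough worst-case estimate of the heap held by a merge-sort that keeps
// one pending key/value per input file. All names here are hypothetical.
final class MergeMemoryEstimate {
    static final long MB = 1024L * 1024L;

    /** Worst case: every input file has a largest-size entry buffered at once. */
    static long worstCaseBytes(int fileCount, long largestEntryBytes) {
        return fileCount * largestEntryBytes;
    }

    /** Fraction of the heap limit consumed by the buffered entries alone. */
    static double heapFraction(int fileCount, long largestEntryBytes, long heapLimitBytes) {
        return (double) worstCaseBytes(fileCount, largestEntryBytes) / heapLimitBytes;
    }
}
```

With an assumed 8 input files, each presenting a 100 MB entry at the same time,
the buffered entries alone would need 800 MB, about 78% of a 1G heap, before
counting any copies made while writing the output.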
A Constraint can limit the size of mutations during normal ingest. However,
there is no constraint or check on the size of key/values that may be bulk
loaded.
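A size check for bulk-loaded data might look something like the sketch below.
It does not use the Accumulo Constraint API or any real bulk-import hook: the
Entry class, the oversized method, and the size limit are all hypothetical
stand-ins meant only to show the shape of the check.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical pre-import scan that flags oversized key/values in a bulk file.
final class BulkEntrySizeCheck {
    /** Stand-in for one key/value pair read from a file about to be bulk loaded. */
    static final class Entry {
        final byte[] key;
        final byte[] value;

        Entry(byte[] key, byte[] value) {
            this.key = key;
            this.value = value;
        }

        long size() {
            return (long) key.length + value.length;
        }
    }

    /** Returns the indexes of entries whose key plus value size exceeds maxBytes. */
    static List<Integer> oversized(List<Entry> entries, long maxBytes) {
        List<Integer> bad = new ArrayList<>();
        for (int i = 0; i < entries.size(); i++) {
            if (entries.get(i).size() > maxBytes) {
                bad.add(i);
            }
        }
        return bad;
    }
}
```

A real check would stream the file rather than hold a list of entries, and
whether to reject the file or merely warn is a policy decision this sketch
leaves open.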
The tablet server should log the key extent (range) of a tablet prior to
attempting a major compaction.
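As an illustration of the kind of log line that would have made the failing
tablet easy to spot, here is a minimal sketch using java.util.logging. The
message format, the class name, and the parameters are assumptions made for
this example; they are not the tablet server's actual logging code.

```java
import java.util.logging.Logger;

// Hypothetical helper that logs a tablet's range before a major compaction.
final class CompactionLogSketch {
    private static final Logger LOG = Logger.getLogger(CompactionLogSketch.class.getName());

    /** Builds the message; a null row stands for an open end of the table. */
    static String startMessage(String tableId, String prevEndRow, String endRow, int fileCount) {
        String lo = prevEndRow == null ? "-inf" : prevEndRow;
        String hi = endRow == null ? "+inf" : endRow;
        return "Starting major compaction: table=" + tableId
                + " range=(" + lo + ", " + hi + "] files=" + fileCount;
    }

    /** Logs before any file is opened, so the line exists even if the merge kills the JVM. */
    static void logStart(String tableId, String prevEndRow, String endRow, int fileCount) {
        LOG.info(startMessage(tableId, prevEndRow, endRow, fileCount));
    }
}
```

If every server that dies has the same range as its last logged compaction,
the problem tablet is identified without further digging.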
Large key/values (those that approach a significant portion of the working
memory of the JVM) might need to go into a separate merge file, or might result
in multi-stage merges just to defend against an out-of-memory failure.
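One way to picture a multi-stage merge is a k-way merge with a cap on how many
inputs are read at once. The sketch below is an assumption-laden illustration,
not a proposed patch: sorted in-memory lists of strings stand in for sorted
files, and the names MultiStageMerge, mergeOnce, mergeInStages, and maxFanIn
are invented for this example.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical multi-stage merge: sorted lists stand in for sorted files.
final class MultiStageMerge {
    /** One cursor per input; it holds that input's next entry in memory. */
    private static final class Cursor implements Comparable<Cursor> {
        String head;
        final Iterator<String> rest;

        Cursor(Iterator<String> it) {
            this.rest = it;
            this.head = it.next();
        }

        @Override
        public int compareTo(Cursor other) {
            return head.compareTo(other.head);
        }
    }

    /** Plain k-way merge: buffers exactly one pending entry per input. */
    static List<String> mergeOnce(List<List<String>> inputs) {
        PriorityQueue<Cursor> heap = new PriorityQueue<>();
        for (List<String> in : inputs) {
            if (!in.isEmpty()) {
                heap.add(new Cursor(in.iterator()));
            }
        }
        List<String> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();
            out.add(c.head);
            if (c.rest.hasNext()) {
                c.head = c.rest.next(); // safe: c is outside the heap here
                heap.add(c);
            }
        }
        return out;
    }

    /** Merges in passes so that no pass reads more than maxFanIn inputs at once. */
    static List<String> mergeInStages(List<List<String>> inputs, int maxFanIn) {
        if (maxFanIn < 2) {
            throw new IllegalArgumentException("maxFanIn must be at least 2");
        }
        List<List<String>> current = new ArrayList<>(inputs);
        while (current.size() > 1) {
            List<List<String>> next = new ArrayList<>();
            for (int i = 0; i < current.size(); i += maxFanIn) {
                int end = Math.min(i + maxFanIn, current.size());
                next.add(mergeOnce(current.subList(i, end)));
            }
            current = next;
        }
        return current.isEmpty() ? new ArrayList<>() : current.get(0);
    }
}
```

Capping the fan-in bounds the number of large entries buffered at the same
time, at the cost of rewriting data in intermediate passes. With real files,
each intermediate result would be a temporary file rather than a list.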
Tablet servers could mark tablets during a major compaction attempt. Tablets
with multiple markers could use a multi-pass merge to attempt to survive the
merge. Alternatively, the master could refuse to assign tablets with too many
markers.
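The marker idea could be modeled roughly as below. Everything here is
hypothetical: the class name, the threshold of two markers for switching to a
multi-pass merge, and the use of an in-memory map. In practice the markers
would have to live somewhere that survives the tablet server's death, which
this sketch does not attempt.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical bookkeeping for compaction-attempt markers, keyed by tablet.
// The in-memory map stands in for durable storage that outlives a crash.
final class CompactionMarkers {
    private final Map<String, Integer> attempts = new HashMap<>();
    private final int maxMarkers;

    CompactionMarkers(int maxMarkers) {
        this.maxMarkers = maxMarkers;
    }

    /** Adds a marker before the compaction starts; returns the new count. */
    int markAttempt(String tablet) {
        return attempts.merge(tablet, 1, Integer::sum);
    }

    /** Removes all markers once a compaction of the tablet succeeds. */
    void clear(String tablet) {
        attempts.remove(tablet);
    }

    int markers(String tablet) {
        return attempts.getOrDefault(tablet, 0);
    }

    /** Assumed policy: two or more markers means earlier attempts did not finish. */
    boolean useMultiPass(String tablet) {
        return markers(tablet) >= 2;
    }

    /** Assumed policy: the master refuses assignment at maxMarkers or more. */
    boolean mayAssign(String tablet) {
        return markers(tablet) < maxMarkers;
    }
}
```

A marker that is never cleared means the previous attempt did not finish, which
is the signal both policies rely on.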