Moving discussion to JIRA: ACCUMULO-3710
<https://issues.apache.org/jira/browse/ACCUMULO-3710>
~Dylan
On Fri, Apr 3, 2015 at 12:09 AM, Dylan Hutchison <dhutc...@mit.edu> wrote:
Yes, definitely OOME. My friend Eric crashed Accumulo again and we
saw this in tserver_localhost.out:
#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p"
# Executing /bin/sh -c "kill -9 12833"...
On Thu, Apr 2, 2015 at 11:49 PM, Dylan Hutchison <dhutc...@mit.edu> wrote:
I think it is an OOME. Here's the debug log file, showing a clear
descent from 189MB of free memory to 52kB before I manually restarted
the tserver 4 minutes later. It looks like I lost the .err files for
now; I would need to reproduce the crash to get them again.
2015-03-26 08:34:01,242 [tserver.TabletServer] DEBUG: gc ParNew=26.24(+0.01) secs ConcurrentMarkSweep=0.13(+0.00) secs *freemem=189,300,488(-330,224)* totalmem=259,522,560
2015-03-26 08:34:01,549 [tserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:55823 6r 374,161 entries in 2.98 secs, nbTimes = [1 69 3.27 375]
2015-03-26 08:34:01,842 [Audit ] INFO : operation: permitted; user: root; client: 127.0.0.1:55823;
2015-03-26 08:34:01,842 [Audit ] INFO : operation: permitted; user: root; client: 127.0.0.1:55823;
2015-03-26 08:34:01,844 [tserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:55823 !0 5 entries in 0.00 secs, nbTimes = [1 1 1.00 1]
2015-03-26 08:34:03,034 [tserver.TabletServer] DEBUG: Got getScans message from user: !SYSTEM
2015-03-26 08:34:03,091 [tserver.TabletServer] DEBUG: MultiScanSess 127.0.0.1:38998 2 entries in 0.00 secs (lookup_time:0.00 secs tablets:1 ranges:1)
2015-03-26 08:34:04,507 [tserver.TabletServer] DEBUG: gc ParNew=26.38(+0.14) secs ConcurrentMarkSweep=0.99(+0.86) secs *freemem=44,246,264(-145,384,448)* totalmem=259,522,560
2015-03-26 08:34:05,963 [tserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:55017 !0 0 entries in 0.00 secs, nbTimes = [2 2 2.00 1]
2015-03-26 08:34:05,966 [tserver.TabletServer] DEBUG: gc ParNew=26.38(+0.00) secs ConcurrentMarkSweep=2.25(+1.26) secs *freemem=6,657,016(-182,973,696)* totalmem=259,522,560
2015-03-26 08:34:07,549 [tserver.TabletServer] DEBUG: gc ParNew=26.38(+0.00) secs ConcurrentMarkSweep=3.73(+1.48) secs *freemem=439,152(-189,191,560)* totalmem=259,522,560
2015-03-26 08:34:08,284 [tserver.TabletServer] DEBUG: Got getScans message from user: !SYSTEM
*2015-03-26 08:34:10,469 [tserver.TabletServer] WARN : Running low on memory*
2015-03-26 08:34:10,470 [tserver.TabletServer] DEBUG: gc ParNew=26.38(+0.00) secs ConcurrentMarkSweep=6.63(+2.90) secs *freemem=52,816(-189,577,896)* totalmem=259,522,560
2015-03-26 08:34:14,623 [tserver.TabletServer] DEBUG: Got getScans message from user: !SYSTEM
2015-03-26 08:34:17,382 [tserver.TabletServer] DEBUG: ScanSess tid 127.0.0.1:55017 !0 0 entries in 5.04 secs, nbTimes = [4,972 4,972 4,972.00 1]
2015-03-26 08:34:24,674 [tserver.TabletServer] DEBUG: Got getScans message from user: !SYSTEM
2015-03-26 08:34:35,716 [cache.LruBlockCache] DEBUG: Cache Stats: Sizes: Total=23.286858MB (24418040), Free=6.7131424MB (7039240), Max=30.0MB (31457280), Counts: Blocks=7750, Access=125628, Hit=102578, Miss=23050, Evictions=25, Evicted=15299, Ratios: Hit Ratio=81.65218234062195%, Miss Ratio=18.34782063961029%, Evicted/Run=611.9600219726562, Duplicate Reads=1
*2015-03-26 08:38:37,256 [server.Accumulo] INFO : tserver starting*
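
For anyone who wants to track the same decline in their own debug log,
here is a minimal sketch that pulls the freemem figure out of each gc
line. It assumes the line format shown in the excerpt above; the class
name FreeMemParser and the method freeBytes are hypothetical and not
part of Accumulo.

```java
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FreeMemParser {
    // Matches the free-heap figure as printed in the excerpt,
    // e.g. "freemem=52,816(-189,577,896)"; only the first number is captured.
    private static final Pattern FREEMEM = Pattern.compile("freemem=(\\d[\\d,]*)");

    /** Returns the free-heap byte count from one log line, or empty if the line has none. */
    static OptionalLong freeBytes(String logLine) {
        Matcher m = FREEMEM.matcher(logLine);
        if (!m.find()) {
            return OptionalLong.empty();
        }
        // Drop the thousands separators before parsing.
        return OptionalLong.of(Long.parseLong(m.group(1).replace(",", "")));
    }

    public static void main(String[] args) {
        String[] lines = {
            "DEBUG: gc ParNew=26.24(+0.01) secs ConcurrentMarkSweep=0.13(+0.00) secs freemem=189,300,488(-330,224) totalmem=259,522,560",
            "DEBUG: Got getScans message from user: !SYSTEM",
            "DEBUG: gc ParNew=26.38(+0.00) secs ConcurrentMarkSweep=6.63(+2.90) secs freemem=52,816(-189,577,896) totalmem=259,522,560"
        };
        for (String line : lines) {
            // Lines without a freemem figure are skipped silently.
            freeBytes(line).ifPresent(bytes -> System.out.println(bytes + " bytes free"));
        }
    }
}
```

Run over the three sample lines, this prints 189300488 and then 52816
bytes free, skipping the getScans line.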
On Thu, Apr 2, 2015 at 6:34 PM, Josh Elser <josh.el...@gmail.com> wrote:
That seems perfectly reasonable to me. I'm surprised to
hear the tserver crashed.
Taking a quick glance at the code, it looks like this would
be a good place to do some optimization in the
BatchScanner's impl (TabletServerBatchReaderImpl). The
BatchScanner will bin the ranges to the tablets and the
servers hosting those tablets. Normally the ranges would be
spread across servers, but in your single-server case all 1M
rows would go to a single TabletServer in one RPC call.
I'm guessing a good optimization here would be to check the
size of a batch of Ranges for a single tabletserver, and
when above a certain threshold, split the batch in half and
try to reprocess each half (the recursion would naturally
keep splitting until we get down to some high-watermark).
Point being, if your client VM constructed the Ranges
without issue, the BatchScanner impl should be smart enough
to not knock over a TabletServer.
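
A minimal sketch of that halving idea, written against plain
java.util lists rather than the actual TabletServerBatchReaderImpl
code. The class name BatchSplitter, the method splitUntilSmall, and
the 100,000 threshold are all hypothetical; in a real change the
elements would be the Ranges binned to one tabletserver, and the
threshold would need tuning.

```java
import java.util.ArrayList;
import java.util.List;

final class BatchSplitter {
    /** Splits batch into order-preserving pieces of at most maxSize elements by repeated halving. */
    static <T> List<List<T>> splitUntilSmall(List<T> batch, int maxSize) {
        if (maxSize < 1) {
            // A threshold below 1 could never be satisfied by a non-empty batch.
            throw new IllegalArgumentException("maxSize must be at least 1");
        }
        List<List<T>> out = new ArrayList<>();
        split(batch, maxSize, out);
        return out;
    }

    private static <T> void split(List<T> batch, int maxSize, List<List<T>> out) {
        if (batch.size() <= maxSize) {
            // Small enough: keep this piece as one request's worth of work.
            if (!batch.isEmpty()) {
                out.add(new ArrayList<>(batch));
            }
            return;
        }
        // Too large: cut in half and reprocess each half.
        int mid = batch.size() / 2;
        split(batch.subList(0, mid), maxSize, out);
        split(batch.subList(mid, batch.size()), maxSize, out);
    }

    public static void main(String[] args) {
        // Integers stand in for the 1M Range objects from the original report.
        List<Integer> ranges = new ArrayList<>();
        for (int i = 0; i < 1_000_000; i++) {
            ranges.add(i);
        }
        List<List<Integer>> pieces = splitUntilSmall(ranges, 100_000);
        System.out.println(pieces.size() + " pieces, first has " + pieces.get(0).size());
    }
}
```

With 1M elements and a threshold of 100,000 the recursion halves four
times, so this prints "16 pieces, first has 62500". Halving tends to
overshoot the threshold, which is the trade-off against cutting
fixed-size chunks.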
What was the reason the tserver died? OOME? Was there
anything at the end of the log files or in the .out/.err files?
- Josh
Dylan Hutchison wrote:
A friend of mine has a use case where he wants to scan ~1M individual
rows, scattered across a ~15GB table. He performed the following:
1. Gather a List of Range objects, each one a singleton range spanning an entire row.
2. Create a BatchScanner with one read thread.
3. Set the ranges via BatchScanner.setRanges()
4. Start iterating through the scanner.
Performing these steps crashed the TabletServer for my friend (haven't
had time to verify it myself yet). We're using a single-node standalone
1.6.1 Accumulo instance.
Is this a bad way to use Accumulo? I advised my friend to batch the
reads into groups of ~10k ranges and see if that helps. I wanted to
check with the community and see if we're doing something weird. If
this should have worked, I can try to put together a test case that
reproduces it by creating a table with many entries and then scanning
with many ranges.
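
For reference, the ~10k batching workaround could look roughly like
the sketch below. It uses plain java.util lists so it runs without an
Accumulo instance; the class name RangeBatches and the method
partition are hypothetical. In the real client each batch would be a
List of Range objects passed to BatchScanner.setRanges() and iterated
to completion before moving on to the next batch.

```java
import java.util.ArrayList;
import java.util.List;

final class RangeBatches {
    /** Splits items into consecutive batches of at most batchSize elements. */
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            // Copy the view so each batch is independent of the source list.
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        // Strings stand in for the row IDs that would each become a singleton Range.
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < 25_000; i++) {
            rows.add("row" + i);
        }
        List<List<String>> batches = partition(rows, 10_000);
        // Assumed usage: for each batch, call scanner.setRanges(batch)
        // and read all results before starting the next batch.
        System.out.println(batches.size() + " batches");
    }
}
```

With 25,000 rows and a batch size of 10,000 this prints "3 batches"
(10,000, 10,000, and 5,000 rows).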
Thanks,
Dylan Hutchison