Thanks! I didn't even have to ask :)

Dylan Hutchison wrote:
Moving discussion to JIRA: ACCUMULO-3710
<https://issues.apache.org/jira/browse/ACCUMULO-3710>
~Dylan


On Fri, Apr 3, 2015 at 12:09 AM, Dylan Hutchison <dhutc...@mit.edu
<mailto:dhutc...@mit.edu>> wrote:

    Yes, definitely OOME.  My friend Eric crashed Accumulo again and we
    saw this in tserver_localhost.out:

    #
    # java.lang.OutOfMemoryError: Java heap space
    # -XX:OnOutOfMemoryError="kill -9 %p"
    #   Executing /bin/sh -c "kill -9 12833"...


    On Thu, Apr 2, 2015 at 11:49 PM, Dylan Hutchison <dhutc...@mit.edu
    <mailto:dhutc...@mit.edu>> wrote:

        I think it is an OOME.  Here's the debug log file, showing a
        clear descent from 189MB free to 52kB free memory before
        manually restarting the tserver 4 minutes later.  Looks like I
        lost the .err files for now; would need to reproduce the crash
        to get them again.

        2015-03-26 08:34:01,242 [tserver.TabletServer] DEBUG: gc
        ParNew=26.24(+0.01) secs ConcurrentMarkSweep=0.13(+0.00)
        secs *freemem=189,300,488(-330,224) *totalmem=259,522,560
        2015-03-26 08:34:01,549 [tserver.TabletServer] DEBUG: ScanSess
        tid 127.0.0.1:55823 6r 374,161 entries
        in 2.98 secs, nbTimes = [1 69 3.27 375]
        2015-03-26 08:34:01,842 [Audit   ] INFO : operation: permitted;
        user: root; client: 127.0.0.1:55823;
        2015-03-26 08:34:01,842 [Audit   ] INFO : operation: permitted;
        user: root; client: 127.0.0.1:55823;
        2015-03-26 08:34:01,844 [tserver.TabletServer] DEBUG: ScanSess
        tid 127.0.0.1:55823 !0 5 entries in
        0.00 secs, nbTimes = [1 1 1.00 1]
        2015-03-26 08:34:03,034 [tserver.TabletServer] DEBUG: Got
        getScans message from user: !SYSTEM
        2015-03-26 08:34:03,091 [tserver.TabletServer] DEBUG:
        MultiScanSess 127.0.0.1:38998 2 entries
        in 0.00 secs (lookup_time:0.00 secs tablets:1 ranges:1)
        2015-03-26 08:34:04,507 [tserver.TabletServer] DEBUG: gc
        ParNew=26.38(+0.14) secs ConcurrentMarkSweep=0.99(+0.86) secs
        *freemem=44,246,264(-145,384,448) *totalmem=259,522,560
        2015-03-26 08:34:05,963 [tserver.TabletServer] DEBUG: ScanSess
        tid 127.0.0.1:55017 !0 0 entries in
        0.00 secs, nbTimes = [2 2 2.00 1]
        2015-03-26 08:34:05,966 [tserver.TabletServer] DEBUG: gc
        ParNew=26.38(+0.00) secs ConcurrentMarkSweep=2.25(+1.26) secs
        *freemem=6,657,016(-182,973,696) *totalmem=259,522,560
        2015-03-26 08:34:07,549 [tserver.TabletServer] DEBUG: gc
        ParNew=26.38(+0.00) secs ConcurrentMarkSweep=3.73(+1.48) secs
        *freemem=439,152(-189,191,560) *totalmem=259,522,560
        2015-03-26 08:34:08,284 [tserver.TabletServer] DEBUG: Got
        getScans message from user: !SYSTEM
        *2015-03-26 08:34:10,469 [tserver.TabletServer] WARN : Running
        low on memory*
        2015-03-26 08:34:10,470 [tserver.TabletServer] DEBUG: gc
        ParNew=26.38(+0.00) secs ConcurrentMarkSweep=6.63(+2.90) secs
        *freemem=52,816(-189,577,896) *totalmem=259,522,560
        2015-03-26 08:34:14,623 [tserver.TabletServer] DEBUG: Got
        getScans message from user: !SYSTEM
        2015-03-26 08:34:17,382 [tserver.TabletServer] DEBUG: ScanSess
        tid 127.0.0.1:55017 !0 0 entries in
        5.04 secs, nbTimes = [4,972 4,972 4,972.00 1]
        2015-03-26 08:34:24,674 [tserver.TabletServer] DEBUG: Got
        getScans message from user: !SYSTEM
        2015-03-26 08:34:35,716 [cache.LruBlockCache] DEBUG: Cache
        Stats: Sizes: Total=23.286858MB (24418040), Free=6.7131424MB
        (7039240), Max=30.0MB (31457280), Counts: Blocks=7750,
        Access=125628, Hit=102578, Miss=23050, Evictions=25,
        Evicted=15299, Ratios: Hit Ratio=81.65218234062195%, Miss
        Ratio=18.34782063961029%, Evicted/Run=611.9600219726562,
        Duplicate Reads=1
        *2015-03-26 08:38:37,256 [server.Accumulo] INFO : tserver starting*





        On Thu, Apr 2, 2015 at 6:34 PM, Josh Elser <josh.el...@gmail.com
        <mailto:josh.el...@gmail.com>> wrote:

            That seems perfectly reasonable to me. I'm surprised to
            hear the tserver crashed.

            Taking a quick glance at the code, it looks like this would
            be a good place to do some optimization in the
            BatchScanner's impl (TabletServerBatchReaderImpl). The
            BatchScanner will bin the ranges to the tablets and the
            servers hosting those tablets. Normally, this would be
            spread out, but in your single-server case, all 1M rows
            would go to a single TabletServer in one RPC call.

            I'm guessing a good optimization here would be to check the
            size of a batch of Ranges for a single tabletserver, and
            when above a certain threshold, split the batch in half and
            try to reprocess each half (the recursion would naturally
            keep splitting until we get down to some high-watermark).
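
The split-in-half idea described above can be sketched in plain Java. This is only an illustration under stated assumptions, not the actual TabletServerBatchReader code: the class name BatchSplitter, the method split, and the maxBatchSize threshold are all hypothetical, and the element type is left generic (in Accumulo it would be a Range binned to one tablet server).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Accumulo: recursively halves an oversized
// batch until every piece is at or below a size threshold.
final class BatchSplitter {
    private BatchSplitter() {}

    // Returns the pieces in their original order; each has at most maxBatchSize items.
    static <T> List<List<T>> split(List<T> batch, int maxBatchSize) {
        if (maxBatchSize <= 0) {
            throw new IllegalArgumentException("maxBatchSize must be positive");
        }
        List<List<T>> out = new ArrayList<>();
        splitInto(batch, maxBatchSize, out);
        return out;
    }

    private static <T> void splitInto(List<T> batch, int maxBatchSize, List<List<T>> out) {
        if (batch.isEmpty()) {
            return; // nothing to send
        }
        if (batch.size() <= maxBatchSize) {
            out.add(new ArrayList<>(batch)); // small enough: emit as one piece
            return;
        }
        // Too large: split in half and reprocess each half.
        int mid = batch.size() / 2;
        splitInto(batch.subList(0, mid), maxBatchSize, out);
        splitInto(batch.subList(mid, batch.size()), maxBatchSize, out);
    }
}
```

Each resulting piece would then go out as its own request instead of one very large RPC. A real implementation would probably also want to bound the total bytes per request, not just the range count.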

            Point being, if your client VM constructed the Ranges
            without issue, the BatchScanner impl should be smart enough
            to not knock over a TabletServer.

            What was the reason the tserver died? OOME? Was there
            anything at the end of the log files or in the .out/.err files?

            - Josh


            Dylan Hutchison wrote:

                A friend of mine has a use case where he wants to scan
                ~1M individual rows, scattered across a ~15GB table.  He
                performed the following:

                1. Gather a List of Range objects, each one a singleton
                   range spanning an entire row.
                2. Create a BatchScanner with one read thread.
                3. Set the ranges via BatchScanner.setRanges()
                4. Start iterating through the scanner.

                Performing these steps crashed the TabletServer for my
                friend (haven't had time to verify it myself yet). We're
                using a single-node standalone 1.6.1 Accumulo instance.

                Is this a bad way to use Accumulo?  I advised my friend
                to batch the reads into groups of ~10k ranges and see if
                that helps.  I wanted to check with the community and see
                if we're doing something weird.  If the behavior should
                have worked, I can try to put together a test case
                reproducing it, that creates a table with many entries
                and then scans with many ranges.
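
The client-side workaround mentioned above (batching the reads into groups of ~10k ranges) could look roughly like the following. It is a minimal sketch under assumptions: RangeChunker and chunk are hypothetical names, and the element type is generic so the block compiles without Accumulo on the classpath. With Accumulo, T would be org.apache.accumulo.core.data.Range, and each chunk would be passed to BatchScanner.setRanges() and fully iterated before moving on to the next chunk.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical client-side helper: cuts a long list of ranges into
// fixed-size groups so each scan carries a bounded number of ranges.
final class RangeChunker {
    private RangeChunker() {}

    // Returns consecutive groups of at most chunkSize items, in input order.
    static <T> List<List<T>> chunk(List<T> items, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<List<T>> chunks = new ArrayList<>();
        for (int start = 0; start < items.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, items.size());
            // Copy the view so each chunk is independent of the source list.
            chunks.add(new ArrayList<>(items.subList(start, end)));
        }
        return chunks;
    }
}
```

The ~10k group size is just the figure suggested in this thread; a suitable value would depend on the tserver heap and on how large each Range is.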

                Thanks,
                Dylan Hutchison



