In short, yes. This is mitigated by the fact that the metadata table can be split into many tablets, so a single unreachable metadata tablet would not affect every table (Dave's solution helps here).
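
To make the split-metadata point concrete, here is a toy model in plain Java. It is not Accumulo code: the class, the method names, and the two-tablet layout are all made up for illustration, and it keys metadata by table name, which is a simplification of the real row layout. It also ignores any tablet locations a client may already have cached.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Set;
import java.util.TreeMap;

// Toy model only: not Accumulo code. All class and method names are hypothetical.
class MetadataSplitSketch {

    // Maps the first row of each (pretend) metadata tablet to the tserver hosting it.
    // A TreeMap lets us find the tablet covering a given row with floorEntry().
    static final TreeMap<String, String> METADATA_TABLETS = new TreeMap<>();
    static {
        METADATA_TABLETS.put("", "tserver-a");  // entries for tables sorting before "m"
        METADATA_TABLETS.put("m", "tserver-b"); // entries for tables from "m" onward
    }

    // Which tserver hosts the metadata tablet that describes this table?
    static String metadataHostFor(String tableName) {
        return METADATA_TABLETS.floorEntry(tableName).getValue();
    }

    // A table's tablets can be located only if its metadata host is reachable.
    static boolean canLocateTablets(String tableName, Set<String> unreachable) {
        return !unreachable.contains(metadataHostFor(tableName));
    }

    public static void main(String[] args) {
        Set<String> unreachable = Collections.singleton("tserver-b");
        for (String table : Arrays.asList("events", "metrics", "users")) {
            System.out.println(table + " locatable: "
                    + canLocateTablets(table, unreachable));
        }
    }
}
```

With tserver-b marked unreachable, only "events" stays locatable in this model, because its metadata lives in the tablet hosted on tserver-a. With a single, unsplit metadata tablet, every table would share the same fate.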

One possible solution that could be investigated is what HBase calls "Timeline-Consistent High Available Reads"[1]. Essentially, in addition to the read-write Tablet (as is currently the case), there would be one or more read-only copies of a Tablet. This helps mitigate the case where some data is unreachable due to TabletServer problems.

However, this idea does make me a little wary for use with the metadata table.
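
For anyone unfamiliar with the read-only-copy idea, the following is a rough sketch of the read path as I understand it: try the read-write copy first, fall back to a read-only copy, and tell the caller that the answer may be stale. This is plain Java, not HBase or Accumulo code, and every name in it is hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: neither HBase nor Accumulo code. All names are hypothetical.
class TimelineReadSketch {

    // A read result plus a flag saying whether it came from a read-only copy.
    static final class ReadResult {
        final String value;
        final boolean possiblyStale;

        ReadResult(String value, boolean possiblyStale) {
            this.value = value;
            this.possiblyStale = possiblyStale;
        }
    }

    // One copy of a tablet: it either answers or is unreachable.
    static final class TabletCopy {
        final Map<String, String> data = new HashMap<>();
        boolean reachable = true;

        String get(String row) {
            if (!reachable) {
                throw new IllegalStateException("tablet copy unreachable");
            }
            return data.get(row);
        }
    }

    // Prefer the read-write copy; otherwise use the first reachable read-only copy.
    static ReadResult read(String row, TabletCopy primary, List<TabletCopy> replicas) {
        try {
            return new ReadResult(primary.get(row), false);
        } catch (IllegalStateException primaryDown) {
            for (TabletCopy replica : replicas) {
                try {
                    // Replicas may lag behind the primary, so flag the result.
                    return new ReadResult(replica.get(row), true);
                } catch (IllegalStateException replicaDown) {
                    // This copy is also unreachable; try the next one.
                }
            }
            throw primaryDown; // No copy could answer.
        }
    }
}
```

The stale flag is the important part: the caller has to decide whether a possibly older answer is acceptable for the data being read.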

Trying to figure out what happened on that node and get you a solution would be my preferred path forward :)

[1] http://hbase.apache.org/book.html#arch.timelineconsistent.reads

Michael Moss wrote:
1.7.2 (client still 1.6.2).

I think it's an overall design issue, no? Serving metadata is a SPOF?

On Fri, Sep 9, 2016 at 10:41 AM, Christopher <[email protected]> wrote:

    What version of Accumulo? That could narrow down the search for
    potential known issues.

    On Fri, Sep 9, 2016 at 10:36 AM Michael Moss <[email protected]> wrote:

        Upon further internal discussion, it looks like the
        metadata/root tables are served from the tservers (not an HA
        master for example) and the one in question was serving it. It
        was unable to run MajC (compaction) for many hours leading up to
        the time where it couldn't service requests any longer, but it
        was still up, hosting tablets, just very slow or unable to
        respond. So all writes ended up timing out.

        If this condition is possible and there is a SPOF here, it'd be
        good to see what's on the roadmap to address it.

        On Fri, Sep 9, 2016 at 10:24 AM, <[email protected]> wrote:

            What was happening on that 1 tserver? Was it in garbage
            collection? Was it having network or O/S issues?

            
            ------------------------------------------------------------------------
            From: "Michael Moss (BLOOMBERG/ 731 LEX)" <[email protected]>
            To: [email protected]
            Sent: Friday, September 9, 2016 9:40:42 AM
            Subject: 1 of 20 TServers unresponsive/slow, all writes fail?


            Hi,

            We are starting to investigate an issue where 1 tserver was
            up, but became slow/unresponsive for several hours, yet all
            writes to our 20+ servers began to fail. We could see
            leading up to the failure that the writes were distributed
            among all of the tablet servers, so it wasn't a hotspot.
            Whenever we receive a MutationsRejectedException, we
            recreate the BatchWriter (ACCUMULO-2990). I'm digging into
            the TabletServerBatchWriter code, but any ideas what could
            cause this issue? Is there some sort of initialization or
            healthchecking that the client does where 1 server could
            impact all?

            Thanks.

            -Mike

            Caused by: org.apache.accumulo.core.client.TimedOutException: Servers timed out [pnj-bvlt-r4n03.abc.com:31113]
                at org.apache.accumulo.core.client.impl.TabletServerBatchWriter$TimeoutTracker.wroteNothing(TabletServerBatchWriter.java:177) ~[stormjar.jar:1.0]
                at org.apache.accumulo.core.client.impl.TabletServerBatchWriter$TimeoutTracker.errorOccured(TabletServerBatchWriter.java:182) ~[stormjar.jar:1.0]
                at org.apache.accumulo.core.client.impl.TabletServerBatchWriter$MutationWriter.sendMutationsToTabletServer(TabletServerBatchWriter.java:933) ~[stormjar.jar:1.0]
                at


