Adar Dembo created KUDU-1839:
--------------------------------

             Summary: DNS failure during tablet creation leads to an 
undeletable tablet
                 Key: KUDU-1839
                 URL: https://issues.apache.org/jira/browse/KUDU-1839
             Project: Kudu
          Issue Type: Bug
          Components: master, tablet
    Affects Versions: 1.2.0
            Reporter: Adar Dembo
            Priority: Critical


During a YCSB workload, two tservers died due to DNS resolution timeouts. For 
example: 

{noformat}
F0117 09:21:14.952937  8392 raft_consensus.cc:1985] Check failed: _s.ok() Bad 
status: Network error: Could not obtain a remote proxy to the peer.: Unable to 
resolve address 've0130.halxg.cloudera.com': Name or service not known
{noformat}

It's not clear why this happened; perhaps table creation places an inordinate 
strain on DNS due to concurrent resolution load from all the bootstrapping 
peers.

In any case, when these tservers were restarted, two tablets failed to 
bootstrap, both for the same reason. I'll focus on just one tablet from here on 
out to simplify troubleshooting:

{noformat}
E0117 15:35:45.567312 85124 ts_tablet_manager.cc:749] T 
8c167c441a7d44b8add737d13797e694 P 7425c65d80f54f2da0a85494a5eb3e68: Tablet 
failed to bootstrap: Not found: Unable to load Consensus metadata: 
/data/2/kudu/consensus-meta/8c167c441a7d44b8add737d13797e694: No such file or 
directory (error 2)
{noformat}

Eventually, the master decided to delete this tablet:

{noformat}
I0117 15:42:32.119601 85166 tablet_service.cc:672] Processing DeleteTablet for 
tablet 8c167c441a7d44b8add737d13797e694 with delete_type TABLET_DATA_TOMBSTONED 
(TS 7425c65d80f54f2da0a85494a5eb3e68 not found in new config with opid_index 
29) from {real_user=kudu} at 10.17.236.18:42153
I0117 15:42:32.139128 85166 tablet_service.cc:672] Processing DeleteTablet for 
tablet 8c167c441a7d44b8add737d13797e694 with delete_type TABLET_DATA_TOMBSTONED 
(TS 7425c65d80f54f2da0a85494a5eb3e68 not found in new config with opid_index 
29) from {real_user=kudu} at 10.17.236.18:42153
I0117 15:42:32.181843 85166 tablet_service.cc:672] Processing DeleteTablet for 
tablet 8c167c441a7d44b8add737d13797e694 with delete_type TABLET_DATA_TOMBSTONED 
(TS 7425c65d80f54f2da0a85494a5eb3e68 not found in new config with opid_index 
29) from {real_user=kudu} at 10.17.236.18:42153
I0117 15:42:32.276289 85166 tablet_service.cc:672] Processing DeleteTablet for 
tablet 8c167c441a7d44b8add737d13797e694 with delete_type TABLET_DATA_TOMBSTONED 
(TS 7425c65d80f54f2da0a85494a5eb3e68 not found in new config with opid_index 
29) from {real_user=kudu} at 10.17.236.18:42153
{noformat}

As the repeated deletion requests show, each attempt failed. It's annoying 
that the tserver didn't log why, but the master did:

{noformat}
I0117 15:42:32.117022 33903 catalog_manager.cc:2758] Sending 
DeleteTablet(TABLET_DATA_TOMBSTONED) for tablet 
8c167c441a7d44b8add737d13797e694 on 7425c65d80f54f2da0a85494a5eb3e68 
(ve0122.halxg.cloudera.com:7050) (TS 7425c65d80f54f2da0a85494a5eb3e68 not found 
in new config with opid_index 29)
W0117 15:42:32.117463 33890 catalog_manager.cc:2725] TS 
7425c65d80f54f2da0a85494a5eb3e68 (ve0122.halxg.cloudera.com:7050): delete 
failed for tablet 8c167c441a7d44b8add737d13797e694 with error code 
TABLET_NOT_RUNNING: Illegal state: Consensus not available. Tablet shutting down
I0117 15:42:32.117491 33890 catalog_manager.cc:2522] Scheduling retry of 
8c167c441a7d44b8add737d13797e694 Delete Tablet RPC for 
TS=7425c65d80f54f2da0a85494a5eb3e68 with a delay of 19ms (attempt = 1)...
{noformat}

This isn't a fatal error as far as the master is concerned, so it retries the 
deletion forever.

Meanwhile, the broken replica of this tablet still appears to be part of the 
replication group. At least, that's true as far as both the master web UI and 
the tserver web UI are concerned. The leader tserver is logging this error 
repeatedly:

{noformat}
W0117 16:38:04.797828 81809 consensus_peers.cc:329] T 
8c167c441a7d44b8add737d13797e694 P 335d132897de4bdb9b87443f2c487a42 -> Peer 
7425c65d80f54f2da0a85494a5eb3e68 (ve0122.halxg.cloudera.com:7050): Couldn't 
send request to peer 7425c65d80f54f2da0a85494a5eb3e68 for tablet 
8c167c441a7d44b8add737d13797e694. Error code: TABLET_NOT_RUNNING (12). Status: 
Illegal state: Tablet not RUNNING: FAILED: Not found: Unable to load Consensus 
metadata: /data/2/kudu/consensus-meta/8c167c441a7d44b8add737d13797e694: No such 
file or directory (error 2). Retrying in the next heartbeat period. Already 
tried 6666 times.
{noformat}

It's not clear to me exactly what state the replication group is in. The master 
did issue an AddServer request:

{noformat}
I0117 15:42:32.117065 33903 catalog_manager.cc:3069] Started AddServer task for 
tablet 8c167c441a7d44b8add737d13797e694
{noformat}

But the leader of the tablet still thinks the broken replica is in the 
replication group. So is this a tablet with two healthy replicas and one 
broken one that can't recover? Maybe.

So several things are broken here:
# Table creation probably created a DNS resolution storm.
# A failure in DNS resolution is not retried, and it led to tserver death.
# On bootstrap, this replica was detected as having a tablet-meta file but no 
consensus-meta, and was set aside as corrupt (good). But the lack of a 
consensus-meta means there's no consensus state and so the tserver cannot 
perform an "atomic delete" as requested by the master. Must we manually delete 
this replica? Or should the master be able to force the issue?
# The tserver did not log the tablet deletion failure.
# The master retried the deletion in perpetuity.
# Re-replication of this tablet by the leader appears to be broken.
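On item 2, the fatal {{Check failed}} in raft_consensus.cc could plausibly be replaced by a bounded resolution retry. A minimal sketch, assuming an injectable resolver; the {{Resolver}} and {{RetryResolve}} names are hypothetical and not Kudu's actual DNS code:

```cpp
#include <chrono>
#include <functional>
#include <string>
#include <thread>

// Hypothetical sketch: retry host resolution with exponential backoff
// instead of CHECK-failing on the first transient DNS error.
// The resolver is injected so transient failures can be simulated.
using Resolver = std::function<bool(const std::string& host)>;

bool RetryResolve(const Resolver& resolve, const std::string& host,
                  int max_attempts, std::chrono::milliseconds base_delay) {
  for (int attempt = 1; attempt <= max_attempts; ++attempt) {
    if (resolve(host)) {
      return true;  // resolution succeeded
    }
    if (attempt == max_attempts) {
      break;  // budget exhausted; give up without crashing
    }
    // Back off exponentially: base, 2*base, 4*base, ...
    std::this_thread::sleep_for(base_delay * (1 << (attempt - 1)));
  }
  return false;  // caller surfaces a Status error instead of CHECK-failing
}
```

The point of the sketch is only that a transient "Name or service not known" becomes a retriable error surfaced to the caller, rather than a process-fatal check failure.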

I think at least some of these issues are tracked in other JIRAs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
