[ https://issues.apache.org/jira/browse/KUDU-1839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jean-Daniel Cryans updated KUDU-1839:
-------------------------------------
    Priority: Major  (was: Critical)

> DNS failure during tablet creation leads to undeletable tablet
> ---------------------------------------------------------------
>
>                 Key: KUDU-1839
>                 URL: https://issues.apache.org/jira/browse/KUDU-1839
>             Project: Kudu
>          Issue Type: Bug
>          Components: master, tablet
>    Affects Versions: 1.2.0
>            Reporter: Adar Dembo
>
> During a YCSB workload, two tservers died due to DNS resolution timeouts. For example:
> {noformat}
> F0117 09:21:14.952937 8392 raft_consensus.cc:1985] Check failed: _s.ok() Bad status: Network error: Could not obtain a remote proxy to the peer.: Unable to resolve address 've0130.halxg.cloudera.com': Name or service not known
> {noformat}
> It's not clear why this happened; perhaps table creation places an inordinate strain on DNS due to concurrent resolution load from all the bootstrapping peers.
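> The fatal log above comes from treating a single failed lookup as fatal. Purely as an illustration (this is not Kudu's actual code; the helper name and its parameters are hypothetical), resolution could be retried with a short backoff before an error is surfaced to the caller, rather than crashing the process:
> {noformat}
> // Illustrative sketch, not Kudu code: retry a transient DNS failure with
> // exponential backoff instead of treating the first failure as fatal.
> #include <netdb.h>       // getaddrinfo(), freeaddrinfo(), gai_strerror()
> #include <sys/socket.h>  // AF_UNSPEC, SOCK_STREAM
> #include <chrono>
> #include <cstdio>
> #include <string>
> #include <thread>
>
> // Resolve 'host', retrying up to 'max_attempts' times. Returns true once
> // resolution succeeds, false when the retry budget runs out.
> bool ResolveWithRetry(const std::string& host, int max_attempts) {
>   std::chrono::milliseconds delay(20);
>   for (int attempt = 1; attempt <= max_attempts; ++attempt) {
>     addrinfo hints = {};
>     hints.ai_family = AF_UNSPEC;
>     hints.ai_socktype = SOCK_STREAM;
>     addrinfo* result = nullptr;
>     int rc = getaddrinfo(host.c_str(), nullptr, &hints, &result);
>     if (rc == 0) {
>       freeaddrinfo(result);
>       return true;  // resolved; the caller can now build its proxy to the peer
>     }
>     std::fprintf(stderr, "attempt %d: could not resolve '%s': %s\n",
>                  attempt, host.c_str(), gai_strerror(rc));
>     if (attempt == max_attempts) break;
>     std::this_thread::sleep_for(delay);
>     delay *= 2;  // back off before trying again
>   }
>   return false;  // report an error instead of crashing the tserver
> }
> {noformat}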
> In any case, when these tservers were restarted, two tablets failed to bootstrap, both for the same reason. I'll focus on just one tablet from here on out to simplify troubleshooting:
> {noformat}
> E0117 15:35:45.567312 85124 ts_tablet_manager.cc:749] T 8c167c441a7d44b8add737d13797e694 P 7425c65d80f54f2da0a85494a5eb3e68: Tablet failed to bootstrap: Not found: Unable to load Consensus metadata: /data/2/kudu/consensus-meta/8c167c441a7d44b8add737d13797e694: No such file or directory (error 2)
> {noformat}
> Eventually, the master decided to delete this tablet:
> {noformat}
> I0117 15:42:32.119601 85166 tablet_service.cc:672] Processing DeleteTablet for tablet 8c167c441a7d44b8add737d13797e694 with delete_type TABLET_DATA_TOMBSTONED (TS 7425c65d80f54f2da0a85494a5eb3e68 not found in new config with opid_index 29) from {real_user=kudu} at 10.17.236.18:42153
> I0117 15:42:32.139128 85166 tablet_service.cc:672] Processing DeleteTablet for tablet 8c167c441a7d44b8add737d13797e694 with delete_type TABLET_DATA_TOMBSTONED (TS 7425c65d80f54f2da0a85494a5eb3e68 not found in new config with opid_index 29) from {real_user=kudu} at 10.17.236.18:42153
> I0117 15:42:32.181843 85166 tablet_service.cc:672] Processing DeleteTablet for tablet 8c167c441a7d44b8add737d13797e694 with delete_type TABLET_DATA_TOMBSTONED (TS 7425c65d80f54f2da0a85494a5eb3e68 not found in new config with opid_index 29) from {real_user=kudu} at 10.17.236.18:42153
> I0117 15:42:32.276289 85166 tablet_service.cc:672] Processing DeleteTablet for tablet 8c167c441a7d44b8add737d13797e694 with delete_type TABLET_DATA_TOMBSTONED (TS 7425c65d80f54f2da0a85494a5eb3e68 not found in new config with opid_index 29) from {real_user=kudu} at 10.17.236.18:42153
> {noformat}
> As can be seen from the presence of multiple deletion requests, each one failed. It's annoying that the tserver didn't log why. But the master did:
> {noformat}
> I0117 15:42:32.117022 33903 catalog_manager.cc:2758] Sending DeleteTablet(TABLET_DATA_TOMBSTONED) for tablet 8c167c441a7d44b8add737d13797e694 on 7425c65d80f54f2da0a85494a5eb3e68 (ve0122.halxg.cloudera.com:7050) (TS 7425c65d80f54f2da0a85494a5eb3e68 not found in new config with opid_index 29)
> W0117 15:42:32.117463 33890 catalog_manager.cc:2725] TS 7425c65d80f54f2da0a85494a5eb3e68 (ve0122.halxg.cloudera.com:7050): delete failed for tablet 8c167c441a7d44b8add737d13797e694 with error code TABLET_NOT_RUNNING: Illegal state: Consensus not available. Tablet shutting down
> I0117 15:42:32.117491 33890 catalog_manager.cc:2522] Scheduling retry of 8c167c441a7d44b8add737d13797e694 Delete Tablet RPC for TS=7425c65d80f54f2da0a85494a5eb3e68 with a delay of 19ms (attempt = 1)...
> {noformat}
> This isn't a fatal error as far as the master is concerned, so it retries the deletion forever.
> Meanwhile, the broken replica of this tablet still appears to be part of the replication group. At least, that's true as far as both the master web UI and the tserver web UI are concerned. The leader tserver is logging this error repeatedly:
> {noformat}
> W0117 16:38:04.797828 81809 consensus_peers.cc:329] T 8c167c441a7d44b8add737d13797e694 P 335d132897de4bdb9b87443f2c487a42 -> Peer 7425c65d80f54f2da0a85494a5eb3e68 (ve0122.halxg.cloudera.com:7050): Couldn't send request to peer 7425c65d80f54f2da0a85494a5eb3e68 for tablet 8c167c441a7d44b8add737d13797e694. Error code: TABLET_NOT_RUNNING (12). Status: Illegal state: Tablet not RUNNING: FAILED: Not found: Unable to load Consensus metadata: /data/2/kudu/consensus-meta/8c167c441a7d44b8add737d13797e694: No such file or directory (error 2). Retrying in the next heartbeat period. Already tried 6666 times.
> {noformat}
> It's not clear to me exactly what state the replication group is in. The master did issue an AddServer request:
> {noformat}
> I0117 15:42:32.117065 33903 catalog_manager.cc:3069] Started AddServer task for tablet 8c167c441a7d44b8add737d13797e694
> {noformat}
> But the leader of the tablet still thinks the broken replica is in the replication group. So is it a tablet with two healthy replicas and one broken one, that can't recover? Maybe.
> So a couple of things are broken here:
> # Table creation probably created a DNS resolution storm.
> # Failure in DNS resolution is not retried, and led to tserver death.
> # On bootstrap, this replica was detected as having a tablet-meta file but no consensus-meta, and was set aside as corrupt (good). But the lack of a consensus-meta means there's no consensus state, and so the tserver cannot perform an "atomic delete" as requested by the master. Must we manually delete this replica? Or should the master be able to force the issue?
> # The tserver did not log the tablet deletion failure.
> # The master retried the deletion in perpetuity (see the bounded-retry sketch at the end of this description).
> # Re-replication of this tablet by the leader appears to be broken.
> I think at least some of these issues are tracked in other JIRAs.
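> On point 5: the master already backs off between attempts (19ms on attempt 1 in the log above), but it never gives up. As a rough sketch only (this is not the catalog manager's actual code; the helper name, the 60-second cap, and the retry budget are all made up here), a bounded retry loop might look like this:
> {noformat}
> // Illustrative sketch, not Kudu code: retry a failing DeleteTablet-style RPC
> // with exponential backoff, but stop and surface the failure after a bounded
> // number of attempts instead of retrying in perpetuity.
> #include <algorithm>
> #include <chrono>
> #include <cstdio>
> #include <functional>
> #include <thread>
>
> // Returns true if 'rpc' eventually succeeds; false once the retry budget is
> // spent, at which point the caller can flag the replica for operator attention.
> bool RetryWithBackoff(const std::function<bool()>& rpc, int max_attempts) {
>   std::chrono::milliseconds delay(19);  // first retry delay, as in the log above
>   const std::chrono::milliseconds kMaxDelay(60000);
>   for (int attempt = 1; attempt <= max_attempts; ++attempt) {
>     if (rpc()) {
>       return true;
>     }
>     if (attempt == max_attempts) {
>       break;  // budget exhausted: escalate rather than retry forever
>     }
>     std::fprintf(stderr, "attempt %d failed; retrying in %lld ms\n",
>                  attempt, static_cast<long long>(delay.count()));
>     std::this_thread::sleep_for(delay);
>     delay = std::min(delay * 2, kMaxDelay);  // back off, capped at one minute
>   }
>   return false;
> }
> {noformat}
> Whether the right escalation once the budget is spent is alerting an operator or forcing the delete is exactly the open question in point 3.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)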