[ https://issues.apache.org/jira/browse/KUDU-2152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Grant Henke updated KUDU-2152:
------------------------------
    Target Version/s: 1.8.0  (was: 1.7.0)

> Tablet stuck under-replicated after some kind of tablet copy issue
> ------------------------------------------------------------------
>
>                 Key: KUDU-2152
>                 URL: https://issues.apache.org/jira/browse/KUDU-2152
>             Project: Kudu
>          Issue Type: Bug
>          Components: consensus
>    Affects Versions: 1.5.0
>            Reporter: Todd Lipcon
>            Priority: Critical
>
> I was stress testing with the following setup:
> - 8 servers (n1-standard-4 GCE boxes)
> - created a bunch of 100-tablet tables using loadgen until I had ~2500 replicas on each server
> - mounted another server using sshfs and put cmeta on that mount point (to make cmeta writes slower)
> - stress -c4 on all machines
> - shut down a server and wait for re-replication (green ksck), restart the server, rinse, repeat
> Eventually I got a stuck tablet. ksck reports:
> {code}
> Tablet 271df8901d98442cb478593babd8a609 of table 'loadgen_auto_8e32cb07eb83458da4ec4d228bcb0f5a' is under-replicated: 1 replica(s) not RUNNING
>   20d4d86f182043398594b67492d13fdc (kudu513-8.gce.cloudera.com:7050): RUNNING [LEADER]
>   c2ea8f22f4034bcc97e26c9236811960 (kudu513-1.gce.cloudera.com:7050): bad state
>     State:       STOPPED
>     Data state:  TABLET_DATA_COPYING
>     Last status: Deleted tablet blocks from disk
>   cd0997b908ad41839f56a1b61210f2d4 (kudu513-3.gce.cloudera.com:7050): RUNNING
> 1 replicas' active configs differ from the master's.
>   All the peers reported by the master and tablet servers are:
>   A = 20d4d86f182043398594b67492d13fdc
>   D = 471027436ee8405ab7cdf8d22407696b
>   B = c2ea8f22f4034bcc97e26c9236811960
>   C = cd0997b908ad41839f56a1b61210f2d4
> The consensus matrix is:
>  Config source |      Voters      | Current term | Config index | Committed?
> ---------------+------------------+--------------+--------------+------------
>  master        | A* B C           |              |              | Yes
>  A             | A* B C           | 11           | 29           | Yes
>  B             | D B C            | 9            | 23           | Yes
>  C             | A* B C           | 11           | 29           | Yes
> {code}
> The leader ("A" above) just keeps reporting that it's failing to send requests to "B" because it's getting TABLET_NOT_RUNNING. So it never evicts it (the leader treats TABLET_NOT_RUNNING as a temporary condition, assuming that it actually means BOOTSTRAPPING).
> The last lines in "B"'s logs were:
> {code}
> I0920 16:41:48.556422 3808 tablet_copy_client.cc:209] T 271df8901d98442cb478593babd8a609 P c2ea8f22f4034bcc97e26c9236811960: tablet copy: Beginning tablet copy session from remote peer at address kudu513-8.gce.cloudera.com:7050
> I0920 16:41:48.562335 3808 ts_tablet_manager.cc:1118] T 271df8901d98442cb478593babd8a609 P c2ea8f22f4034bcc97e26c9236811960: Deleting tablet data with delete state TABLET_DATA_COPYING
> W0920 16:41:48.578610 3808 env_util.cc:277] Failed to determine if path is a directory: /data0/ts-data/tablet-meta/271df8901d98442cb478593babd8a609.kudutmp.2Tu0Uy: Not found: /data0/ts-data/tablet-meta/271df8901d98442cb478593babd8a609.kudutmp.2Tu0Uy: No such file or directory (error 2)
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)