[ https://issues.apache.org/jira/browse/KUDU-2453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
HeLifu updated KUDU-2453: ------------------------- Description: I have met this problem again on 2018/10/26. And now the kudu version is 1.7.2. ----------------------------------------------------- We modified the flag 'max_create_tablets_per_ts' (2000) of master.conf, and there are some load on the kudu cluster. Then someone else created a big table which had tens of thousands of tablets from impala-shell (that was a mistake). {code:java} CREATE TABLE XXX( ... PRIMARY KEY (...) ) PARTITION BY HASH (...) PARTITIONS 100, RANGE (...) ( PARTITION "2018-10-24" <= VALUES < "2018-10-24\000", PARTITION "2018-10-25" <= VALUES < "2018-10-25\000", ... PARTITION "2018-12-07" <= VALUES < "2018-12-07\000" ) STORED AS KUDU TBLPROPERTIES ('kudu.master_addresses'= '...'); {code} Here are the logs after creating table (only pick one tablet as example): {code:java} ------------------------------------------------------Kudu-master log ======e884bda6bbd3482f94c07ca0f34f99a4====== W1024 11:40:51.914397 180146 catalog_manager.cc:2664] TS 39f15fcf42ef45bba0c95a3223dc25ee (kudu2.lt.163.org:7050): Create Tablet RPC failed for tablet e884bda6bbd3482f94c07ca0f34f99a4: Remote error: Service unavailable: CreateTablet request on kudu.tserver.TabletServerAdminService from 10.120.219.118:50247 dropped due to backpressure. The service queue is full; it has 512 items. I1024 11:40:51.914412 180146 catalog_manager.cc:2700] Scheduling retry of CreateTablet RPC for tablet e884bda6bbd3482f94c07ca0f34f99a4 on TS 39f15fcf42ef45bba0c95a3223dc25ee with a delay of 42 ms (attempt = 1) ... ======Be replaced by 0b144c00f35d48cca4d4981698faef72====== W1024 11:41:22.114512 180202 catalog_manager.cc:3949] T 00000000000000000000000000000000 P f6c9a09da7ef4fc191cab6276b942ba3: Tablet e884bda6bbd3482f94c07ca0f34f99a4 (table quasi_realtime_user_feature [id=946d6dd03ec544eab96231e5a03bed59]) was not created within the allowed timeout. Replacing with a new tablet 0b144c00f35d48cca4d4981698faef72 ... I1024 11:41:22.391916 180202 catalog_manager.cc:3806] T 00000000000000000000000000000000 P f6c9a09da7ef4fc191cab6276b942ba3: Sending DeleteTablet for 3 replicas of tablet e884bda6bbd3482f94c07ca0f34f99a4 ... I1024 11:41:22.391927 180202 catalog_manager.cc:2922] Sending DeleteTablet(TABLET_DATA_DELETED) for tablet e884bda6bbd3482f94c07ca0f34f99a4 on 39f15fcf42ef45bba0c95a3223dc25ee (kudu2.lt.163.org:7050) (Replaced by 0b144c00f35d48cca4d4981698faef72 at 2018-10-24 11:41:22 CST) ... W1024 11:41:22.428129 180146 catalog_manager.cc:2892] TS 39f15fcf42ef45bba0c95a3223dc25ee (kudu2.lt.163.org:7050): delete failed for tablet e884bda6bbd3482f94c07ca0f34f99a4 with error code TABLET_NOT_RUNNING: Already present: State transition of tablet e884bda6bbd3482f94c07ca0f34f99a4 already in progress: creating tablet ... I1024 11:41:22.428143 180146 catalog_manager.cc:2700] Scheduling retry of e884bda6bbd3482f94c07ca0f34f99a4 Delete Tablet RPC for TS=39f15fcf42ef45bba0c95a3223dc25ee with a delay of 35 ms (attempt = 1) ... W1024 11:41:22.683702 180145 catalog_manager.cc:2664] TS b251540e606b4863bb576091ff961892 (kudu1.lt.163.org:7050): Create Tablet RPC failed for tablet 0b144c00f35d48cca4d4981698faef72: Remote error: Service unavailable: CreateTablet request on kudu.tserver.TabletServerAdminService from 10.120.219.118:59735 dropped due to backpressure. The service queue is full; it has 512 items. I1024 11:41:22.683717 180145 catalog_manager.cc:2700] Scheduling retry of CreateTablet RPC for tablet 0b144c00f35d48cca4d4981698faef72 on TS b251540e606b4863bb576091ff961892 with a delay of 46 ms (attempt = 1) ... ======Be replaced by c0e0acc448fc42fc9e48f5025b112a75====== W1024 11:41:52.775420 180202 catalog_manager.cc:3949] T 00000000000000000000000000000000 P f6c9a09da7ef4fc191cab6276b942ba3: Tablet 0b144c00f35d48cca4d4981698faef72 (table quasi_realtime_user_feature [id=946d6dd03ec544eab96231e5a03bed59]) was not created within the allowed timeout. Replacing with a new tablet c0e0acc448fc42fc9e48f5025b112a75 ... ------------------------------------------------------Kudu-tserver log I1024 11:40:52.014571 137358 tablet_service.cc:758] Processing CreateTablet for tablet e884bda6bbd3482f94c07ca0f34f99a4 (table=quasi_realtime_user_feature [id=946d6dd03ec544eab96231e5a03bed59]), partition=HASH (user_id) PARTITION 29, RANGE (dt) PARTITION "2018-11-10" <= VALUES < "2018-11-10\000" ... I1024 11:40:52.017539 137358 ts_tablet_manager.cc:1080] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee: Registered tablet (data state: TABLET_DATA_READY) ... I1024 11:41:22.392292 137355 tablet_service.cc:799] Processing DeleteTablet for tablet e884bda6bbd3482f94c07ca0f34f99a4 with delete_type TABLET_DATA_DELETED (Replaced by 0b144c00f35d48cca4d4981698faef72 at 2018-10-24 11:41:22 CST) from {username='kudu'} at 10.120.219.118:50247 ... I1024 12:03:31.079942 126376 ts_tablet_manager.cc:938] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee: Bootstrapping tablet I1024 12:03:31.079965 126376 tablet_bootstrap.cc:436] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee: Bootstrap starting. ... I1024 12:03:31.080237 126376 tablet_bootstrap.cc:581] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee: No blocks or log segments found. Creating new log. ... I1024 12:03:31.080834 126376 tablet_bootstrap.cc:436] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee: No bootstrap required, opened a new log I1024 12:03:31.080870 126376 ts_tablet_manager.cc:955] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee: Time spent bootstrapping tablet: real 0.001s user 0.004s sys 0.000s I1024 12:03:31.080987 137368 tablet_service.cc:799] Processing DeleteTablet for tablet 0ba3a2963c524d859a85ad5fbce5bf96 with delete_type TABLET_DATA_DELETED (Replaced by 539e0ab40ec1442a94b330c18521d27f at 2018-10-24 11:42:23 CST) from {username='kudu'} at 10.120.219.118:50247 I1024 12:03:31.080996 126376 raft_consensus.cc:304] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee [term 0 FOLLOWER]: Replica starting. Triggering 0 pending transactions. Active config: opid_index: -1 OBSOLETE_local: false peers { permanent_uuid: "39f15fcf42ef45bba0c95a3223dc25ee" member_type: VOTER last_known_addr { host: "kudu2.lt.163.org" port: 7050 } } peers { permanent_uuid: "cd75ddaabaa8487b9fe1feab9d2cba83" member_type: VOTER last_known_addr { host: "kudu3.lt.163.org" port: 7050 } } peers { permanent_uuid: "061221dd1c3e40a3a3338afc74bb66f5" member_type: VOTER last_known_addr { host: "kudu5.lt.163.org" port: 7050 } } I1024 12:03:31.081007 126376 raft_consensus.cc:330] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee [term 0 FOLLOWER]: Consensus starting up: Expiring failure detector timer to make a prompt election more likely I1024 12:03:31.081014 126376 raft_consensus.cc:605] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee [term 0 FOLLOWER]: Becoming Follower/Learner. State: Replica: 39f15fcf42ef45bba0c95a3223dc25ee, State: Initialized, Role: FOLLOWER I1024 12:03:31.081053 126376 consensus_queue.cc:226] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee [NON_LEADER]: Queue going to NON_LEADER mode. State: All replicated index: 0, Majority replicated index: 0, Committed index: 0, Last appended: 0.0, Last appended by leader: 0, Current term: 0, Majority size: -1, State: 0, Mode: NON_LEADER, active raft config: opid_index: -1 OBSOLETE_local: false peers { permanent_uuid: "39f15fcf42ef45bba0c95a3223dc25ee" member_type: VOTER last_known_addr { host: "kudu2.lt.163.org" port: 7050 } } peers { permanent_uuid: "cd75ddaabaa8487b9fe1feab9d2cba83" member_type: VOTER last_known_addr { host: "kudu3.lt.163.org" port: 7050 } } peers { permanent_uuid: "061221dd1c3e40a3a3338afc74bb66f5" member_type: VOTER last_known_addr { host: "kudu5.lt.163.org" port: 7050 } } ... I1024 12:03:35.315024 161474 raft_consensus.cc:436] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee [term 0 FOLLOWER]: Starting pre-election (no leader contacted us within the election timeout) I1024 12:03:35.315050 161474 raft_consensus.cc:2700] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee: Snoozing failure detection for 2.642s (starting election) I1024 12:03:35.315111 161474 raft_consensus.cc:458] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee [term 0 FOLLOWER]: Starting pre-election with config: opid_index: -1 OBSOLETE_local: false peers { permanent_uuid: "39f15fcf42ef45bba0c95a3223dc25ee" member_type: VOTER last_known_addr { host: "kudu2.lt.163.org" port: 7050 } } peers { permanent_uuid: "cd75ddaabaa8487b9fe1feab9d2cba83" member_type: VOTER last_known_addr { host: "kudu3.lt.163.org" port: 7050 } } peers { permanent_uuid: "061221dd1c3e40a3a3338afc74bb66f5" member_type: VOTER last_known_addr { host: "kudu5.lt.163.org" port: 7050 } } I1024 12:03:35.315141 137342 tablet_service.cc:799] Processing DeleteTablet for tablet c3f47046905e4c0f9ffe21a9e7a5bc8b with delete_type TABLET_DATA_DELETED (Replaced by 28f11e17ecdc45d2a4d72c74afc2a488 at 2018-10-24 11:45:27 CST) from {username='kudu'} at 10.120.219.118:50247 I1024 12:03:35.315763 161474 leader_election.cc:231] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee [CANDIDATE]: Term 1 pre-election: Requesting pre-vote from peer cd75ddaabaa8487b9fe1feab9d2cba83 I1024 12:03:35.315790 161474 leader_election.cc:231] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee [CANDIDATE]: Term 1 pre-election: Requesting pre-vote from peer 061221dd1c3e40a3a3338afc74bb66f5 W1024 12:03:35.315987 137266 leader_election.cc:293] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee [CANDIDATE]: Term 1 pre-election: Tablet error from VoteRequest() call to peer cd75ddaabaa8487b9fe1feab9d2cba83: Illegal state: must be running to vote when last-logged opid is not known W1024 12:03:35.316056 137267 leader_election.cc:293] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee [CANDIDATE]: Term 1 pre-election: Tablet error from VoteRequest() call to peer 061221dd1c3e40a3a3338afc74bb66f5: Illegal state: must be running to vote when last-logged opid is not known I1024 12:03:35.316074 137267 leader_election.cc:258] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee [CANDIDATE]: Term 1 pre-election: Election decided. Result: candidate lost. I1024 12:03:35.316102 161474 raft_consensus.cc:2700] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee: Snoozing failure detection for 4.310s (election complete) I1024 12:03:35.316115 161474 raft_consensus.cc:2455] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee [term 0 FOLLOWER]: Leader pre-election lost for term 1. Reason: could not achieve majority ... I1024 12:04:01.303092 126376 ts_tablet_manager.cc:983] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee: Time spent starting tablet: real 30.222s user 0.000s sys 0.000s W1024 12:04:01.303108 126376 ts_tablet_manager.cc:1006] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee: Tablet startup took 30222ms ... W1024 12:04:01.303225 126376 ts_tablet_manager.cc:1008] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee: Trace: 1024 11:40:51.956652 (+ 0us) service_pool.cc:163] Inserting onto call queue 1024 11:40:52.014539 (+ 57887us) service_pool.cc:222] Handling call 1024 11:40:52.014585 (+ 46us) ts_tablet_manager.cc:357] Acquired tablet manager lock 1024 11:40:52.014588 (+ 3us) ts_tablet_manager.cc:370] Creating new metadata... 1024 11:40:52.016532 (+ 1944us) tablet_metadata.cc:556] Metadata flushed 1024 11:40:52.017445 (+ 913us) tablet_replica.cc:143] Creating consensus instance 1024 11:40:52.017601 (+ 156us) inbound_call.cc:157] Queueing success response 1024 12:03:31.079950 (+1359062349us) ts_tablet_manager.cc:939] Bootstrapping tablet 1024 12:03:31.080879 (+ 929us) ts_tablet_manager.cc:984] Starting tablet replica 1024 12:03:31.080893 (+ 14us) tablet_replica.cc:195] Starting instrumentation 1024 12:03:31.080900 (+ 7us) tablet_replica.cc:207] Starting consensus Metrics: {"fdatasync":1,"fdatasync_us":238,"mutex_wait_us":30221902,"raft.queue_time_us":6,"raft.run_cpu_time_us":4,"raft.run_wall_time_us":5,"tablet-open.queue_time_us":1359062354} {code} {color:#ff0000}Tablet startup took 30+ seconds.{color} It was a long time for him to wait, so he did "ctrl+c". But we found that the tablets in 'INITIALIZED' status was growing rapidly, -half- an hour later it was 350,000:( {color:#ff0000}100(hash) * 45(range) * 3(RF) * (60(minute) * 60(second) / 30(repeat/second)) / 5(tservers) = 324000 (tablets/tserver).{color} We deleted this table by kudu client tool, and found that the number of 'INITIALIZED' tablets was going down slowly. By simple estimating it will take 10+ days to be back to normal. But luckily, the application system are not affected. {code:java} ------------------------------------------------------Kudu-master log with another tablet uuid I1031 16:21:21.644222 180146 catalog_manager.cc:2922] Sending DeleteTablet(TABLET_DATA_DELETED) for tablet d1fd56be8eef44e782d509a0eeae9c15 on 39f15fcf42ef45bba0c95a3223dc25ee (kudu2.lt.163.org:7050) (Replaced by ff4fd0a538944d69b8a6beea81e5bb01 at 2018-10-24 12:39:17 CST) W1031 16:21:21.644421 180146 catalog_manager.cc:2892] TS 39f15fcf42ef45bba0c95a3223dc25ee (kudu2.lt.163.org:7050): delete failed for tablet d1fd56be8eef44e782d509a0eeae9c15 with error code TABLET_NOT_RUNNING: Already present: State transition of tablet d1fd56be8eef44e782d509a0eeae9c15 already in progress: creating tablet I1031 16:21:21.644436 180146 catalog_manager.cc:2700] Scheduling retry of d1fd56be8eef44e782d509a0eeae9c15 Delete Tablet RPC for TS=39f15fcf42ef45bba0c95a3223dc25ee with a delay of 553 ms (attempt = 6) ------------------------------------------------------Kudu-tserver log I1031 16:21:22.197888 137341 tablet_service.cc:799] Processing DeleteTablet for tablet d1fd56be8eef44e782d509a0eeae9c15 with delete_type TABLET_DATA_DELETED (Replaced by ff4fd0a538944d69b8a6beea81e5bb01 at 2018-10-24 12:39:17 CST) from {username='kudu'} at 10.120.219.118:50247 I1031 16:21:22.230309 137131 maintenance_manager.cc:492] P 39f15fcf42ef45bba0c95a3223dc25ee: FlushDeltaMemStoresOp(70499bc0f9ac4d8196ae5a0be6ef0b8b) complete. Timing: real 0.416s user 0.404s sys 0.008s Metrics: {"fdatasync":3,"fdatasync_us":2583,"lbm_write_time_us":29,"lbm_writes_lt_1ms":4} I1031 16:21:22.321700 137341 tablet_service.cc:799] Processing DeleteTablet for tablet 74a30181dea9400a9bcfaeb56f83f379 with delete_type TABLET_DATA_DELETED (Replaced by 31e350fddea443048946f5a20d3171bd at 2018-10-31 16:21:13 CST) from {username='kudu'} at 10.120.219.118:50247 I1031 16:21:22.350440 137341 tablet_service.cc:799] Processing DeleteTablet for tablet 7c864af01309432c9a2a4d1c88bbe52b with delete_type TABLET_DATA_DELETED (Replaced by ec4b733818d940e0af32c51bda3c7^C {code} was: I have met this problem again on 2018/10/26. And now the kudu version is 1.7.2. ----------------------------------------------------- We modified the flag 'max_create_tablets_per_ts' (2000) of master.conf, and there are some load on the kudu cluster. Then someone else created a big table which had tens of thousands of tablets from impala-shell (that was a mistake). {code:java} CREATE TABLE XXX( ... PRIMARY KEY (...) ) PARTITION BY HASH (...) PARTITIONS 100, RANGE (...) ( PARTITION "2018-10-24" <= VALUES < "2018-10-24\000", PARTITION "2018-10-25" <= VALUES < "2018-10-25\000", ... PARTITION "2018-12-07" <= VALUES < "2018-12-07\000" ) STORED AS KUDU TBLPROPERTIES ('kudu.master_addresses'= '...'); {code} Here are the logs after creating table (only pick one tablet as example): {code:java} ------------------------------------------------------Kudu-master log ======e884bda6bbd3482f94c07ca0f34f99a4====== W1024 11:40:51.914397 180146 catalog_manager.cc:2664] TS 39f15fcf42ef45bba0c95a3223dc25ee (kudu2.lt.163.org:7050): Create Tablet RPC failed for tablet e884bda6bbd3482f94c07ca0f34f99a4: Remote error: Service unavailable: CreateTablet request on kudu.tserver.TabletServerAdminService from 10.120.219.118:50247 dropped due to backpressure. The service queue is full; it has 512 items. I1024 11:40:51.914412 180146 catalog_manager.cc:2700] Scheduling retry of CreateTablet RPC for tablet e884bda6bbd3482f94c07ca0f34f99a4 on TS 39f15fcf42ef45bba0c95a3223dc25ee with a delay of 42 ms (attempt = 1) ... ======Be replaced by 0b144c00f35d48cca4d4981698faef72====== W1024 11:41:22.114512 180202 catalog_manager.cc:3949] T 00000000000000000000000000000000 P f6c9a09da7ef4fc191cab6276b942ba3: Tablet e884bda6bbd3482f94c07ca0f34f99a4 (table quasi_realtime_user_feature [id=946d6dd03ec544eab96231e5a03bed59]) was not created within the allowed timeout. Replacing with a new tablet 0b144c00f35d48cca4d4981698faef72 ... I1024 11:41:22.391916 180202 catalog_manager.cc:3806] T 00000000000000000000000000000000 P f6c9a09da7ef4fc191cab6276b942ba3: Sending DeleteTablet for 3 replicas of tablet e884bda6bbd3482f94c07ca0f34f99a4 ... I1024 11:41:22.391927 180202 catalog_manager.cc:2922] Sending DeleteTablet(TABLET_DATA_DELETED) for tablet e884bda6bbd3482f94c07ca0f34f99a4 on 39f15fcf42ef45bba0c95a3223dc25ee (kudu2.lt.163.org:7050) (Replaced by 0b144c00f35d48cca4d4981698faef72 at 2018-10-24 11:41:22 CST) ... W1024 11:41:22.428129 180146 catalog_manager.cc:2892] TS 39f15fcf42ef45bba0c95a3223dc25ee (kudu2.lt.163.org:7050): delete failed for tablet e884bda6bbd3482f94c07ca0f34f99a4 with error code TABLET_NOT_RUNNING: Already present: State transition of tablet e884bda6bbd3482f94c07ca0f34f99a4 already in progress: creating tablet ... I1024 11:41:22.428143 180146 catalog_manager.cc:2700] Scheduling retry of e884bda6bbd3482f94c07ca0f34f99a4 Delete Tablet RPC for TS=39f15fcf42ef45bba0c95a3223dc25ee with a delay of 35 ms (attempt = 1) ... W1024 11:41:22.683702 180145 catalog_manager.cc:2664] TS b251540e606b4863bb576091ff961892 (kudu1.lt.163.org:7050): Create Tablet RPC failed for tablet 0b144c00f35d48cca4d4981698faef72: Remote error: Service unavailable: CreateTablet request on kudu.tserver.TabletServerAdminService from 10.120.219.118:59735 dropped due to backpressure. The service queue is full; it has 512 items. I1024 11:41:22.683717 180145 catalog_manager.cc:2700] Scheduling retry of CreateTablet RPC for tablet 0b144c00f35d48cca4d4981698faef72 on TS b251540e606b4863bb576091ff961892 with a delay of 46 ms (attempt = 1) ... ======Be replaced by c0e0acc448fc42fc9e48f5025b112a75====== W1024 11:41:52.775420 180202 catalog_manager.cc:3949] T 00000000000000000000000000000000 P f6c9a09da7ef4fc191cab6276b942ba3: Tablet 0b144c00f35d48cca4d4981698faef72 (table quasi_realtime_user_feature [id=946d6dd03ec544eab96231e5a03bed59]) was not created within the allowed timeout. Replacing with a new tablet c0e0acc448fc42fc9e48f5025b112a75 ... ------------------------------------------------------Kudu-tserver log I1024 11:40:52.014571 137358 tablet_service.cc:758] Processing CreateTablet for tablet e884bda6bbd3482f94c07ca0f34f99a4 (table=quasi_realtime_user_feature [id=946d6dd03ec544eab96231e5a03bed59]), partition=HASH (user_id) PARTITION 29, RANGE (dt) PARTITION "2018-11-10" <= VALUES < "2018-11-10\000" ... I1024 11:40:52.017539 137358 ts_tablet_manager.cc:1080] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee: Registered tablet (data state: TABLET_DATA_READY) ... I1024 11:41:22.392292 137355 tablet_service.cc:799] Processing DeleteTablet for tablet e884bda6bbd3482f94c07ca0f34f99a4 with delete_type TABLET_DATA_DELETED (Replaced by 0b144c00f35d48cca4d4981698faef72 at 2018-10-24 11:41:22 CST) from {username='kudu'} at 10.120.219.118:50247 ... I1024 12:03:31.079942 126376 ts_tablet_manager.cc:938] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee: Bootstrapping tablet I1024 12:03:31.079965 126376 tablet_bootstrap.cc:436] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee: Bootstrap starting. ... I1024 12:03:31.080237 126376 tablet_bootstrap.cc:581] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee: No blocks or log segments found. Creating new log. ... I1024 12:03:31.080834 126376 tablet_bootstrap.cc:436] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee: No bootstrap required, opened a new log I1024 12:03:31.080870 126376 ts_tablet_manager.cc:955] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee: Time spent bootstrapping tablet: real 0.001s user 0.004s sys 0.000s I1024 12:03:31.080987 137368 tablet_service.cc:799] Processing DeleteTablet for tablet 0ba3a2963c524d859a85ad5fbce5bf96 with delete_type TABLET_DATA_DELETED (Replaced by 539e0ab40ec1442a94b330c18521d27f at 2018-10-24 11:42:23 CST) from {username='kudu'} at 10.120.219.118:50247 I1024 12:03:31.080996 126376 raft_consensus.cc:304] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee [term 0 FOLLOWER]: Replica starting. Triggering 0 pending transactions. Active config: opid_index: -1 OBSOLETE_local: false peers { permanent_uuid: "39f15fcf42ef45bba0c95a3223dc25ee" member_type: VOTER last_known_addr { host: "kudu2.lt.163.org" port: 7050 } } peers { permanent_uuid: "cd75ddaabaa8487b9fe1feab9d2cba83" member_type: VOTER last_known_addr { host: "kudu3.lt.163.org" port: 7050 } } peers { permanent_uuid: "061221dd1c3e40a3a3338afc74bb66f5" member_type: VOTER last_known_addr { host: "kudu5.lt.163.org" port: 7050 } } I1024 12:03:31.081007 126376 raft_consensus.cc:330] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee [term 0 FOLLOWER]: Consensus starting up: Expiring failure detector timer to make a prompt election more likely I1024 12:03:31.081014 126376 raft_consensus.cc:605] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee [term 0 FOLLOWER]: Becoming Follower/Learner. State: Replica: 39f15fcf42ef45bba0c95a3223dc25ee, State: Initialized, Role: FOLLOWER I1024 12:03:31.081053 126376 consensus_queue.cc:226] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee [NON_LEADER]: Queue going to NON_LEADER mode. State: All replicated index: 0, Majority replicated index: 0, Committed index: 0, Last appended: 0.0, Last appended by leader: 0, Current term: 0, Majority size: -1, State: 0, Mode: NON_LEADER, active raft config: opid_index: -1 OBSOLETE_local: false peers { permanent_uuid: "39f15fcf42ef45bba0c95a3223dc25ee" member_type: VOTER last_known_addr { host: "kudu2.lt.163.org" port: 7050 } } peers { permanent_uuid: "cd75ddaabaa8487b9fe1feab9d2cba83" member_type: VOTER last_known_addr { host: "kudu3.lt.163.org" port: 7050 } } peers { permanent_uuid: "061221dd1c3e40a3a3338afc74bb66f5" member_type: VOTER last_known_addr { host: "kudu5.lt.163.org" port: 7050 } } ... I1024 12:03:35.315024 161474 raft_consensus.cc:436] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee [term 0 FOLLOWER]: Starting pre-election (no leader contacted us within the election timeout) I1024 12:03:35.315050 161474 raft_consensus.cc:2700] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee: Snoozing failure detection for 2.642s (starting election) I1024 12:03:35.315111 161474 raft_consensus.cc:458] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee [term 0 FOLLOWER]: Starting pre-election with config: opid_index: -1 OBSOLETE_local: false peers { permanent_uuid: "39f15fcf42ef45bba0c95a3223dc25ee" member_type: VOTER last_known_addr { host: "kudu2.lt.163.org" port: 7050 } } peers { permanent_uuid: "cd75ddaabaa8487b9fe1feab9d2cba83" member_type: VOTER last_known_addr { host: "kudu3.lt.163.org" port: 7050 } } peers { permanent_uuid: "061221dd1c3e40a3a3338afc74bb66f5" member_type: VOTER last_known_addr { host: "kudu5.lt.163.org" port: 7050 } } I1024 12:03:35.315141 137342 tablet_service.cc:799] Processing DeleteTablet for tablet c3f47046905e4c0f9ffe21a9e7a5bc8b with delete_type TABLET_DATA_DELETED (Replaced by 28f11e17ecdc45d2a4d72c74afc2a488 at 2018-10-24 11:45:27 CST) from {username='kudu'} at 10.120.219.118:50247 I1024 12:03:35.315763 161474 leader_election.cc:231] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee [CANDIDATE]: Term 1 pre-election: Requesting pre-vote from peer cd75ddaabaa8487b9fe1feab9d2cba83 I1024 12:03:35.315790 161474 leader_election.cc:231] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee [CANDIDATE]: Term 1 pre-election: Requesting pre-vote from peer 061221dd1c3e40a3a3338afc74bb66f5 W1024 12:03:35.315987 137266 leader_election.cc:293] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee [CANDIDATE]: Term 1 pre-election: Tablet error from VoteRequest() call to peer cd75ddaabaa8487b9fe1feab9d2cba83: Illegal state: must be running to vote when last-logged opid is not known W1024 12:03:35.316056 137267 leader_election.cc:293] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee [CANDIDATE]: Term 1 pre-election: Tablet error from VoteRequest() call to peer 061221dd1c3e40a3a3338afc74bb66f5: Illegal state: must be running to vote when last-logged opid is not known I1024 12:03:35.316074 137267 leader_election.cc:258] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee [CANDIDATE]: Term 1 pre-election: Election decided. Result: candidate lost. I1024 12:03:35.316102 161474 raft_consensus.cc:2700] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee: Snoozing failure detection for 4.310s (election complete) I1024 12:03:35.316115 161474 raft_consensus.cc:2455] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee [term 0 FOLLOWER]: Leader pre-election lost for term 1. Reason: could not achieve majority ... I1024 12:04:01.303092 126376 ts_tablet_manager.cc:983] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee: Time spent starting tablet: real 30.222s user 0.000s sys 0.000s W1024 12:04:01.303108 126376 ts_tablet_manager.cc:1006] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee: Tablet startup took 30222ms ... W1024 12:04:01.303225 126376 ts_tablet_manager.cc:1008] T e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee: Trace: 1024 11:40:51.956652 (+ 0us) service_pool.cc:163] Inserting onto call queue 1024 11:40:52.014539 (+ 57887us) service_pool.cc:222] Handling call 1024 11:40:52.014585 (+ 46us) ts_tablet_manager.cc:357] Acquired tablet manager lock 1024 11:40:52.014588 (+ 3us) ts_tablet_manager.cc:370] Creating new metadata... 1024 11:40:52.016532 (+ 1944us) tablet_metadata.cc:556] Metadata flushed 1024 11:40:52.017445 (+ 913us) tablet_replica.cc:143] Creating consensus instance 1024 11:40:52.017601 (+ 156us) inbound_call.cc:157] Queueing success response 1024 12:03:31.079950 (+1359062349us) ts_tablet_manager.cc:939] Bootstrapping tablet 1024 12:03:31.080879 (+ 929us) ts_tablet_manager.cc:984] Starting tablet replica 1024 12:03:31.080893 (+ 14us) tablet_replica.cc:195] Starting instrumentation 1024 12:03:31.080900 (+ 7us) tablet_replica.cc:207] Starting consensus Metrics: {"fdatasync":1,"fdatasync_us":238,"mutex_wait_us":30221902,"raft.queue_time_us":6,"raft.run_cpu_time_us":4,"raft.run_wall_time_us":5,"tablet-open.queue_time_us":1359062354} {code} {color:#FF0000}Tablet startup took 30+ seconds.{color} It was a long time for him to wait, so he did "ctrl+c". But we found that the tablets in 'INITIALIZED' status was growing rapidly, -half- an hour later it was 350,000:( {color:#FF0000}100(hash) * 45(range) * 3(RF) * (60(minute) * 60(second) / 30(repeat/second)) / 5(tservers) = 324000 (tablets/tserver).{color} We deleted this table by kudu client tool, and found that the number of 'INITIALIZED' tablets was going down slowly. By simple estimating it will take 10+ days to be back to normal. But luckily, the application system are not affected. {code:java} ------------------------------------------------------Kudu-master log with another tablet uuid I1031 16:21:21.644222 180146 catalog_manager.cc:2922] Sending DeleteTablet(TABLET_DATA_DELETED) for tablet d1fd56be8eef44e782d509a0eeae9c15 on 39f15fcf42ef45bba0c95a3223dc25ee (kudu2.lt.163.org:7050) (Replaced by ff4fd0a538944d69b8a6beea81e5bb01 at 2018-10-24 12:39:17 CST) W1031 16:21:21.644421 180146 catalog_manager.cc:2892] TS 39f15fcf42ef45bba0c95a3223dc25ee (kudu2.lt.163.org:7050): delete failed for tablet d1fd56be8eef44e782d509a0eeae9c15 with error code TABLET_NOT_RUNNING: Already present: State transition of tablet d1fd56be8eef44e782d509a0eeae9c15 already in progress: creating tablet I1031 16:21:21.644436 180146 catalog_manager.cc:2700] Scheduling retry of d1fd56be8eef44e782d509a0eeae9c15 Delete Tablet RPC for TS=39f15fcf42ef45bba0c95a3223dc25ee with a delay of 553 ms (attempt = 6) ------------------------------------------------------Kudu-tserver log I1031 16:21:22.197888 137341 tablet_service.cc:799] Processing DeleteTablet for tablet d1fd56be8eef44e782d509a0eeae9c15 with delete_type TABLET_DATA_DELETED (Replaced by ff4fd0a538944d69b8a6beea81e5bb01 at 2018-10-24 12:39:17 CST) from {username='kudu'} at 10.120.219.118:50247 I1031 16:21:22.230309 137131 maintenance_manager.cc:492] P 39f15fcf42ef45bba0c95a3223dc25ee: FlushDeltaMemStoresOp(70499bc0f9ac4d8196ae5a0be6ef0b8b) complete. Timing: real 0.416s user 0.404s sys 0.008s Metrics: {"fdatasync":3,"fdatasync_us":2583,"lbm_write_time_us":29,"lbm_writes_lt_1ms":4} I1031 16:21:22.321700 137341 tablet_service.cc:799] Processing DeleteTablet for tablet 74a30181dea9400a9bcfaeb56f83f379 with delete_type TABLET_DATA_DELETED (Replaced by 31e350fddea443048946f5a20d3171bd at 2018-10-31 16:21:13 CST) from {username='kudu'} at 10.120.219.118:50247 I1031 16:21:22.350440 137341 tablet_service.cc:799] Processing DeleteTablet for tablet 7c864af01309432c9a2a4d1c88bbe52b with delete_type TABLET_DATA_DELETED (Replaced by ec4b733818d940e0af32c51bda3c7^C {code} > kudu should stop creating tablet infinitely > ------------------------------------------- > > Key: KUDU-2453 > URL: https://issues.apache.org/jira/browse/KUDU-2453 > Project: Kudu > Issue Type: Bug > Components: master, tserver > Affects Versions: 1.4.0, 1.7.2 > Reporter: HeLifu > Priority: Major > > I have met this problem again on 2018/10/26. And now the kudu version is > 1.7.2. > ----------------------------------------------------- > We modified the flag 'max_create_tablets_per_ts' (2000) of master.conf, and > there are some load on the kudu cluster. Then someone else created a big > table which had tens of thousands of tablets from impala-shell (that was a > mistake). > {code:java} > CREATE TABLE XXX( > ... > PRIMARY KEY (...) > ) > PARTITION BY HASH (...) PARTITIONS 100, > RANGE (...) > ( > PARTITION "2018-10-24" <= VALUES < "2018-10-24\000", > PARTITION "2018-10-25" <= VALUES < "2018-10-25\000", > ... > PARTITION "2018-12-07" <= VALUES < "2018-12-07\000" > ) > STORED AS KUDU > TBLPROPERTIES ('kudu.master_addresses'= '...'); > {code} > Here are the logs after creating table (only pick one tablet as example): > {code:java} > ------------------------------------------------------Kudu-master log > ======e884bda6bbd3482f94c07ca0f34f99a4====== > W1024 11:40:51.914397 180146 catalog_manager.cc:2664] TS > 39f15fcf42ef45bba0c95a3223dc25ee (kudu2.lt.163.org:7050): Create Tablet RPC > failed for tablet e884bda6bbd3482f94c07ca0f34f99a4: Remote error: Service > unavailable: CreateTablet request on kudu.tserver.TabletServerAdminService > from 10.120.219.118:50247 dropped due to backpressure. The service queue is > full; it has 512 items. > I1024 11:40:51.914412 180146 catalog_manager.cc:2700] Scheduling retry of > CreateTablet RPC for tablet e884bda6bbd3482f94c07ca0f34f99a4 on TS > 39f15fcf42ef45bba0c95a3223dc25ee with a delay of 42 ms (attempt = 1) > ... > ======Be replaced by 0b144c00f35d48cca4d4981698faef72====== > W1024 11:41:22.114512 180202 catalog_manager.cc:3949] T > 00000000000000000000000000000000 P f6c9a09da7ef4fc191cab6276b942ba3: Tablet > e884bda6bbd3482f94c07ca0f34f99a4 (table quasi_realtime_user_feature > [id=946d6dd03ec544eab96231e5a03bed59]) was not created within the allowed > timeout. Replacing with a new tablet 0b144c00f35d48cca4d4981698faef72 > ... > I1024 11:41:22.391916 180202 catalog_manager.cc:3806] T > 00000000000000000000000000000000 P f6c9a09da7ef4fc191cab6276b942ba3: Sending > DeleteTablet for 3 replicas of tablet e884bda6bbd3482f94c07ca0f34f99a4 > ... > I1024 11:41:22.391927 180202 catalog_manager.cc:2922] Sending > DeleteTablet(TABLET_DATA_DELETED) for tablet e884bda6bbd3482f94c07ca0f34f99a4 > on 39f15fcf42ef45bba0c95a3223dc25ee (kudu2.lt.163.org:7050) (Replaced by > 0b144c00f35d48cca4d4981698faef72 at 2018-10-24 11:41:22 CST) > ... > W1024 11:41:22.428129 180146 catalog_manager.cc:2892] TS > 39f15fcf42ef45bba0c95a3223dc25ee (kudu2.lt.163.org:7050): delete failed for > tablet e884bda6bbd3482f94c07ca0f34f99a4 with error code TABLET_NOT_RUNNING: > Already present: State transition of tablet e884bda6bbd3482f94c07ca0f34f99a4 > already in progress: creating tablet > ... > I1024 11:41:22.428143 180146 catalog_manager.cc:2700] Scheduling retry of > e884bda6bbd3482f94c07ca0f34f99a4 Delete Tablet RPC for > TS=39f15fcf42ef45bba0c95a3223dc25ee with a delay of 35 ms (attempt = 1) > ... > W1024 11:41:22.683702 180145 catalog_manager.cc:2664] TS > b251540e606b4863bb576091ff961892 (kudu1.lt.163.org:7050): Create Tablet RPC > failed for tablet 0b144c00f35d48cca4d4981698faef72: Remote error: Service > unavailable: CreateTablet request on kudu.tserver.TabletServerAdminService > from 10.120.219.118:59735 dropped due to backpressure. The service queue is > full; it has 512 items. > I1024 11:41:22.683717 180145 catalog_manager.cc:2700] Scheduling retry of > CreateTablet RPC for tablet 0b144c00f35d48cca4d4981698faef72 on TS > b251540e606b4863bb576091ff961892 with a delay of 46 ms (attempt = 1) > ... > ======Be replaced by c0e0acc448fc42fc9e48f5025b112a75====== > W1024 11:41:52.775420 180202 catalog_manager.cc:3949] T > 00000000000000000000000000000000 P f6c9a09da7ef4fc191cab6276b942ba3: Tablet > 0b144c00f35d48cca4d4981698faef72 (table quasi_realtime_user_feature > [id=946d6dd03ec544eab96231e5a03bed59]) was not created within the allowed > timeout. Replacing with a new tablet c0e0acc448fc42fc9e48f5025b112a75 > ... > ------------------------------------------------------Kudu-tserver log > I1024 11:40:52.014571 137358 tablet_service.cc:758] Processing CreateTablet > for tablet e884bda6bbd3482f94c07ca0f34f99a4 > (table=quasi_realtime_user_feature [id=946d6dd03ec544eab96231e5a03bed59]), > partition=HASH (user_id) PARTITION 29, RANGE (dt) PARTITION "2018-11-10" <= > VALUES < "2018-11-10\000" > ... > I1024 11:40:52.017539 137358 ts_tablet_manager.cc:1080] T > e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee: > Registered tablet (data state: TABLET_DATA_READY) > ... > I1024 11:41:22.392292 137355 tablet_service.cc:799] Processing DeleteTablet > for tablet e884bda6bbd3482f94c07ca0f34f99a4 with delete_type > TABLET_DATA_DELETED (Replaced by 0b144c00f35d48cca4d4981698faef72 at > 2018-10-24 11:41:22 CST) from {username='kudu'} at 10.120.219.118:50247 > ... > I1024 12:03:31.079942 126376 ts_tablet_manager.cc:938] T > e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee: > Bootstrapping tablet > I1024 12:03:31.079965 126376 tablet_bootstrap.cc:436] T > e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee: > Bootstrap starting. > ... > I1024 12:03:31.080237 126376 tablet_bootstrap.cc:581] T > e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee: No > blocks or log segments found. Creating new log. > ... > I1024 12:03:31.080834 126376 tablet_bootstrap.cc:436] T > e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee: No > bootstrap required, opened a new log > I1024 12:03:31.080870 126376 ts_tablet_manager.cc:955] T > e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee: Time > spent bootstrapping tablet: real 0.001s user 0.004s sys 0.000s > I1024 12:03:31.080987 137368 tablet_service.cc:799] Processing DeleteTablet > for tablet 0ba3a2963c524d859a85ad5fbce5bf96 with delete_type > TABLET_DATA_DELETED (Replaced by 539e0ab40ec1442a94b330c18521d27f at > 2018-10-24 11:42:23 CST) from {username='kudu'} at 10.120.219.118:50247 > I1024 12:03:31.080996 126376 raft_consensus.cc:304] T > e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee [term 0 > FOLLOWER]: Replica starting. Triggering 0 pending transactions. Active > config: opid_index: -1 OBSOLETE_local: false peers { permanent_uuid: > "39f15fcf42ef45bba0c95a3223dc25ee" member_type: VOTER last_known_addr { host: > "kudu2.lt.163.org" port: 7050 } } peers { permanent_uuid: > "cd75ddaabaa8487b9fe1feab9d2cba83" member_type: VOTER last_known_addr { host: > "kudu3.lt.163.org" port: 7050 } } peers { permanent_uuid: > "061221dd1c3e40a3a3338afc74bb66f5" member_type: VOTER last_known_addr { host: > "kudu5.lt.163.org" port: 7050 } } > I1024 12:03:31.081007 126376 raft_consensus.cc:330] T > e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee [term 0 > FOLLOWER]: Consensus starting up: Expiring failure detector timer to make a > prompt election more likely > I1024 12:03:31.081014 126376 raft_consensus.cc:605] T > e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee [term 0 > FOLLOWER]: Becoming Follower/Learner. State: Replica: > 39f15fcf42ef45bba0c95a3223dc25ee, State: Initialized, Role: FOLLOWER > I1024 12:03:31.081053 126376 consensus_queue.cc:226] T > e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee > [NON_LEADER]: Queue going to NON_LEADER mode. State: All replicated index: 0, > Majority replicated index: 0, Committed index: 0, Last appended: 0.0, Last > appended by leader: 0, Current term: 0, Majority size: -1, State: 0, Mode: > NON_LEADER, active raft config: opid_index: -1 OBSOLETE_local: false peers { > permanent_uuid: "39f15fcf42ef45bba0c95a3223dc25ee" member_type: VOTER > last_known_addr { host: "kudu2.lt.163.org" port: 7050 } } peers { > permanent_uuid: "cd75ddaabaa8487b9fe1feab9d2cba83" member_type: VOTER > last_known_addr { host: "kudu3.lt.163.org" port: 7050 } } peers { > permanent_uuid: "061221dd1c3e40a3a3338afc74bb66f5" member_type: VOTER > last_known_addr { host: "kudu5.lt.163.org" port: 7050 } } > ... > I1024 12:03:35.315024 161474 raft_consensus.cc:436] T > e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee [term 0 > FOLLOWER]: Starting pre-election (no leader contacted us within the election > timeout) > I1024 12:03:35.315050 161474 raft_consensus.cc:2700] T > e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee: Snoozing > failure detection for 2.642s (starting election) > I1024 12:03:35.315111 161474 raft_consensus.cc:458] T > e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee [term 0 > FOLLOWER]: Starting pre-election with config: opid_index: -1 OBSOLETE_local: > false peers { permanent_uuid: "39f15fcf42ef45bba0c95a3223dc25ee" member_type: > VOTER last_known_addr { host: "kudu2.lt.163.org" port: 7050 } } peers { > permanent_uuid: "cd75ddaabaa8487b9fe1feab9d2cba83" member_type: VOTER > last_known_addr { host: "kudu3.lt.163.org" port: 7050 } } peers { > permanent_uuid: "061221dd1c3e40a3a3338afc74bb66f5" member_type: VOTER > last_known_addr { host: "kudu5.lt.163.org" port: 7050 } } > I1024 12:03:35.315141 137342 tablet_service.cc:799] Processing DeleteTablet > for tablet c3f47046905e4c0f9ffe21a9e7a5bc8b with delete_type > TABLET_DATA_DELETED (Replaced by 28f11e17ecdc45d2a4d72c74afc2a488 at > 2018-10-24 11:45:27 CST) from {username='kudu'} at 10.120.219.118:50247 > I1024 12:03:35.315763 161474 leader_election.cc:231] T > e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee > [CANDIDATE]: Term 1 pre-election: Requesting pre-vote from peer > cd75ddaabaa8487b9fe1feab9d2cba83 > I1024 12:03:35.315790 161474 leader_election.cc:231] T > e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee > [CANDIDATE]: Term 1 pre-election: Requesting pre-vote from peer > 061221dd1c3e40a3a3338afc74bb66f5 > W1024 12:03:35.315987 137266 leader_election.cc:293] T > e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee > [CANDIDATE]: Term 1 pre-election: Tablet error from VoteRequest() call to > peer cd75ddaabaa8487b9fe1feab9d2cba83: Illegal state: must be running to vote > when last-logged opid is not known > W1024 12:03:35.316056 137267 leader_election.cc:293] T > e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee > [CANDIDATE]: Term 1 pre-election: Tablet error from VoteRequest() call to > peer 061221dd1c3e40a3a3338afc74bb66f5: Illegal state: must be running to vote > when last-logged opid is not known > I1024 12:03:35.316074 137267 leader_election.cc:258] T > e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee > [CANDIDATE]: Term 1 pre-election: Election decided. Result: candidate lost. > I1024 12:03:35.316102 161474 raft_consensus.cc:2700] T > e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee: Snoozing > failure detection for 4.310s (election complete) > I1024 12:03:35.316115 161474 raft_consensus.cc:2455] T > e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee [term 0 > FOLLOWER]: Leader pre-election lost for term 1. Reason: could not achieve > majority > ... > I1024 12:04:01.303092 126376 ts_tablet_manager.cc:983] T > e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee: Time > spent starting tablet: real 30.222s user 0.000s sys 0.000s > W1024 12:04:01.303108 126376 ts_tablet_manager.cc:1006] T > e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee: Tablet > startup took 30222ms > ... > W1024 12:04:01.303225 126376 ts_tablet_manager.cc:1008] T > e884bda6bbd3482f94c07ca0f34f99a4 P 39f15fcf42ef45bba0c95a3223dc25ee: Trace: > 1024 11:40:51.956652 (+ 0us) service_pool.cc:163] Inserting onto call queue > 1024 11:40:52.014539 (+ 57887us) service_pool.cc:222] Handling call > 1024 11:40:52.014585 (+ 46us) ts_tablet_manager.cc:357] Acquired tablet > manager lock > 1024 11:40:52.014588 (+ 3us) ts_tablet_manager.cc:370] Creating new > metadata... > 1024 11:40:52.016532 (+ 1944us) tablet_metadata.cc:556] Metadata flushed > 1024 11:40:52.017445 (+ 913us) tablet_replica.cc:143] Creating consensus > instance > 1024 11:40:52.017601 (+ 156us) inbound_call.cc:157] Queueing success response > 1024 12:03:31.079950 (+1359062349us) ts_tablet_manager.cc:939] Bootstrapping > tablet > 1024 12:03:31.080879 (+ 929us) ts_tablet_manager.cc:984] Starting tablet > replica > 1024 12:03:31.080893 (+ 14us) tablet_replica.cc:195] Starting instrumentation > 1024 12:03:31.080900 (+ 7us) tablet_replica.cc:207] Starting consensus > Metrics: > {"fdatasync":1,"fdatasync_us":238,"mutex_wait_us":30221902,"raft.queue_time_us":6,"raft.run_cpu_time_us":4,"raft.run_wall_time_us":5,"tablet-open.queue_time_us":1359062354} > {code} > {color:#ff0000}Tablet startup took 30+ seconds.{color} > It was a long time for him to wait, so he did "ctrl+c". But we found that the > tablets in 'INITIALIZED' status was growing rapidly, -half- an hour later it > was 350,000:( > {color:#ff0000}100(hash) * 45(range) * 3(RF) * (60(minute) * 60(second) / > 30(repeat/second)) / 5(tservers) = 324000 (tablets/tserver).{color} > We deleted this table by kudu client tool, and found that the number of > 'INITIALIZED' tablets was going down slowly. By simple estimating it will > take 10+ days to be back to normal. But luckily, the application system are > not affected. > {code:java} > ------------------------------------------------------Kudu-master log with > another tablet uuid > I1031 16:21:21.644222 180146 catalog_manager.cc:2922] Sending > DeleteTablet(TABLET_DATA_DELETED) for tablet d1fd56be8eef44e782d509a0eeae9c15 > on 39f15fcf42ef45bba0c95a3223dc25ee (kudu2.lt.163.org:7050) (Replaced by > ff4fd0a538944d69b8a6beea81e5bb01 at 2018-10-24 12:39:17 CST) > W1031 16:21:21.644421 180146 catalog_manager.cc:2892] TS > 39f15fcf42ef45bba0c95a3223dc25ee (kudu2.lt.163.org:7050): delete failed for > tablet d1fd56be8eef44e782d509a0eeae9c15 with error code TABLET_NOT_RUNNING: > Already present: State transition of tablet d1fd56be8eef44e782d509a0eeae9c15 > already in progress: creating tablet > I1031 16:21:21.644436 180146 catalog_manager.cc:2700] Scheduling retry of > d1fd56be8eef44e782d509a0eeae9c15 Delete Tablet RPC for > TS=39f15fcf42ef45bba0c95a3223dc25ee with a delay of 553 ms (attempt = 6) > ------------------------------------------------------Kudu-tserver log > I1031 16:21:22.197888 137341 tablet_service.cc:799] Processing DeleteTablet > for tablet d1fd56be8eef44e782d509a0eeae9c15 with delete_type > TABLET_DATA_DELETED (Replaced by ff4fd0a538944d69b8a6beea81e5bb01 at > 2018-10-24 12:39:17 CST) from {username='kudu'} at 10.120.219.118:50247 > I1031 16:21:22.230309 137131 maintenance_manager.cc:492] P > 39f15fcf42ef45bba0c95a3223dc25ee: > FlushDeltaMemStoresOp(70499bc0f9ac4d8196ae5a0be6ef0b8b) complete. Timing: > real 0.416s user 0.404s sys 0.008s Metrics: > {"fdatasync":3,"fdatasync_us":2583,"lbm_write_time_us":29,"lbm_writes_lt_1ms":4} > I1031 16:21:22.321700 137341 tablet_service.cc:799] Processing DeleteTablet > for tablet 74a30181dea9400a9bcfaeb56f83f379 with delete_type > TABLET_DATA_DELETED (Replaced by 31e350fddea443048946f5a20d3171bd at > 2018-10-31 16:21:13 CST) from {username='kudu'} at 10.120.219.118:50247 > I1031 16:21:22.350440 137341 tablet_service.cc:799] Processing DeleteTablet > for tablet 7c864af01309432c9a2a4d1c88bbe52b with delete_type > TABLET_DATA_DELETED (Replaced by ec4b733818d940e0af32c51bda3c7^C > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005)