[ https://issues.apache.org/jira/browse/KUDU-2992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16984778#comment-16984778 ]
YifanZhang commented on KUDU-2992:
----------------------------------

I tried to reproduce this case by deleting many tablets at the same time, such as by dropping a big table or some big partitions, and found many log entries like these in the master:
{code:java}
$ grep "Got report from deleted tablet 71c50a73cddf4562b9b85477e4c2ea7b" kudu_master.c3-hadoop-kudu-prc-ct02.bj.work.log.INFO.20191128-213038.14672
I1129 11:21:42.995760 14810 catalog_manager.cc:4013] Got report from deleted tablet 71c50a73cddf4562b9b85477e4c2ea7b (table default.loadgen_auto_8f39ac625a834b02aaf994887917a49a [id=bfa28d8904b64d62ab6bc984c7bd1c0e]) (Partition dropped at 2019-11-29 11:21:07 CST): Sending delete request for this tablet
I1129 11:21:43.501857 14817 catalog_manager.cc:4013] Got report from deleted tablet 71c50a73cddf4562b9b85477e4c2ea7b (table default.loadgen_auto_8f39ac625a834b02aaf994887917a49a [id=bfa28d8904b64d62ab6bc984c7bd1c0e]) (Partition dropped at 2019-11-29 11:21:07 CST): Sending delete request for this tablet
I1129 11:21:44.394129 14817 catalog_manager.cc:4013] Got report from deleted tablet 71c50a73cddf4562b9b85477e4c2ea7b (table default.loadgen_auto_8f39ac625a834b02aaf994887917a49a [id=bfa28d8904b64d62ab6bc984c7bd1c0e]) (Partition dropped at 2019-11-29 11:21:07 CST): Sending delete request for this tablet
I1129 11:21:45.001634 14811 catalog_manager.cc:4013] Got report from deleted tablet 71c50a73cddf4562b9b85477e4c2ea7b (table default.loadgen_auto_8f39ac625a834b02aaf994887917a49a [id=bfa28d8904b64d62ab6bc984c7bd1c0e]) (Partition dropped at 2019-11-29 11:21:07 CST): Sending delete request for this tablet
I1129 11:21:45.618881 14811 catalog_manager.cc:4013] Got report from deleted tablet 71c50a73cddf4562b9b85477e4c2ea7b (table default.loadgen_auto_8f39ac625a834b02aaf994887917a49a [id=bfa28d8904b64d62ab6bc984c7bd1c0e]) (Partition dropped at 2019-11-29 11:21:07 CST): Sending delete request for this tablet
I1129 11:21:46.610380 14817 catalog_manager.cc:4013] Got report from deleted tablet 71c50a73cddf4562b9b85477e4c2ea7b (table default.loadgen_auto_8f39ac625a834b02aaf994887917a49a [id=bfa28d8904b64d62ab6bc984c7bd1c0e]) (Partition dropped at 2019-11-29 11:21:07 CST): Sending delete request for this tablet
I1129 11:21:47.086390 14810 catalog_manager.cc:4013] Got report from deleted tablet 71c50a73cddf4562b9b85477e4c2ea7b (table default.loadgen_auto_8f39ac625a834b02aaf994887917a49a [id=bfa28d8904b64d62ab6bc984c7bd1c0e]) (Partition dropped at 2019-11-29 11:21:07 CST): Sending delete request for this tablet
I1129 11:21:47.972025 14817 catalog_manager.cc:4013] Got report from deleted tablet 71c50a73cddf4562b9b85477e4c2ea7b (table default.loadgen_auto_8f39ac625a834b02aaf994887917a49a [id=bfa28d8904b64d62ab6bc984c7bd1c0e]) (Partition dropped at 2019-11-29 11:21:07 CST): Sending delete request for this tablet
I1129 11:21:48.973754 14811 catalog_manager.cc:4013] Got report from deleted tablet 71c50a73cddf4562b9b85477e4c2ea7b (table default.loadgen_auto_8f39ac625a834b02aaf994887917a49a [id=bfa28d8904b64d62ab6bc984c7bd1c0e]) (Partition dropped at 2019-11-29 11:21:07 CST): Sending delete request for this tablet
I1129 11:21:49.514094 14811 catalog_manager.cc:4013] Got report from deleted tablet 71c50a73cddf4562b9b85477e4c2ea7b (table default.loadgen_auto_8f39ac625a834b02aaf994887917a49a [id=bfa28d8904b64d62ab6bc984c7bd1c0e]) (Partition dropped at 2019-11-29 11:21:07 CST): Sending delete request for this tablet
I1129 11:21:50.040673 14810 catalog_manager.cc:4013] Got report from deleted tablet 71c50a73cddf4562b9b85477e4c2ea7b (table default.loadgen_auto_8f39ac625a834b02aaf994887917a49a [id=bfa28d8904b64d62ab6bc984c7bd1c0e]) (Partition dropped at 2019-11-29 11:21:07 CST): Sending delete request for this tablet
I1129 11:21:51.057112 14810 catalog_manager.cc:4013] Got report from deleted tablet 71c50a73cddf4562b9b85477e4c2ea7b (table default.loadgen_auto_8f39ac625a834b02aaf994887917a49a [id=bfa28d8904b64d62ab6bc984c7bd1c0e]) (Partition dropped at 2019-11-29 11:21:07 CST): Sending delete request for this tablet
I1129 11:21:51.800305 14810 catalog_manager.cc:4013] Got report from deleted tablet 71c50a73cddf4562b9b85477e4c2ea7b (table default.loadgen_auto_8f39ac625a834b02aaf994887917a49a [id=bfa28d8904b64d62ab6bc984c7bd1c0e]) (Partition dropped at 2019-11-29 11:21:07 CST): Sending delete request for this tablet
{code}
That means the master sends a delete request every time it receives a report from an already-deleted tablet; maybe we could do something to prevent this kind of duplicate request.
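To make that suggestion concrete, here is a minimal sketch (not actual Kudu code; the class and method names are hypothetical) of one way the master could suppress the duplicates: remember which tablet IDs already have a DeleteTablet RPC in flight, and only send a new one once the previous attempt has resolved.
{code:java}
// Hypothetical sketch in C++ (the language of the Kudu code base). None of
// these names exist in Kudu; this only illustrates the dedup idea above.
#include <iostream>
#include <mutex>
#include <string>
#include <unordered_set>

class PendingDeleteTracker {
 public:
  // Returns true if the caller should send a DeleteTablet RPC for
  // 'tablet_id'; returns false if one is already in flight.
  bool StartDelete(const std::string& tablet_id) {
    std::lock_guard<std::mutex> l(lock_);
    return in_flight_.insert(tablet_id).second;
  }

  // Called from the RPC completion path (success or failure) so a later
  // tablet report can trigger a fresh delete if one is still needed.
  void FinishDelete(const std::string& tablet_id) {
    std::lock_guard<std::mutex> l(lock_);
    in_flight_.erase(tablet_id);
  }

 private:
  std::mutex lock_;
  std::unordered_set<std::string> in_flight_;
};

int main() {
  PendingDeleteTracker tracker;
  // First report for the tablet: a delete is sent.
  std::cout << tracker.StartDelete("71c50a73cddf4562b9b85477e4c2ea7b") << std::endl;  // 1
  // Duplicate report while the RPC is outstanding: skipped.
  std::cout << tracker.StartDelete("71c50a73cddf4562b9b85477e4c2ea7b") << std::endl;  // 0
  tracker.FinishDelete("71c50a73cddf4562b9b85477e4c2ea7b");
  return 0;
}
{code}
With something like this keyed by tablet ID in the catalog manager, the repeated tablet reports above would trigger at most one outstanding delete per tablet instead of one RPC per report.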
> Limit concurrent alter request of a table
> -----------------------------------------
>
> Key: KUDU-2992
> URL: https://issues.apache.org/jira/browse/KUDU-2992
> Project: Kudu
> Issue Type: Improvement
> Components: master
> Reporter: Yingchun Lai
> Priority: Major
>
> One of our production clusters had an accident some days ago. A user has a table whose partition schema looks like:
> {code:java}
> HASH (uuid) PARTITIONS 80,
> RANGE (date_hour) (
>     PARTITION 2019102900 <= VALUES < 2019102901,
>     PARTITION 2019102901 <= VALUES < 2019102902,
>     PARTITION 2019102902 <= VALUES < 2019102903,
>     PARTITION 2019102903 <= VALUES < 2019102904,
>     PARTITION 2019102904 <= VALUES < 2019102905,
>     ...)
> {code}
> He tried to remove many outdated partitions at once via SparkSQL, but it returned a timeout error at first; he then retried again and again, and SparkSQL failed again and again. The cluster then became unstable, with memory usage and CPU load increasing.
>
> I found many log entries like:
> {code:java}
> W1030 17:29:53.382287 7588 rpcz_store.cc:259] Trace:
> 1030 17:26:19.714799 (+ 0us) service_pool.cc:162] Inserting onto call queue
> 1030 17:26:19.714808 (+ 9us) service_pool.cc:221] Handling call
> 1030 17:29:53.382204 (+213667396us) ts_tablet_manager.cc:874] Deleting tablet c52c5f43f7884d08b07fd0005e878fed
> 1030 17:29:53.382205 (+ 1us) ts_tablet_manager.cc:794] Acquired tablet manager lock
> 1030 17:29:53.382208 (+ 3us) inbound_call.cc:162] Queueing success response
> Metrics: {"tablet-delete.queue_time_us":213667360}
> W1030 17:29:53.382300 7586 rpcz_store.cc:253] Call kudu.tserver.TabletServerAdminService.DeleteTablet from 10.152.49.21:55576 (request call id 1820316) took 213667 ms (3.56 min). Client timeout 29999 ms (30 s)
> W1030 17:29:53.382292 10623 rpcz_store.cc:253] Call kudu.tserver.TabletServerAdminService.DeleteTablet from 10.152.49.21:55576 (request call id 1820315) took 213667 ms (3.56 min). Client timeout 29999 ms (30 s)
> W1030 17:29:53.382297 10622 rpcz_store.cc:259] Trace:
> 1030 17:26:19.714825 (+ 0us) service_pool.cc:162] Inserting onto call queue
> 1030 17:26:19.714833 (+ 8us) service_pool.cc:221] Handling call
> 1030 17:29:53.382239 (+213667406us) ts_tablet_manager.cc:874] Deleting tablet 479f8c592f16408c830637a0129359e1
> 1030 17:29:53.382241 (+ 2us) ts_tablet_manager.cc:794] Acquired tablet manager lock
> 1030 17:29:53.382244 (+ 3us) inbound_call.cc:162] Queueing success response
> Metrics: {"tablet-delete.queue_time_us":213667378}
> ...
> {code}
> That means reaching 'Acquired tablet manager lock' took a long time, right?
> {code:java}
> Status TSTabletManager::BeginReplicaStateTransition(
>     const string& tablet_id,
>     const string& reason,
>     scoped_refptr<TabletReplica>* replica,
>     scoped_refptr<TransitionInProgressDeleter>* deleter,
>     TabletServerErrorPB::Code* error_code) {
>   // Acquire the lock in exclusive mode as we'll add an entry to the
>   // transition_in_progress_ map.
>   std::lock_guard<RWMutex> lock(lock_);
>   TRACE("Acquired tablet manager lock");
>   RETURN_NOT_OK(CheckRunningUnlocked(error_code));
>   ...
> }
> {code}
> But I think the root cause is that the Kudu master sends too many duplicate 'alter table/delete tablet' requests to tservers. I found more info in the master's log:
> {code:java}
> $ grep "Scheduling retry of 8f8b354490684bf3a54e49a1478ec99d" kudu_master.zjy-hadoop-prc-ct01.bj.work.log.INFO.20191030-204137.62788 | egrep "attempt = 1\)"
> I1030 20:41:42.207222 62821 catalog_manager.cc:2971] Scheduling retry of 8f8b354490684bf3a54e49a1478ec99d Delete Tablet RPC for TS=d50ddd2e763e4d5e81828a3807187b2e with a delay of 43 ms (attempt = 1)
> I1030 20:41:42.207556 62821 catalog_manager.cc:2971] Scheduling retry of 8f8b354490684bf3a54e49a1478ec99d Delete Tablet RPC for TS=d50ddd2e763e4d5e81828a3807187b2e with a delay of 40 ms (attempt = 1)
> I1030 20:41:42.260052 62821 catalog_manager.cc:2971] Scheduling retry of 8f8b354490684bf3a54e49a1478ec99d Delete Tablet RPC for TS=d50ddd2e763e4d5e81828a3807187b2e with a delay of 31 ms (attempt = 1)
> I1030 20:41:42.278609 62821 catalog_manager.cc:2971] Scheduling retry of 8f8b354490684bf3a54e49a1478ec99d Delete Tablet RPC for TS=d50ddd2e763e4d5e81828a3807187b2e with a delay of 19 ms (attempt = 1)
> I1030 20:41:42.312175 62821 catalog_manager.cc:2971] Scheduling retry of 8f8b354490684bf3a54e49a1478ec99d Delete Tablet RPC for TS=d50ddd2e763e4d5e81828a3807187b2e with a delay of 48 ms (attempt = 1)
> I1030 20:41:42.318933 62821 catalog_manager.cc:2971] Scheduling retry of 8f8b354490684bf3a54e49a1478ec99d Delete Tablet RPC for TS=d50ddd2e763e4d5e81828a3807187b2e with a delay of 62 ms (attempt = 1)
> I1030 20:41:42.340060 62821 catalog_manager.cc:2971] Scheduling retry of 8f8b354490684bf3a54e49a1478ec99d Delete Tablet RPC for TS=d50ddd2e763e4d5e81828a3807187b2e with a delay of 30 ms (attempt = 1)
> ...
> {code}
> That means the master received too many duplicate 'delete tablet' requests from the client, and then dispatched these requests to tservers.
> I think we should limit the number of concurrent alter requests for a table: while an alter request is ongoing and hasn't finished, subsequent requests should be rejected. Or we should limit the size of TableInfo::pending_tasks_.
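To illustrate the limit proposed in the quoted description, here is a minimal sketch (hypothetical names, not Kudu's actual API) of a per-table cap on pending alter/delete tasks, where a new request is rejected while the table is already at its limit instead of being queued and retried without bound:
{code:java}
// Hypothetical sketch of the admission check; 'BoundedTaskCounter' and the
// chosen limit are invented for illustration only.
#include <iostream>
#include <mutex>

class BoundedTaskCounter {
 public:
  explicit BoundedTaskCounter(int max_pending) : max_pending_(max_pending) {}

  // Try to admit one more task for this table. Returning false would map to
  // rejecting the client's alter request (e.g. "alter already in progress,
  // retry later") rather than queueing it.
  bool TryAdmit() {
    std::lock_guard<std::mutex> l(lock_);
    if (pending_ >= max_pending_) {
      return false;
    }
    ++pending_;
    return true;
  }

  // Called when a task completes, freeing a slot for the next request.
  void TaskFinished() {
    std::lock_guard<std::mutex> l(lock_);
    --pending_;
  }

 private:
  std::mutex lock_;
  const int max_pending_;
  int pending_ = 0;
};

int main() {
  BoundedTaskCounter table_tasks(/*max_pending=*/1);
  std::cout << table_tasks.TryAdmit() << std::endl;   // 1: first alter admitted
  std::cout << table_tasks.TryAdmit() << std::endl;   // 0: concurrent alter rejected
  table_tasks.TaskFinished();
  std::cout << table_tasks.TryAdmit() << std::endl;   // 1: admitted after the first finishes
  return 0;
}
{code}
The same counter could equally back a bound on TableInfo::pending_tasks_, as the description suggests, so that repeated client retries cannot fan out into thousands of queued DeleteTablet RPCs.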