I have run into the following scenario. I won't go into the details of how I got here, since I have not been able to reproduce it reliably so far. (It typically happens when a region server goes down because of hardware issues.)
Let me explain the columns of the table below.

Col 1: Region server on which the region is trying to replicate
Col 2: Region that is trying to replicate but is stuck
Col 3: Sequence ID that is being replicated and is stuck because the previous range has not finished
Col 4: Checkpoint in zk up to which sequence IDs have already been replicated to the peer
Col 5: Replication barriers for that region. This is the list of open sequence IDs recorded on region movement. (+++ marks where the *checkpoint* belongs, --- marks where the *to replicate seqid* belongs)

There are 53 regions and 10 region servers in total.

RegionServer  Region                            To replicate seqid  Replicated until  Current barriers
rs-9          24c765b42253f96b550831d83e99cc9e  18775105            18762209          [17776286, +++ 18762210, 18775053, 18775079, 18775104, --- 18775119]
rs-5          b4144bfe75c5826710ec54849741b038  189154192           189091221         [184183678, +++ 189117430, 189154191, --- 189154327]
rs-8          deb6fee3380e7b9db9826cb5f27f8a59  189099509           189036510         [180662218, +++ 189062798, 189099508, --- 189099587]
rs-8          3338fd34ae7ba06a7eccd89048fa83ce  189078951           189077722         [184170310, +++ 189078876, 189078950, --- 189104780, 189141509, 189141545, 189141595]
rs-6          1af22c68b9212971ab2570e14b7b0dc2  183301002           183265047         [180239864, +++ 183265048, 183270357, 183277363, 183300886, 183301001, --- 183301062]
rs-10         1af22c68b9212971ab2570e14b7b0dc2  183301063           183265047         [180239864, +++ 183265048, 183270357, 183277363, 183300886, 183301001, 183301062 ---]
rs-6          4b9e98c7eca7a24c74136de1aa8aeab0  189027036           189022619         [189022618, +++ 189027035, 189085155, 189085241, 189085290]
rs-4          e45ba292df95edbdf884e2ec50cf5f16  189099081           189062191         [184126535, +++ 189098947, 189099080, --- 189099226]
rs-4          83e65729dcad644738a0a3cee994e2df  189012454           189012365         [184103269, +++ 189012453, --- 189012538, 189074967, 189075016, 189075294, 189075349]
rs-10         83e65729dcad644738a0a3cee994e2df  189012539           189012365         [184103269, +++ 189012453, 189012538, --- 189074967, 189075016, 189075294, 189075349]
rs-3          11fca95de4878782af53371a25cf44d0  189121426           189058129         [180684344, +++ 189084916, 189121283, 189121425, --- 189121602]
rs-3          b9db001578e127740d7e0e186e4fbab6  189145458           189081436         [184175242, +++ 189083026, 189145417, 189145457, --- 189145562, 189145723, 189145781]
rs-2          262ca9ff7b878f32c451fac3eb430a88  189128535           189065879         [184159187, +++ 189091684, 189128534, --- 189128708]
rs-2          03a1eb906a344944aad727dbb8210cfc  172392082           172390331         [167737983, +++ 172392081, --- 172400093, 172446121, 172446172]
rs-10         ae2726c7b4eeec3f93336d71e80145a4  189027430           189026939         [184119428, +++ 189027429, --- 189053118, 189089933, 189089995, 189090059]
rs-10         770ba4f4568fff803e6df340b2ffe486  189034144           189032879         [184127026, +++ 189034143, --- 189048295, 189059834, 189096413, 189096513, 189096548, 189096606]
rs-1          5846f4ce8acdd5aabf325c847d18c729  18793501            18780639          [18778549, +++ 18784783, 18793471, 18793484, 18793500 ---]
rs-1          5846f4ce8acdd5aabf325c847d18c729  18793472            18780639          [18778549, +++ 18784783, 18793471, --- 18793484, 18793500]
rs-1          fabd3ea591d5f20a86a26f8767d34f63  189028498           189024357         [184116531, +++ 189025318, 189028497, --- 189051176, 189087488, 189087737, 189087850]
rs-1          335d855c5005343719ea73bcb7dcb269  189064849           189037338         [184130122, +++ 189064848, --- 189101485, 189101698, 189101774]

My question is: how do I recover from here? Any suggestions are welcome. My only thought so far is to replay the data by writing some MR jobs or scripts that read and replay selectively, and then update the checkpoints.

---
Mallikarjun
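
P.S. In case the +++ / --- markers in the barrier lists are hard to follow, here is a minimal Python sketch of how they can be read. It assumes that each barrier starts a range that runs up to (but not including) the next barrier, and that a sequence ID belongs to the range started by the largest barrier that is less than or equal to it. The function names `range_index` and `describe` are made up for illustration and are not part of HBase.

```python
from bisect import bisect_right

def range_index(barriers, seq_id):
    """Index of the barrier range containing seq_id.

    Assumes `barriers` is sorted ascending. Returns -1 if seq_id is
    smaller than the first barrier.
    """
    return bisect_right(barriers, seq_id) - 1

def describe(barriers, checkpoint, to_replicate):
    """Locate the checkpoint (+++) and the to-replicate seqid (---)."""
    cp = range_index(barriers, checkpoint)
    tr = range_index(barriers, to_replicate)
    return {
        "checkpoint_range": cp,
        "to_replicate_range": tr,
        # Number of range boundaries between the two positions.
        "ranges_behind": tr - cp,
    }

# First row of the table (rs-9, region 24c765b4...).
barriers = [17776286, 18762210, 18775053, 18775079, 18775104, 18775119]
print(describe(barriers, checkpoint=18762209, to_replicate=18775105))
```

For the first row this puts the checkpoint in the range starting at 17776286 and the to-replicate seqid in the range starting at 18775104, which matches where the +++ and --- markers sit in the table.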