I have ended up in the following situation. I won't go into how I got
here, since I have not been able to reproduce it reliably so far.
(It typically happens when a regionserver goes down because of hardware
issues.)

The columns in the table below are:
Col 1: Regionserver on which the region is trying to replicate
Col 2: Region that is trying to replicate but is stuck
Col 3: Sequence ID that is being replicated and is stuck because the
previous range is not finished
Col 4: Checkpoint in ZK, i.e. the sequence ID up to which edits have
already been replicated to the peer
Col 5: Replication barriers for that region. This is the list of open
sequence IDs recorded on each region movement. (+++ marks where the
*checkpoint* falls, --- marks where the *to replicate seqid* falls)
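To make the column definitions concrete, here is a minimal Python sketch of how the +++ and --- positions can be derived from a barrier list. It is a simplified model of the rule described above, not HBase's actual implementation: it assumes each barrier opens a range that runs up to the next barrier, and that a sequence ID is blocked until the checkpoint has reached the last sequence ID before its range. The function names are made up for illustration; the numbers are from the first row of the table.

```python
from bisect import bisect_right

def range_index(barriers, seq_id):
    """Index of the barrier range containing seq_id (-1 if before the first barrier)."""
    return bisect_right(barriers, seq_id) - 1

def is_blocked(barriers, checkpoint, seq_id):
    """Assumed rule: seq_id is blocked until the checkpoint reaches
    (start of its range - 1), i.e. until every earlier range is finished."""
    idx = range_index(barriers, seq_id)
    return idx > 0 and checkpoint < barriers[idx] - 1

# First row of the table (region 24c765b42253f96b550831d83e99cc9e on rs-9)
barriers = [17776286, 18762210, 18775053, 18775079, 18775104, 18775119]
checkpoint = 18762209
to_replicate = 18775105

print(range_index(barriers, checkpoint))             # 0: the +++ position
print(range_index(barriers, to_replicate))           # 4: the --- position
print(is_blocked(barriers, checkpoint, to_replicate))  # True
```

Under this model the checkpoint sits at the very end of range 0, while the sequence ID being replicated is in range 4, so ranges 1 to 3 are what remain unfinished.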

There are 53 regions and 10 regionservers in total.

RegionServer  Region                            Seq ID to replicate  Replicated until  Current barriers
rs-9          24c765b42253f96b550831d83e99cc9e  18775105             18762209          [17776286, +++ 18762210, 18775053, 18775079, 18775104, --- 18775119]
rs-5          b4144bfe75c5826710ec54849741b038  189154192            189091221         [184183678, +++ 189117430, 189154191, --- 189154327]
rs-8          deb6fee3380e7b9db9826cb5f27f8a59  189099509            189036510         [180662218, +++ 189062798, 189099508, --- 189099587]
rs-8          3338fd34ae7ba06a7eccd89048fa83ce  189078951            189077722         [184170310, +++ 189078876, 189078950, --- 189104780, 189141509, 189141545, 189141595]
rs-6          1af22c68b9212971ab2570e14b7b0dc2  183301002            183265047         [180239864, +++ 183265048, 183270357, 183277363, 183300886, 183301001, --- 183301062]
rs-10         1af22c68b9212971ab2570e14b7b0dc2  183301063            183265047         [180239864, +++ 183265048, 183270357, 183277363, 183300886, 183301001, 183301062 ---]
rs-6          4b9e98c7eca7a24c74136de1aa8aeab0  189027036            189022619         [189022618, +++ 189027035, 189085155, 189085241, 189085290]
rs-4          e45ba292df95edbdf884e2ec50cf5f16  189099081            189062191         [184126535, +++ 189098947, 189099080, --- 189099226]
rs-4          83e65729dcad644738a0a3cee994e2df  189012454            189012365         [184103269, +++ 189012453, --- 189012538, 189074967, 189075016, 189075294, 189075349]
rs-10         83e65729dcad644738a0a3cee994e2df  189012539            189012365         [184103269, +++ 189012453, 189012538, --- 189074967, 189075016, 189075294, 189075349]
rs-3          11fca95de4878782af53371a25cf44d0  189121426            189058129         [180684344, +++ 189084916, 189121283, 189121425, --- 189121602]
rs-3          b9db001578e127740d7e0e186e4fbab6  189145458            189081436         [184175242, +++ 189083026, 189145417, 189145457, --- 189145562, 189145723, 189145781]
rs-2          262ca9ff7b878f32c451fac3eb430a88  189128535            189065879         [184159187, +++ 189091684, 189128534, --- 189128708]
rs-2          03a1eb906a344944aad727dbb8210cfc  172392082            172390331         [167737983, +++ 172392081, --- 172400093, 172446121, 172446172]
rs-10         ae2726c7b4eeec3f93336d71e80145a4  189027430            189026939         [184119428, +++ 189027429, --- 189053118, 189089933, 189089995, 189090059]
rs-10         770ba4f4568fff803e6df340b2ffe486  189034144            189032879         [184127026, +++ 189034143, --- 189048295, 189059834, 189096413, 189096513, 189096548, 189096606]
rs-1          5846f4ce8acdd5aabf325c847d18c729  18793501             18780639          [18778549, +++ 18784783, 18793471, 18793484, 18793500 ---]
rs-1          5846f4ce8acdd5aabf325c847d18c729  18793472             18780639          [18778549, +++ 18784783, 18793471, --- 18793484, 18793500]
rs-1          fabd3ea591d5f20a86a26f8767d34f63  189028498            189024357         [184116531, +++ 189025318, 189028497, --- 189051176, 189087488, 189087737, 189087850]
rs-1          335d855c5005343719ea73bcb7dcb269  189064849            189037338         [184130122, +++ 189064848, --- 189101485, 189101698, 189101774]


My question is: how do I recover from here? Any suggestions are welcome.

My only idea so far is to write some MR jobs or scripts that read and
selectively replay the missing edits, and then update the checkpoints.
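As a rough illustration of the planning half of that idea, the Python sketch below computes, per region, the sequence-ID interval between the checkpoint and the lowest stuck sequence ID. It covers only the arithmetic: actually reading the edits, replaying them, and updating the checkpoints are not shown, and it cannot tell whether any edits in the interval already reached the peer. The function name and the tuple layout are assumptions made for the example; the values are copied from three rows of the table.

```python
from collections import defaultdict

# (regionserver, region, seq id to replicate, checkpoint), copied from the table
rows = [
    ("rs-6", "1af22c68b9212971ab2570e14b7b0dc2", 183301002, 183265047),
    ("rs-10", "1af22c68b9212971ab2570e14b7b0dc2", 183301063, 183265047),
    ("rs-9", "24c765b42253f96b550831d83e99cc9e", 18775105, 18762209),
]

def replay_gaps(rows):
    """Per region, return the inclusive (first, last) sequence IDs that lie
    between the checkpoint and the lowest stuck sequence ID."""
    by_region = defaultdict(list)
    for _rs, region, to_replicate, checkpoint in rows:
        by_region[region].append((to_replicate, checkpoint))

    gaps = {}
    for region, entries in by_region.items():
        # A region can be stuck on more than one regionserver; use the
        # lowest checkpoint and the lowest stuck sequence ID across them.
        checkpoint = min(cp for _seq, cp in entries)
        first_stuck = min(seq for seq, _cp in entries)
        gaps[region] = (checkpoint + 1, first_stuck - 1)
    return gaps

gaps = replay_gaps(rows)
for region, (first, last) in gaps.items():
    print(region, first, last, last - first + 1)
```

Grouping by region matters here because some regions appear twice in the table, once for each regionserver that still holds edits for them.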

---
Mallikarjun
