Hello,
I had an interesting problem come up recently. We have a few thousand
regions across 8 datanode/regionservers. I made a change, increasing
the Hadoop heap size from 128M to 2048M, which ended up bringing the
cluster to a complete halt after about an hour. I reverted back to 128M
and turned things back on, but didn't realize at the time that the
cluster had come back up with 9 fewer regions than it started with.
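For concreteness, the change was the equivalent of the following in
conf/hadoop-env.sh (the value is in MB):

    # The maximum amount of heap to use, in MB.
    export HADOOP_HEAPSIZE=2048    # previously 128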
Upon further investigation, I found that all 9 missing regions were
from splits that occurred while the cluster was running, after the heap
change was made and before things came to a halt. There was a 10th
region (5 splits were involved in total) that managed to get recovered.

The really odd thing is that for the other 9 regions, the original
parent regions, which as far as I can tell from the logs had been
deleted, were re-opened when I restarted things. The daughter regions
were gone. Interestingly, I found the orphaned data blocks still
intact, and in at least some cases I have been able to extract the data
from them and will hopefully re-add it to the tables.
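As a rough sketch of what I mean by locating them: each region is a
directory under its table's directory in HDFS, so a listing like the
one below can be compared against the regions .META. knows about. This
is only a sketch; it assumes the default /hbase root dir and takes the
table name as an argument:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListRegionDirs {
      public static void main(String[] args) throws Exception {
        // Assumes hbase.rootdir is /hbase; table name is args[0].
        FileSystem fs = FileSystem.get(new Configuration());
        for (FileStatus status : fs.listStatus(new Path("/hbase", args[0]))) {
          // Each subdirectory is one region; a directory with no
          // matching row in .META. is an orphan left behind by a split.
          System.out.println(status.getPath().getName());
        }
      }
    }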
My question is this: based on the rather muddled description I've given
above, does anyone know what could possibly have happened here? My best
guess is that the bad state HDFS was in caused some critical step of
the split process to be skipped, which resulted in the references to
the parent regions sticking around while the references to the daughter
regions were lost.
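If that guess is right, the re-opened parents should still be sitting
in .META. carrying their daughter references. Something like the sketch
below should surface them; this is against the 0.20-style client API
and assumes the standard info:splitA/info:splitB catalog columns, so
treat it as illustrative rather than something I've verified:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    public class FindSplitParents {
      public static void main(String[] args) throws Exception {
        HTable meta = new HTable(new HBaseConfiguration(), ".META.");
        Scan scan = new Scan();
        scan.addFamily(Bytes.toBytes("info"));
        ResultScanner scanner = meta.getScanner(scan);
        try {
          for (Result row : scanner) {
            // A split parent keeps serialized HRegionInfo for its
            // daughters under info:splitA and info:splitB until the
            // daughters stop referencing it and it gets cleaned up.
            byte[] splitA = row.getValue(Bytes.toBytes("info"),
                Bytes.toBytes("splitA"));
            byte[] splitB = row.getValue(Bytes.toBytes("info"),
                Bytes.toBytes("splitB"));
            if (splitA != null || splitB != null) {
              System.out.println("split parent still in .META.: "
                  + Bytes.toString(row.getRow()));
            }
          }
        } finally {
          scanner.close();
        }
      }
    }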
Thanks for any insight you can provide.
--Brennon