From #2 in the initial email, the hbase:meta might not be the cause for the hotspot.
Saad:
Can you pastebin stack trace of the hot region server when this happens again?

Thanks

> On Dec 2, 2016, at 4:48 AM, Saad Mufti <[email protected]> wrote:
>
> We used a pre-split into 1024 regions at the start but we miscalculated our
> data size, so there were still auto-splits storms at the beginning as data
> size stabilized, it has ended up at around 9500 or so regions, plus a few
> thousand regions for a few other tables (much smaller). But haven't had any
> new auto-splits in a couple of months. And the hotspots only started
> happening recently.
>
> Our hashing scheme is very simple, we take the MD5 of the key, then form a
> 4 digit prefix based on the first two bytes of the MD5 normalized to be
> within the range 0-1023. I am fairly confident about this scheme
> especially since even during the hotspot we see no evidence so far that any
> particular region is taking disproportionate traffic (based on Cloudera
> Manager per region charts on the hotspot server). Does that look like a
> reasonable scheme to randomize which region any give key goes to? And the
> start of the hotspot doesn't seem to correspond to any region splitting or
> moving from one server to another activity.
>
> Thanks.
>
> ----
> Saad
>
>
>> On Thu, Dec 1, 2016 at 3:32 PM, John Leach <[email protected]> wrote:
>>
>> Saad,
>>
>> Region move or split causes client connections to simultaneously refresh
>> their meta.
>>
>> Key word is supposed. We have seen meta hot spotting from time to time
>> and on different versions at Splice Machine.
>>
>> How confident are you in your hashing algorithm?
>>
>> Regards,
>> John Leach
>>
>>
>>
>>> On Dec 1, 2016, at 2:25 PM, Saad Mufti <[email protected]> wrote:
>>>
>>> No never thought about that. I just figured out how to locate the server
>>> for that table after you mentioned it. We'll have to keep an eye on it
>>> next time we have a hotspot to see if it coincides with the hotspot
>>> server.
>>>
>>> What would be the theory for how it could become a hotspot? Isn't the
>>> client supposed to cache it and only go back for a refresh if it hits a
>>> region that is not in its expected location?
>>>
>>> ----
>>> Saad
>>>
>>>
>>> On Thu, Dec 1, 2016 at 2:56 PM, John Leach <[email protected]> wrote:
>>>
>>>> Saad,
>>>>
>>>> Did you validate that Meta is not on the “Hot” region server?
>>>>
>>>> Regards,
>>>> John Leach
>>>>
>>>>
>>>>
>>>>> On Dec 1, 2016, at 1:50 PM, Saad Mufti <[email protected]> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> We are using HBase 1.0 on CDH 5.5.2 . We have taken great care to avoid
>>>>> hotspotting due to inadvertent data patterns by prepending an MD5
>>>>> based 4 digit hash prefix to all our data keys. This works fine most
>>>>> of the times, but more and more (as much as once or twice a day)
>>>>> recently we have occasions where one region server suddenly becomes
>>>>> "hot" (CPU above or around 95% in various monitoring tools). When it
>>>>> happens it lasts for hours, occasionally the hotspot might jump to
>>>>> another region server as the master decide the region is unresponsive
>>>>> and gives its region to another server.
>>>>>
>>>>> For the longest time, we thought this must be some single rogue key in
>>>>> our input data that is being hammered. All attempts to track this down
>>>>> have failed though, and the following behavior argues against this
>>>>> being application based:
>>>>>
>>>>> 1. plotted Get and Put rate by region on the "hot" region server in
>>>>> Cloudera Manager Charts, shows no single region is an outlier.
>>>>>
>>>>> 2. cleanly restarting just the region server process causes its regions
>>>>> to randomly migrate to other region servers, then it gets new ones
>>>>> from the HBase master, basically a sort of shuffling, then the hotspot
>>>>> goes away. If it were application based, you'd expect the hotspot to
>>>>> just jump to another region server.
>>>>>
>>>>> 3. have pored through region server logs and can't see anything out of
>>>>> the ordinary happening
>>>>>
>>>>> The only other pertinent thing to mention might be that we have a
>>>>> special process of our own running outside the cluster that does
>>>>> cluster wide major compaction in a rolling fashion, where each batch
>>>>> consists of one region from each region server, and it waits before
>>>>> one batch is completely done before starting another. We have seen no
>>>>> real impact on the hotspot from shutting this down and in normal times
>>>>> it doesn't impact our read or write performance much.
>>>>>
>>>>> We are at our wit's end, anyone have experience with a scenario like
>>>>> this? Any help/guidance would be most appreciated.
>>>>>
>>>>> -----
>>>>> Saad
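
For readers following the hashing discussion, here is a minimal sketch of the key-prefix scheme Saad describes (MD5 of the key, first two bytes mapped into 0-1023, rendered as a 4 digit prefix). It is not the code from the thread. The class and method names are hypothetical, and "normalized" is assumed here to mean the unsigned 16-bit value modulo 1024; the thread does not say how the normalization is actually done, nor how keys are encoded (UTF-8 strings are assumed).

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

// Hypothetical helper illustrating an MD5-based 4 digit salt prefix.
final class SaltedKey {
    // Matches the 1024-region pre-split mentioned in the thread.
    static final int BUCKETS = 1024;

    // Maps a key to a bucket in [0, 1023] using the first two MD5 bytes.
    // Assumption: "normalized" means unsigned 16-bit value modulo 1024.
    static int bucket(String key) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(key.getBytes(StandardCharsets.UTF_8));
            // Combine the first two bytes as an unsigned big-endian value.
            int firstTwo = ((digest[0] & 0xFF) << 8) | (digest[1] & 0xFF);
            return firstTwo % BUCKETS;
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Prepends the zero-padded 4 digit bucket to the original key.
    static String saltedKey(String key) {
        return String.format(Locale.ROOT, "%04d", bucket(key)) + key;
    }
}
```

Under that assumption each bucket receives the same number of 16-bit values, because 65536 is a multiple of 1024, so the prefix itself should not skew keys toward any region. That agrees with the per-region charts showing no outlier, though it does not rule out a single hot key inside an otherwise ordinary region.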
