Hi Ted,

We finally have another hotspot going on, with the same symptoms as before. Here is the pastebin for the stack trace from the region server, which I obtained via VisualVM:
http://pastebin.com/qbXPPrXk

Would really appreciate any insight you or anyone else can provide.

Thanks.

----
Saad

On Thu, Dec 1, 2016 at 6:08 PM, Saad Mufti <saad.mu...@gmail.com> wrote:

> Sure will, the next time it happens.
>
> Thanks!!!
>
> ----
> Saad
>
>
> On Thu, Dec 1, 2016 at 5:01 PM, Ted Yu <ted...@yahoo.com.invalid> wrote:
>
>> From #2 in the initial email, the hbase:meta might not be the cause for
>> the hotspot.
>>
>> Saad:
>> Can you pastebin stack trace of the hot region server when this happens
>> again?
>>
>> Thanks
>>
>> > On Dec 2, 2016, at 4:48 AM, Saad Mufti <saad.mu...@gmail.com> wrote:
>> >
>> > We used a pre-split into 1024 regions at the start but we miscalculated
>> > our data size, so there were still auto-split storms at the beginning as
>> > data size stabilized. It has ended up at around 9500 or so regions, plus
>> > a few thousand regions for a few other tables (much smaller). But we
>> > haven't had any new auto-splits in a couple of months. And the hotspots
>> > only started happening recently.
>> >
>> > Our hashing scheme is very simple: we take the MD5 of the key, then form
>> > a 4 digit prefix based on the first two bytes of the MD5 normalized to be
>> > within the range 0-1023. I am fairly confident about this scheme,
>> > especially since even during the hotspot we see no evidence so far that
>> > any particular region is taking disproportionate traffic (based on
>> > Cloudera Manager per region charts on the hotspot server). Does that look
>> > like a reasonable scheme to randomize which region any given key goes to?
>> > And the start of the hotspot doesn't seem to correspond to any region
>> > split or any region move from one server to another.
>> >
>> > Thanks.
>> >
>> > ----
>> > Saad
>> >
>> >
>> >> On Thu, Dec 1, 2016 at 3:32 PM, John Leach <jle...@splicemachine.com>
>> >> wrote:
>> >>
>> >> Saad,
>> >>
>> >> Region move or split causes client connections to simultaneously
>> >> refresh their meta.
>> >>
>> >> Key word is supposed. We have seen meta hot spotting from time to time
>> >> and on different versions at Splice Machine.
>> >>
>> >> How confident are you in your hashing algorithm?
>> >>
>> >> Regards,
>> >> John Leach
>> >>
>> >>
>> >>
>> >>> On Dec 1, 2016, at 2:25 PM, Saad Mufti <saad.mu...@gmail.com> wrote:
>> >>>
>> >>> No, never thought about that. I just figured out how to locate the
>> >>> server for that table after you mentioned it. We'll have to keep an
>> >>> eye on it next time we have a hotspot to see if it coincides with the
>> >>> hotspot server.
>> >>>
>> >>> What would be the theory for how it could become a hotspot? Isn't the
>> >>> client supposed to cache it and only go back for a refresh if it hits
>> >>> a region that is not in its expected location?
>> >>>
>> >>> ----
>> >>> Saad
>> >>>
>> >>>
>> >>> On Thu, Dec 1, 2016 at 2:56 PM, John Leach <jle...@splicemachine.com>
>> >>> wrote:
>> >>>
>> >>>> Saad,
>> >>>>
>> >>>> Did you validate that Meta is not on the “Hot” region server?
>> >>>>
>> >>>> Regards,
>> >>>> John Leach
>> >>>>
>> >>>>
>> >>>>
>> >>>>> On Dec 1, 2016, at 1:50 PM, Saad Mufti <saad.mu...@gmail.com>
>> >>>>> wrote:
>> >>>>>
>> >>>>> Hi,
>> >>>>>
>> >>>>> We are using HBase 1.0 on CDH 5.5.2. We have taken great care to
>> >>>>> avoid hotspotting due to inadvertent data patterns by prepending an
>> >>>>> MD5 based 4 digit hash prefix to all our data keys. This works fine
>> >>>>> most of the time, but more and more recently (as much as once or
>> >>>>> twice a day) we have occasions where one region server suddenly
>> >>>>> becomes "hot" (CPU above or around 95% in various monitoring tools).
>> >>>>> When it happens it lasts for hours; occasionally the hotspot might
>> >>>>> jump to another region server as the master decides the region is
>> >>>>> unresponsive and gives its region to another server.
>> >>>>>
>> >>>>> For the longest time, we thought this must be some single rogue key
>> >>>>> in our input data that is being hammered. All attempts to track this
>> >>>>> down have failed though, and the following behavior argues against
>> >>>>> this being application based:
>> >>>>>
>> >>>>> 1. plotted Get and Put rate by region on the "hot" region server in
>> >>>>> Cloudera Manager Charts, shows no single region is an outlier.
>> >>>>>
>> >>>>> 2. cleanly restarting just the region server process causes its
>> >>>>> regions to randomly migrate to other region servers, then it gets
>> >>>>> new ones from the HBase master, basically a sort of shuffling, then
>> >>>>> the hotspot goes away. If it were application based, you'd expect
>> >>>>> the hotspot to just jump to another region server.
>> >>>>>
>> >>>>> 3. have pored through region server logs and can't see anything out
>> >>>>> of the ordinary happening
>> >>>>>
>> >>>>> The only other pertinent thing to mention might be that we have a
>> >>>>> special process of our own running outside the cluster that does
>> >>>>> cluster wide major compaction in a rolling fashion, where each batch
>> >>>>> consists of one region from each region server, and it waits until
>> >>>>> one batch is completely done before starting another. We have seen
>> >>>>> no real impact on the hotspot from shutting this down and in normal
>> >>>>> times it doesn't impact our read or write performance much.
>> >>>>>
>> >>>>> We are at our wit's end, anyone have experience with a scenario like
>> >>>>> this? Any help/guidance would be most appreciated.
>> >>>>>
>> >>>>> -----
>> >>>>> Saad
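
For anyone following the thread, the key-prefix scheme described in the quoted messages (MD5 of the key, then a 4 digit prefix from the first two bytes, in the range 0-1023) could look roughly like the sketch below. This is only an illustration in Python, not the code actually running on the cluster. The names salted_key and bucket_counts are made up for this example, and two details are assumptions because the thread does not state them: the two bytes are read as a big-endian unsigned integer and reduced modulo 1024, and the prefix is prepended directly with no separator.

```python
import hashlib
from collections import Counter

NUM_BUCKETS = 1024  # matches the 1024-region pre-split mentioned in the thread


def salted_key(key: bytes) -> bytes:
    """Prepend a 4-digit bucket prefix derived from the MD5 of the key."""
    digest = hashlib.md5(key).digest()
    # Assumption: the thread only says "normalized to 0-1023". Here the first
    # two bytes are read as a big-endian unsigned integer (0-65535) and
    # reduced modulo 1024.
    bucket = int.from_bytes(digest[:2], "big") % NUM_BUCKETS
    # Assumption: zero-padded decimal prefix, no separator before the key.
    return b"%04d" % bucket + key


def bucket_counts(keys):
    """Count how many keys fall into each 4-digit prefix bucket."""
    return Counter(salted_key(k)[:4] for k in keys)


if __name__ == "__main__":
    # Hypothetical sample keys, only to eyeball how evenly buckets fill up.
    counts = bucket_counts(f"user-{i}".encode() for i in range(100_000))
    print(len(counts), min(counts.values()), max(counts.values()))
```

Since 65536 is an exact multiple of 1024, the modulo step in this sketch does not favor any bucket. Note that it only measures how evenly distinct keys spread across prefixes. It says nothing about how requests are distributed over those keys, which is what the per-region Get and Put charts mentioned in the thread would show.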