Hi Ted,

Finally we have another hotspot going on, with the same symptoms as before.
Here is a pastebin of the stack trace from the hot region server, which I
obtained via VisualVM:

http://pastebin.com/qbXPPrXk

Would really appreciate any insight you or anyone else can provide.

Thanks.

----
Saad


On Thu, Dec 1, 2016 at 6:08 PM, Saad Mufti <saad.mu...@gmail.com> wrote:

> Sure will, the next time it happens.
>
> Thanks!!!
>
> ----
> Saad
>
>
> On Thu, Dec 1, 2016 at 5:01 PM, Ted Yu <ted...@yahoo.com.invalid> wrote:
>
>> Based on #2 in the initial email, hbase:meta might not be the cause of
>> the hotspot.
>>
>> Saad:
>> Can you pastebin the stack trace of the hot region server when this
>> happens again?
>>
>> Thanks
>>
>> > On Dec 2, 2016, at 4:48 AM, Saad Mufti <saad.mu...@gmail.com> wrote:
>> >
>> > We pre-split into 1024 regions at the start, but we miscalculated our
>> > data size, so there were still auto-split storms early on as the data
>> > size stabilized. The table has ended up at around 9500 regions, plus a
>> > few thousand regions for a few other (much smaller) tables. But we
>> > haven't had any new auto-splits in a couple of months, and the hotspots
>> > only started happening recently.
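>> >
>> > (For anyone curious, a pre-split like ours can be done along these
>> > lines -- a simplified sketch, not our actual setup code; "admin" and
>> > "tableDescriptor" are the usual HBase client objects, and Bytes is
>> > org.apache.hadoop.hbase.util.Bytes:)
>> >
>> >     // Sketch only: one region per 4-digit prefix bucket. Split keys
>> >     // "0001".."1023" give 1024 regions in total.
>> >     byte[][] splitKeys = new byte[1023][];
>> >     for (int i = 1; i < 1024; i++) {
>> >         splitKeys[i - 1] = Bytes.toBytes(String.format("%04d", i));
>> >     }
>> >     admin.createTable(tableDescriptor, splitKeys);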
>> >
>> > Our hashing scheme is very simple: we take the MD5 of the key, then
>> > form a 4-digit prefix based on the first two bytes of the MD5,
>> > normalized to be within the range 0-1023. I am fairly confident about
>> > this scheme, especially since even during the hotspot we see no
>> > evidence so far that any particular region is taking disproportionate
>> > traffic (based on Cloudera Manager per-region charts on the hotspot
>> > server). Does that look like a reasonable scheme to randomize which
>> > region any given key goes to? Also, the start of the hotspot doesn't
>> > seem to correspond to any region splitting or moving from one server
>> > to another.
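>> >
>> > (Roughly, the prefix computation looks something like this -- a
>> > simplified sketch, not our exact production code; the normalization is
>> > shown as a simple mod, the real mapping may differ slightly. Uses
>> > java.security.MessageDigest and java.nio.charset.StandardCharsets:)
>> >
>> >     // Sketch: MD5 the original key, take the first two bytes, map
>> >     // them into 0-1023, and prepend the bucket as a 4-digit prefix.
>> >     static String saltedKey(String key) throws Exception {
>> >         byte[] md5 = MessageDigest.getInstance("MD5")
>> >                 .digest(key.getBytes(StandardCharsets.UTF_8));
>> >         int firstTwo = ((md5[0] & 0xFF) << 8) | (md5[1] & 0xFF);
>> >         int bucket = firstTwo % 1024;   // normalize to 0-1023
>> >         return String.format("%04d", bucket) + key;
>> >     }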
>> >
>> > Thanks.
>> >
>> > ----
>> > Saad
>> >
>> >
>> >> On Thu, Dec 1, 2016 at 3:32 PM, John Leach <jle...@splicemachine.com> wrote:
>> >>
>> >> Saad,
>> >>
>> >> A region move or split causes client connections to simultaneously
>> >> refresh their meta cache.
>> >>
>> >> Key word is "supposed". We have seen meta hotspotting from time to
>> >> time, and on different versions, at Splice Machine.
>> >>
>> >> How confident are you in your hashing algorithm?
>> >>
>> >> Regards,
>> >> John Leach
>> >>
>> >>
>> >>
>> >>> On Dec 1, 2016, at 2:25 PM, Saad Mufti <saad.mu...@gmail.com> wrote:
>> >>>
>> >>> No, never thought about that. I just figured out how to locate the
>> >>> server for that table after you mentioned it. We'll have to keep an
>> >>> eye on it next time we have a hotspot to see if it coincides with the
>> >>> hot server.
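>> >>>
>> >>> (For anyone following along, something like this shows which server
>> >>> is hosting hbase:meta -- a minimal sketch using the standard client
>> >>> API; "conf" is the usual HBaseConfiguration, and error handling and
>> >>> imports are omitted:)
>> >>>
>> >>>     try (Connection conn = ConnectionFactory.createConnection(conf);
>> >>>          RegionLocator locator =
>> >>>              conn.getRegionLocator(TableName.META_TABLE_NAME)) {
>> >>>         // Print each hbase:meta region and the server hosting it.
>> >>>         for (HRegionLocation loc : locator.getAllRegionLocations()) {
>> >>>             System.out.println(loc.getRegionInfo().getRegionNameAsString()
>> >>>                 + " -> " + loc.getServerName());
>> >>>         }
>> >>>     }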
>> >>>
>> >>> What would be the theory for how it could become a hotspot? Isn't the
>> >>> client supposed to cache it and only go back for a refresh if it hits
>> >>> a region that is not in its expected location?
>> >>>
>> >>> ----
>> >>> Saad
>> >>>
>> >>>
>> >>> On Thu, Dec 1, 2016 at 2:56 PM, John Leach <jle...@splicemachine.com> wrote:
>> >>>
>> >>>> Saad,
>> >>>>
>> >>>> Did you validate that Meta is not on the “Hot” region server?
>> >>>>
>> >>>> Regards,
>> >>>> John Leach
>> >>>>
>> >>>>
>> >>>>
>> >>>>> On Dec 1, 2016, at 1:50 PM, Saad Mufti <saad.mu...@gmail.com> wrote:
>> >>>>>
>> >>>>> Hi,
>> >>>>>
>> >>>>> We are using HBase 1.0 on CDH 5.5.2. We have taken great care to
>> >>>>> avoid hotspotting due to inadvertent data patterns by prepending
>> >>>>> an MD5-based 4-digit hash prefix to all our data keys. This works
>> >>>>> fine most of the time, but more and more often recently (as much
>> >>>>> as once or twice a day) we have occasions where one region server
>> >>>>> suddenly becomes "hot" (CPU above or around 95% in various
>> >>>>> monitoring tools). When it happens it lasts for hours, and
>> >>>>> occasionally the hotspot jumps to another region server as the
>> >>>>> master decides the region server is unresponsive and gives its
>> >>>>> regions to another server.
>> >>>>>
>> >>>>> For the longest time we thought this must be some single rogue key
>> >>>>> in our input data that is being hammered. All attempts to track
>> >>>>> this down have failed, though, and the following behavior argues
>> >>>>> against this being application-based:
>> >>>>>
>> >>>>> 1. Plotting the Get and Put rates by region on the "hot" region
>> >>>>> server in Cloudera Manager charts shows no single region is an
>> >>>>> outlier.
>> >>>>>
>> >>>>> 2. Cleanly restarting just the region server process causes its
>> >>>>> regions to randomly migrate to other region servers, then it gets
>> >>>>> new ones from the HBase master (basically a sort of shuffling), and
>> >>>>> then the hotspot goes away. If it were application-based, you'd
>> >>>>> expect the hotspot to simply jump to another region server.
>> >>>>>
>> >>>>> 3. We have pored through the region server logs and can't see
>> >>>>> anything out of the ordinary happening.
>> >>>>>
>> >>>>> The only other pertinent thing to mention might be that we have a
>> >>>>> special process of our own, running outside the cluster, that does
>> >>>>> cluster-wide major compaction in a rolling fashion, where each
>> >>>>> batch consists of one region from each region server, and it waits
>> >>>>> until one batch is completely done before starting the next.
>> >>>>> Shutting this down has had no real impact on the hotspot, and in
>> >>>>> normal times it doesn't impact our read or write performance much.
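>> >>>>>
>> >>>>> (For completeness, that rolling compactor is roughly along these
>> >>>>> lines -- a simplified sketch, not the real tool; "admin" is the
>> >>>>> standard org.apache.hadoop.hbase.client.Admin, the state poll uses
>> >>>>> the 1.x AdminProtos enum, and imports/checked exceptions are
>> >>>>> omitted:)
>> >>>>>
>> >>>>>     // Sketch only: group online regions by region server.
>> >>>>>     Map<ServerName, List<HRegionInfo>> byServer = new HashMap<>();
>> >>>>>     for (ServerName sn : admin.getClusterStatus().getServers()) {
>> >>>>>         byServer.put(sn, admin.getOnlineRegions(sn));
>> >>>>>     }
>> >>>>>     int batch = 0;
>> >>>>>     boolean more = true;
>> >>>>>     while (more) {
>> >>>>>         more = false;
>> >>>>>         List<byte[]> inFlight = new ArrayList<>();
>> >>>>>         // Each batch: at most one region from every region server.
>> >>>>>         for (List<HRegionInfo> regions : byServer.values()) {
>> >>>>>             if (batch < regions.size()) {
>> >>>>>                 byte[] name = regions.get(batch).getRegionName();
>> >>>>>                 admin.majorCompactRegion(name);
>> >>>>>                 inFlight.add(name);
>> >>>>>                 more = true;
>> >>>>>             }
>> >>>>>         }
>> >>>>>         // Wait for the whole batch before starting the next one.
>> >>>>>         for (byte[] name : inFlight) {
>> >>>>>             while (admin.getCompactionStateForRegion(name)
>> >>>>>                     != AdminProtos.GetRegionInfoResponse
>> >>>>>                            .CompactionState.NONE) {
>> >>>>>                 Thread.sleep(10000L);
>> >>>>>             }
>> >>>>>         }
>> >>>>>         batch++;
>> >>>>>     }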
>> >>>>>
>> >>>>> We are at our wits' end. Does anyone have experience with a
>> >>>>> scenario like this? Any help/guidance would be most appreciated.
>> >>>>>
>> >>>>> -----
>> >>>>> Saad
>> >>>>
>> >>>>
>> >>
>> >>
>>
>
>
