Jon asks that you describe your loading in the issue. Would you mind doing so? Ted, stick up in the issue the workload and configs you are running, if you don't mind. I'd like to try it over here. Thanks lads, St.Ack
On Wed, Jan 12, 2011 at 9:03 AM, Wayne <wav...@gmail.com> wrote:
> Added: https://issues.apache.org/jira/browse/HBASE-3438.
>
> On Wed, Jan 12, 2011 at 11:40 AM, Wayne <wav...@gmail.com> wrote:
>
>> We are using 0.89.20100924, r1001068.
>>
>> We are seeing it during heavy write load (which is all the time), but
>> yesterday we had read load as well as write load and saw both reads and
>> writes stop for 10+ seconds. The region size is the biggest clue we have
>> found from our tests: setting up a new cluster with a 1GB max region size
>> and starting to load heavily, we see this a lot and for long stretches.
>> Maybe the bigger file gets hung up more easily in a split? Your
>> description below also fits, in that early on the load is not balanced,
>> so it is easier to stop everything on one node while the balance is still
>> poor. I will file a JIRA. I will also try to dig deeper into the logs
>> during the pauses to find a node that might be stuck in a split.
>>
>> On Wed, Jan 12, 2011 at 11:17 AM, Stack <st...@duboce.net> wrote:
>>
>>> On Tue, Jan 11, 2011 at 2:34 PM, Wayne <wav...@gmail.com> wrote:
>>> > We have very frequent cluster-wide pauses that stop all reads and
>>> > writes for seconds.
>>>
>>> All reads and all writes?
>>>
>>> I've seen the pause too for writes. It's something I've always meant to
>>> look into. Friso postulates one cause. Another we've talked of is a
>>> region taking a while to come back online after a split or a rebalance,
>>> for whatever reason. Client loading might be 'randomly' spraying over
>>> lots of regions, but the clients all get stuck waiting on one particular
>>> region to come back online.
>>>
>>> I suppose reads could be blocked for the same reason, if all are trying
>>> to read from the offlined region.
>>>
>>> What version of HBase are you using? Splits should be faster in 0.90 now
>>> that the split daughters come up on the same regionserver.
>>>
>>> Sorry I don't have a better answer for you. Need to dig in.
>>>
>>> File a JIRA. If you want to help out some, stick some data up in it. One
>>> suggestion would be to enable logging of when we look up region
>>> locations in the client and then note when requests go to zero. Can you
>>> figure out which region the clients are waiting on (if they are waiting
>>> on any)? If you can pull out a particular one, try to elicit its history
>>> at the time of the blockage. Is it being moved, or mid-split? I suppose
>>> it makes sense that bigger regions would make the situation 'worse'. I
>>> can take a look at it too.
>>>
>>> St.Ack
>>>
>>> > We are constantly loading data to this cluster of 10 nodes. These
>>> > pauses can happen as frequently as every minute, but sometimes are not
>>> > seen for 15+ minutes. Basically, watching the region server list with
>>> > request counts is the only evidence of what is going on. All reads and
>>> > writes totally stop, and if there is ever any activity it is on the
>>> > node hosting the .META. table, with a request count of region count
>>> > + 1. This problem seems to be worse with a larger region size. We
>>> > tried a 1GB region size and saw this more than we saw actual activity
>>> > (and stopped using the larger region size because of it). We went back
>>> > to the default region size and it was better, but we had too many
>>> > regions, so now we are up to 512M for a region size and we are seeing
>>> > it more again.
>>> >
>>> > Does anyone know what this is?
>>> > We have dug into all of the logs to find some sort of pause but are
>>> > not able to find anything. Is this a WAL (HLog) roll? Is this a region
>>> > split or a compaction? Of course our biggest fear is a GC pause on the
>>> > master, but we do not have JVM GC logging turned on for the master to
>>> > tell. What could possibly stop the entire cluster from working for
>>> > seconds at a time, so frequently?
>>> >
>>> > Thanks in advance for any ideas of what could be causing this.
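
To act on Stack's suggestion above about logging the client's region location lookups: with the stock log4j setup, the blunt approach is to put the whole client package at DEBUG in the client process's log4j.properties. A minimal sketch only; the exact classes that emit the lookup messages vary between versions, and the second logger is optional:

  # log4j.properties on the client side (the process issuing the puts/gets).
  # DEBUG on the client package surfaces region location lookups, cache
  # invalidations and retry sleeps.
  log4j.logger.org.apache.hadoop.hbase.client=DEBUG

  # Optionally watch the ZooKeeper side of the -ROOT-/.META. lookups too.
  log4j.logger.org.apache.hadoop.hbase.zookeeper=DEBUG

Correlating the timestamps of the lookup/retry lines with the moments the regionserver request counts drop to zero should show whether the clients are all parked on the same region.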
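On the GC fear at the end of the thread: turning on GC logging for the master (and perhaps the regionservers too) is a restart-only change in hbase-env.sh. A sketch, assuming a Sun JDK 6 JVM and an hbase-env.sh that honours the per-daemon *_OPTS variables (older scripts may only offer HBASE_OPTS, which applies to every daemon); the log paths are examples, and -XX:+PrintGCDateStamps needs a reasonably recent JDK 6 (fall back to -XX:+PrintGCTimeStamps otherwise):

  # hbase-env.sh -- one GC log per daemon, so pauses can be lined up
  # against the cluster-wide stalls
  export HBASE_MASTER_OPTS="$HBASE_MASTER_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/hbase/gc-master.log"
  export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/hbase/gc-regionserver.log"

Any stop-the-world collection long enough to explain a 10+ second stall should be obvious in the log at the matching timestamp.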
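For reference, the region sizes being compared in the thread (1GB, the default, 512M) are all the same knob, hbase.hregion.max.filesize, in hbase-site.xml. The 512M value mentioned above would look roughly like:

  <property>
    <name>hbase.hregion.max.filesize</name>
    <!-- 536870912 bytes = 512MB; a store growing past this triggers a split -->
    <value>536870912</value>
  </property>

Changing it only affects future split decisions; existing regions keep their current boundaries.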