We are using HBase 0.89.20100924, r1001068.

We are seeing it during heavy write load (which is all the time), but
yesterday we had read load as well as write load and saw both reads and
writes stop for 10+ seconds. The region size is the biggest clue we have
found from our tests: setting up a new cluster with a 1GB max region size
and starting to load heavily, we see this a lot and for long time frames.
Maybe the bigger files get hung up more easily during a split? Your
description below also fits, in that early on the load is not well
balanced, so it is easier for one stuck node to stop everything. I will
file a JIRA. I will also try to dig deeper into the logs during the
pauses to find a node that might be stuck in a split.
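
For reference, here is roughly how we have been setting the region size
while testing (a sketch, assuming the standard hbase.hregion.max.filesize
property in hbase-site.xml; 536870912 bytes = 512M):

  <!-- hbase-site.xml: maximum store file size before a region splits -->
  <property>
    <name>hbase.hregion.max.filesize</name>
    <value>536870912</value>
  </property>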


On Wed, Jan 12, 2011 at 11:17 AM, Stack <st...@duboce.net> wrote:

> On Tue, Jan 11, 2011 at 2:34 PM, Wayne <wav...@gmail.com> wrote:
> >  We have very frequent cluster wide pauses that stop all reads and writes
> > for seconds.
>
> All reads and all writes?
>
> I've seen the pause too for writes.  It's something I've always meant
> to look into.  Friso postulates one cause.  Another that we've talked
> of is a region taking a while to come back online after a split or a
> rebalance for whatever reason.  Client loading might be 'randomly'
> spraying over lots of regions, but the clients all get stuck waiting on
> one particular region to come back online.
>
> I suppose reads could be blocked for the same reason if all the clients
> are trying to read from the offlined region.
>
> What version of HBase are you using?  Splits should be faster in 0.90
> now that the split daughters come up on the same regionserver.
>
> Sorry I don't have a better answer for you.  Need to dig in.
>
> File a JIRA.  If you want to help out some, stick some data up in it.
> Some suggestions: enable logging of when we look up region locations
> in the client, and note when requests go to zero.  Can you figure out
> what region the clients are waiting on (if they are waiting on any)?
> If you can pull out a particular one, try to elicit its history at the
> time of the blockage.  Is it being moved, or is it mid-split?  I
> suppose it makes sense that bigger regions would make the situation
> 'worse'.  I can take a look at it too.
>
> St.Ack
>
> > We are constantly loading data to this cluster of 10 nodes.
> > These pauses can happen as frequently as every minute but sometimes are
> > not seen for 15+ minutes. Basically watching the region server list
> > with request counts is the only evidence of what is going on. All reads
> > and writes totally stop, and if there is ever any activity it is on the
> > node hosting the .META. table, with a request count of region count + 1.
> > This problem seems to be worse with a larger region size. We tried a
> > 1GB region size and saw this more than we saw actual activity (and
> > stopped using the larger region size because of it). We went back to
> > the default region size and it was better, but we had too many regions,
> > so now we are up to 512M for a region size and we are seeing it more
> > again.
> >
> > Does anyone know what this is? We have dug into all of the logs to find
> > some sort of pause but are not able to find anything. Is this a WAL
> > (HLog) roll? Is this a region split or compaction? Of course our
> > biggest fear is a GC pause on the master, but we do not have Java GC
> > logging turned on for the master to tell. What could possibly stop the
> > entire cluster from working for seconds at a time, very frequently?
> >
> > Thanks in advance for any ideas of what could be causing this.
> >
>
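
Following up on the suggestion above about logging region location lookups
in the client: this is roughly what we plan to add to the client-side
log4j.properties (a sketch, assuming the stock log4j setup; the exact
categories that cover the lookups may differ):

  # DEBUG on the HBase client package so region location lookups and
  # retries show up in the client logs
  log4j.logger.org.apache.hadoop.hbase.client=DEBUG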
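
We will also turn on GC logging for the master so we can rule a GC pause
in or out; a sketch of what we would add to hbase-env.sh (the
HBASE_MASTER_OPTS hook and the log path are our assumptions, the flags are
the usual HotSpot ones):

  # hbase-env.sh: write the master's GC activity to a file we can line up
  # against the pause timestamps
  export HBASE_MASTER_OPTS="$HBASE_MASTER_OPTS -verbose:gc -XX:+PrintGCDetails \
    -XX:+PrintGCTimeStamps -Xloggc:/var/log/hbase/master-gc.log"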
