We are seeing some TCP resets on all nodes at the same time, and sometimes
quite a lot of them. We have not yet correlated the pauses to the TCP
resets, but I am starting to wonder if this is partly a network problem.
Does Gigabit Ethernet break down on high-volume nodes? Do high-volume
nodes typically use 10G or InfiniBand?
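
In case it helps with the correlation, something along these lines could
dump each node's TCP reset/retransmit counters so they can be lined up
against the pause timestamps. It is just a rough sketch, assuming Linux
and its /proc/net/snmp counters; the class name is made up:

  import java.io.BufferedReader;
  import java.io.FileReader;

  // Hypothetical helper, not part of HBase: print the kernel's TCP
  // counters (EstabResets, OutRsts, RetransSegs) with a timestamp.
  public class TcpResetCounters {
    public static void main(String[] args) throws Exception {
      BufferedReader in = new BufferedReader(new FileReader("/proc/net/snmp"));
      String[] names = null;
      String line;
      while ((line = in.readLine()) != null) {
        if (!line.startsWith("Tcp:")) continue;
        String[] fields = line.split("\\s+");
        if (names == null) { names = fields; continue; } // header row
        for (int i = 1; i < names.length && i < fields.length; i++) {
          if (names[i].equals("EstabResets") || names[i].equals("OutRsts")
              || names[i].equals("RetransSegs")) {
            System.out.println(System.currentTimeMillis() + " "
                + names[i] + "=" + fields[i]);
          }
        }
      }
      in.close();
    }
  }

Running it once a second and diffing the counters would show whether the
resets spike at the same moment the request counts drop to zero.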


On Wed, Jan 12, 2011 at 1:52 PM, Stack <st...@duboce.net> wrote:

> Jon asks that you describe your loading in the issue.  Would you mind
> doing so?  Ted, stick up in the issue the workload and configs you are
> running, if you don't mind.  I'd like to try it over here.
> Thanks lads,
> St.Ack
>
>
> On Wed, Jan 12, 2011 at 9:03 AM, Wayne <wav...@gmail.com> wrote:
> > Added: https://issues.apache.org/jira/browse/HBASE-3438.
> >
> > On Wed, Jan 12, 2011 at 11:40 AM, Wayne <wav...@gmail.com> wrote:
> >
> >> We are using 0.89.20100924, r1001068
> >>
> >> We are seeing it during heavy write load (which is all the time), but
> >> yesterday we had read load as well as write load and saw both reads
> >> and writes stop for 10+ seconds. The region size is the biggest clue
> >> we have found from our tests: setting up a new cluster with a 1GB max
> >> region size and starting to load heavily, we see this a lot and for
> >> very long time frames. Maybe the bigger file gets hung up more easily
> >> during a split? Your description below also fits, in that early on the
> >> load is not balanced, so it is easier for everything to get stuck on
> >> one node while the balance is still poor. I will file a JIRA. I will
> >> also try to dig deeper into the logs during the pauses to find a node
> >> that might be stuck in a split.
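
For reference, the region size we keep changing is hbase.hregion.max.filesize
in hbase-site.xml. If it helps with a repro, the same threshold can also be
set per table from the client, roughly like this (just a sketch -- the table
and family names are made up):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.HColumnDescriptor;
  import org.apache.hadoop.hbase.HTableDescriptor;
  import org.apache.hadoop.hbase.client.HBaseAdmin;

  // Sketch only: create a test table whose regions split at 512MB
  // regardless of the cluster-wide hbase.hregion.max.filesize.
  public class CreateTableWithRegionSize {
    public static void main(String[] args) throws Exception {
      Configuration conf = HBaseConfiguration.create();
      HTableDescriptor desc = new HTableDescriptor("pause_repro");
      desc.setMaxFileSize(512L * 1024 * 1024); // per-table split threshold
      desc.addFamily(new HColumnDescriptor("cf"));
      HBaseAdmin admin = new HBaseAdmin(conf);
      admin.createTable(desc);
    }
  }

That should let us try 256M/512M/1GB tables side by side on the same
cluster without bouncing the region servers.
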
> >>
> >>
> >>
> >> On Wed, Jan 12, 2011 at 11:17 AM, Stack <st...@duboce.net> wrote:
> >>
> >>> On Tue, Jan 11, 2011 at 2:34 PM, Wayne <wav...@gmail.com> wrote:
> >>> > We have very frequent cluster-wide pauses that stop all reads and
> >>> > writes for seconds.
> >>>
> >>> All reads and all writes?
> >>>
> >>> I've seen the pause too for writes.  It's something I've always meant
> >>> to look into.  Friso postulates one cause.  Another that we've talked
> >>> of is a region taking a while to come back on line after a split or a
> >>> rebalance for whatever reason.  Client loading might be 'random'
> >>> spraying over lots of random regions but they all get stuck waiting on
> >>> one particular region to come back online.
> >>>
> >>> I suppose reads could be blocked for the same reason, if they are all
> >>> trying to read from the offlined region.
> >>>
> >>> What version of HBase are you using?  Splits should be faster in 0.90
> >>> now that the split daughters come up on the same regionserver.
> >>>
> >>> Sorry I don't have a better answer for you.  Need to dig in.
> >>>
> >>> File a JIRA.  If you want to help out some, stick some data up in it.
> >>> Some suggestions would be to enable logging of when we look up region
> >>> locations in the client and then note when requests go to zero.  Can
> >>> you figure out which region the clients are waiting on (if they are
> >>> waiting on any)?  If you can pull out a particular one, try to elicit
> >>> its history at the time of the blockage.  Is it being moved or
> >>> mid-split?  I suppose it makes sense that bigger regions would make
> >>> the situation 'worse'.  I can take a look at it too.
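
For the client-side logging suggestion, I assume something like this in
the loader (or the equivalent log4j.logger.org.apache.hadoop.hbase.client=DEBUG
line in log4j.properties) is what is needed -- just a sketch, assuming
log4j 1.2 is on the classpath:

  import org.apache.log4j.Level;
  import org.apache.log4j.Logger;

  // Sketch: turn the HBase client package up to DEBUG inside the load
  // client, then watch what it logs about region locations at the moment
  // the request counts drop to zero.
  public class EnableClientDebug {
    public static void main(String[] args) {
      Logger.getLogger("org.apache.hadoop.hbase.client").setLevel(Level.DEBUG);
      // ... kick off the normal write/read workload here ...
    }
  }
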
> >>>
> >>> St.Ack
> >>>
> >>>
> >>>
> >>>
> >>> > We are constantly loading data to this cluster of 10 nodes. These
> >>> > pauses can happen as frequently as every minute but sometimes are
> >>> > not seen for 15+ minutes. Basically, watching the region server list
> >>> > with request counts is the only evidence of what is going on. All
> >>> > reads and writes totally stop, and if there is ever any activity it
> >>> > is on the node hosting the .META. table, with a request count of
> >>> > region count + 1. This problem seems to be worse with a larger
> >>> > region size. We tried a 1GB region size and saw this more than we
> >>> > saw actual activity (and stopped using the larger region size
> >>> > because of it). We went back to the default region size and it was
> >>> > better, but we had too many regions, so now we are up to 512M for a
> >>> > region size and we are seeing it more again.
> >>> >
> >>> > Does anyone know what this is? We have dug into all of the logs to
> >>> > find some sort of pause but are not able to find anything. Is this a
> >>> > WAL (HLog) roll? Is this a region split or compaction? Of course our
> >>> > biggest fear is a GC pause on the master, but we do not have GC
> >>> > logging turned on for the master to tell. What could possibly stop
> >>> > the entire cluster from working for seconds at a time, very
> >>> > frequently?
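
On the GC fear just above: if I remember right, hbase-env.sh ships a
commented-out line for turning on GC logging; something along these lines
on the master (and region servers) would confirm or rule it out. The log
path here is just an example:

  export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/hbase/gc-hbase.log"
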
> >>> >
> >>> > Thanks in advance for any ideas of what could be causing this.
> >>> >
> >>>
> >>
> >>
> >
>
