We are seeing some TCP resets on all nodes at the same time, and sometimes quite a lot of them. We have yet to correlate the pauses with the TCP resets, but I am starting to wonder if this is partly a network problem. Does Gigabit Ethernet break down on high-volume nodes? Do high-volume nodes use 10GbE or InfiniBand?
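One way to check that correlation would be to sample the kernel's TCP counters alongside the request counts. Below is a rough, Linux-only sketch for illustration (nothing HBase-specific is assumed; the counter names OutRsts, RetransSegs, and AttemptFails are whatever the kernel exposes in /proc/net/snmp) that prints those counters with a timestamp every few seconds, so reset spikes can be lined up against the pauses:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class TcpResetSampler {
    public static void main(String[] args) throws IOException, InterruptedException {
        while (true) {
            String[] names = null;
            String[] values = null;
            // /proc/net/snmp has a "Tcp:" header row followed by a "Tcp:" value row.
            BufferedReader in = new BufferedReader(new FileReader("/proc/net/snmp"));
            try {
                String line;
                while ((line = in.readLine()) != null) {
                    if (line.startsWith("Tcp:")) {
                        if (names == null) {
                            names = line.split("\\s+");   // header row: field names
                        } else {
                            values = line.split("\\s+");  // value row: counters
                        }
                    }
                }
            } finally {
                in.close();
            }
            if (names != null && values != null) {
                StringBuilder sb = new StringBuilder(String.valueOf(System.currentTimeMillis()));
                for (int i = 1; i < names.length && i < values.length; i++) {
                    if (names[i].equals("OutRsts") || names[i].equals("RetransSegs")
                            || names[i].equals("AttemptFails")) {
                        sb.append(' ').append(names[i]).append('=').append(values[i]);
                    }
                }
                System.out.println(sb);
            }
            Thread.sleep(5000);  // sample every 5 seconds
        }
    }
}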
On Wed, Jan 12, 2011 at 1:52 PM, Stack <st...@duboce.net> wrote:
> Jon asks that you describe your loading in the issue. Would you mind
> doing so? Ted, stick up in the issue the workload and configs you are
> running, if you don't mind. I'd like to try it over here.
> Thanks lads,
> St.Ack
>
> On Wed, Jan 12, 2011 at 9:03 AM, Wayne <wav...@gmail.com> wrote:
> > Added: https://issues.apache.org/jira/browse/HBASE-3438.
> >
> > On Wed, Jan 12, 2011 at 11:40 AM, Wayne <wav...@gmail.com> wrote:
> >
> >> We are using 0.89.20100924, r1001068.
> >>
> >> We are seeing it during heavy write load (which is all the time), but
> >> yesterday we had read load as well as write load and saw both reads
> >> and writes stop for 10+ seconds. The region size is the biggest clue
> >> we have found from our tests: setting up a new cluster with a 1GB max
> >> region size and starting to load heavily, we will see this a lot, for
> >> long time frames. Maybe the bigger file gets hung up more easily with
> >> a split? Your description below also fits, in that early on the load
> >> is not balanced, so it is easier to stop everything on one node. I
> >> will file a JIRA. I will also try to dig deeper into the logs during
> >> the pauses to find a node that might be stuck in a split.
> >>
> >> On Wed, Jan 12, 2011 at 11:17 AM, Stack <st...@duboce.net> wrote:
> >>
> >>> On Tue, Jan 11, 2011 at 2:34 PM, Wayne <wav...@gmail.com> wrote:
> >>> > We have very frequent cluster-wide pauses that stop all reads and
> >>> > writes for seconds.
> >>>
> >>> All reads and all writes?
> >>>
> >>> I've seen the pause too for writes. It's something I've always meant
> >>> to look into. Friso postulates one cause. Another that we've talked
> >>> of is a region taking a while to come back online after a split or a
> >>> rebalance, for whatever reason. Client loading might be 'random'
> >>> spraying over lots of random regions, but they all get stuck waiting
> >>> on one particular region to come back online.
> >>>
> >>> I suppose reads could be blocked for the same reason if all are
> >>> trying to read from the offlined region.
> >>>
> >>> What version of HBase are you using? Splits should be faster in 0.90
> >>> now that the split daughters come up on the same regionserver.
> >>>
> >>> Sorry I don't have a better answer for you. Need to dig in.
> >>>
> >>> File a JIRA. If you want to help out some, stick some data up in it.
> >>> Some suggestions: enable logging of when we look up region locations
> >>> in the client, and then note when requests go to zero. Can you figure
> >>> out what region the clients are waiting on (if they are waiting on
> >>> any)? If you can pull out a particular one, try to elicit its history
> >>> at the time of blockage. Is it being moved or mid-split? I suppose it
> >>> makes sense that bigger regions would make the situation 'worse'. I
> >>> can take a look at it too.
> >>>
> >>> St.Ack
> >>>
> >>> > We are constantly loading data to this cluster of 10 nodes. These
> >>> > pauses can happen as frequently as every minute but sometimes are
> >>> > not seen for 15+ minutes. Basically, watching the region server
> >>> > list with request counts is the only evidence of what is going on.
> >>> > All reads and writes totally stop, and if there is ever any
> >>> > activity it is on the node hosting the .META. table, with a
> >>> > request count of region count + 1.
> >>> > This problem seems to be worse with a larger region size. We tried
> >>> > a 1GB region size and saw this more than we saw actual activity
> >>> > (and stopped using a larger region size because of it). We went
> >>> > back to the default region size and it was better, but we had too
> >>> > many regions, so now we are up to 512M for a region size and we
> >>> > are seeing it more again.
> >>> >
> >>> > Does anyone know what this is? We have dug into all of the logs to
> >>> > find some sort of pause but are not able to find anything. Is this
> >>> > a WAL (HLog) roll? Is this a region split or compaction? Of course
> >>> > our biggest fear is a GC pause on the master, but we do not have
> >>> > GC logging turned on for the master to tell. What could possibly
> >>> > stop the entire cluster from working for seconds at a time, very
> >>> > frequently?
> >>> >
> >>> > Thanks in advance for any ideas of what could be causing this.
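As a starting point for Stack's suggestion about logging client region location lookups: in the 0.89/0.90 client that lookup logic lives in HConnectionManager, as far as I can tell, so turning its logger up to DEBUG should show when a client re-fetches a region location. A minimal sketch, assuming the log4j 1.x API the HBase client shipped with at the time (the same effect can be had from log4j.properties on the client side):

import org.apache.log4j.Level;
import org.apache.log4j.Logger;

public class ClientLookupLogging {
    public static void enable() {
        // DEBUG the class that caches and re-looks-up region locations.
        Logger.getLogger("org.apache.hadoop.hbase.client.HConnectionManager")
              .setLevel(Level.DEBUG);
        // Optionally widen to the whole client package in case the
        // relevant messages are logged elsewhere in this version.
        Logger.getLogger("org.apache.hadoop.hbase.client").setLevel(Level.DEBUG);
    }
}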
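To capture exactly when the per-regionserver request counts drop to zero, rather than eyeballing the master UI, something like the following could log them every few seconds with a timestamp. This is only a sketch assuming the 0.90-era client API (HBaseAdmin.getClusterStatus(), ClusterStatus.getServerInfo(), HServerLoad.getNumberOfRequests()); exact method names may differ in the 0.89.20100924 build:

import java.util.Collection;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.ClusterStatus;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HServerInfo;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class RequestCountWatcher {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        while (true) {
            // Pull the cluster status and print each regionserver's request count
            // on one timestamped line, so the zero-request windows are easy to spot.
            ClusterStatus status = admin.getClusterStatus();
            Collection<HServerInfo> servers = status.getServerInfo();
            StringBuilder sb = new StringBuilder(String.valueOf(System.currentTimeMillis()));
            for (HServerInfo server : servers) {
                sb.append(' ')
                  .append(server.getServerName())
                  .append('=')
                  .append(server.getLoad().getNumberOfRequests());
            }
            System.out.println(sb);
            Thread.sleep(5000);
        }
    }
}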
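For reference on the region-size experiments discussed above (default, 512M, 1GB): cluster-wide the split threshold is the hbase.hregion.max.filesize property in hbase-site.xml. The sketch below shows the per-table equivalent, assuming the 0.90-era HTableDescriptor.setMaxFileSize(); the table and family names are made up for illustration:

import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;

public class RegionSizeExample {
    public static HTableDescriptor describe() {
        HTableDescriptor desc = new HTableDescriptor("my_table");   // hypothetical table name
        desc.addFamily(new HColumnDescriptor("d"));                 // hypothetical column family
        desc.setMaxFileSize(512L * 1024 * 1024);                    // split regions past ~512MB
        return desc;
    }
}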