Thank you for the lead! We will definitely look closer at the OS logs.
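In the meantime, here is a rough, untested sketch of what we could run on
each node to line the resets up with the pauses. It just polls the kernel's
cumulative TCP reset counters once a second (Linux-only; it parses
/proc/net/snmp and reads the field positions from the header line rather
than assuming them):

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Arrays;
import java.util.List;

/** Polls /proc/net/snmp so TCP reset spikes can be lined up with pauses. */
public class TcpResetWatcher {
  public static void main(String[] args) throws Exception {
    while (true) {
      String header = null, values = null;
      BufferedReader r = new BufferedReader(new FileReader("/proc/net/snmp"));
      try {
        String line;
        while ((line = r.readLine()) != null) {
          if (line.startsWith("Tcp:")) {
            if (header == null) header = line; // first Tcp: line names the fields
            else values = line;                // second Tcp: line holds the counts
          }
        }
      } finally {
        r.close();
      }
      if (header != null && values != null) {
        List<String> names = Arrays.asList(header.split("\\s+"));
        String[] counts = values.split("\\s+");
        // EstabResets = established connections torn down by a RST;
        // OutRsts = RST segments this host has sent.
        System.out.println(System.currentTimeMillis()
            + " EstabResets=" + counts[names.indexOf("EstabResets")]
            + " OutRsts=" + counts[names.indexOf("OutRsts")]);
      }
      Thread.sleep(1000);
    }
  }
}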

On Thu, Jan 13, 2011 at 6:59 AM, Tatsuya Kawano <tatsuya6...@gmail.com> wrote:

> Hi Wayne,
>
> > We are seeing some TCP Resets on all nodes at the same time, and
> > sometimes quite a lot of them.
>
> Have you checked this article from Andrei and Cosmin? They had a busy
> firewall that caused a network blackout:
>
> http://hstack.org/hbase-performance-testing/
>
> Maybe it's not your case, but just to be sure.
>
> Thanks,
>
> --
> Tatsuya Kawano (Mr.)
> Tokyo, Japan
>
>
> On Jan 13, 2011, at 4:52 AM, Wayne <wav...@gmail.com> wrote:
>
> > We are seeing some TCP Resets on all nodes at the same time, and
> > sometimes quite a lot of them. We have yet to correlate the pauses to
> > the TCP resets, but I am starting to wonder if this is partly a
> > network problem. Does Gigabit Ethernet break down on high-volume
> > nodes? Do high-volume nodes use 10G or InfiniBand?
> >
> >
> > On Wed, Jan 12, 2011 at 1:52 PM, Stack <st...@duboce.net> wrote:
> >
> >> Jon asks that you describe your loading in the issue. Would you mind
> >> doing so? Ted, stick up in the issue the workload and configs you
> >> are running, if you don't mind. I'd like to try it over here.
> >> Thanks lads,
> >> St.Ack
> >>
> >>
> >> On Wed, Jan 12, 2011 at 9:03 AM, Wayne <wav...@gmail.com> wrote:
> >>> Added: https://issues.apache.org/jira/browse/HBASE-3438.
> >>>
> >>> On Wed, Jan 12, 2011 at 11:40 AM, Wayne <wav...@gmail.com> wrote:
> >>>
> >>>> We are using 0.89.20100924, r1001068.
> >>>>
> >>>> We are seeing it during heavy write load (which is all the time),
> >>>> but yesterday we had read load as well as write load and saw both
> >>>> reads and writes stop for 10+ seconds. The region size is the
> >>>> biggest clue we have found from our tests: setting up a new
> >>>> cluster with a 1GB max region size and starting to load heavily,
> >>>> we will see this a lot, and for very long time frames. Maybe the
> >>>> bigger file gets hung up more easily in a split? Your description
> >>>> below also fits, in that early on the load is not balanced, so it
> >>>> is easier to stop everything on one node. I will file a JIRA. I
> >>>> will also try to dig deeper into the logs during the pauses to
> >>>> find a node that might be stuck in a split.
> >>>>
> >>>>
> >>>> On Wed, Jan 12, 2011 at 11:17 AM, Stack <st...@duboce.net> wrote:
> >>>>
> >>>>> On Tue, Jan 11, 2011 at 2:34 PM, Wayne <wav...@gmail.com> wrote:
> >>>>>> We have very frequent cluster-wide pauses that stop all reads
> >>>>>> and writes for seconds.
> >>>>>
> >>>>> All reads and all writes?
> >>>>>
> >>>>> I've seen the pause too for writes. It's something I've always
> >>>>> meant to look into. Friso postulates one cause. Another that
> >>>>> we've talked of is a region taking a while to come back online
> >>>>> after a split or a rebalance, for whatever reason. Client loading
> >>>>> might be 'random', spraying over lots of random regions, but the
> >>>>> clients all get stuck waiting on one particular region to come
> >>>>> back online.
> >>>>>
> >>>>> I suppose reads could be blocked for the same reason if all are
> >>>>> trying to read from the offlined region.
> >>>>>
> >>>>> What version of HBase are you using? Splits should be faster in
> >>>>> 0.90 now that the split daughters come up on the same
> >>>>> regionserver.
> >>>>>
> >>>>> Sorry I don't have a better answer for you. Need to dig in.
> >>>>>
> >>>>> File a JIRA. If you want to help out some, stick some data up in
> >>>>> it.
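> >>>>> For instance, a probe client that timestamps every write and
> >>>>> logs any stall would show exactly when and for how long the
> >>>>> cluster blocks. Untested sketch against the 0.90-era client API;
> >>>>> the table and column names are placeholders:
> >>>>>
> >>>>> import org.apache.hadoop.conf.Configuration;
> >>>>> import org.apache.hadoop.hbase.HBaseConfiguration;
> >>>>> import org.apache.hadoop.hbase.client.HTable;
> >>>>> import org.apache.hadoop.hbase.client.Put;
> >>>>> import org.apache.hadoop.hbase.util.Bytes;
> >>>>>
> >>>>> /** Writes one small row per second; logs any put that stalls. */
> >>>>> public class PauseProbe {
> >>>>>   public static void main(String[] args) throws Exception {
> >>>>>     Configuration conf = HBaseConfiguration.create();
> >>>>>     HTable table = new HTable(conf, "probe"); // placeholder table
> >>>>>     for (long i = 0; ; i++) {
> >>>>>       Put put = new Put(Bytes.toBytes("probe-" + i));
> >>>>>       put.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes(i));
> >>>>>       long start = System.currentTimeMillis();
> >>>>>       table.put(put);
> >>>>>       long elapsed = System.currentTimeMillis() - start;
> >>>>>       if (elapsed > 1000) { // over a second counts as a stall
> >>>>>         System.out.println("put " + i + " blocked " + elapsed + "ms");
> >>>>>       }
> >>>>>       Thread.sleep(1000);
> >>>>>     }
> >>>>>   }
> >>>>> }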
> >>>>> Some suggestions would be to enable logging of when we look up
> >>>>> region locations in the client, and then note when requests go to
> >>>>> zero. Can you figure out which region the clients are waiting on
> >>>>> (if they are waiting on any)? If you can pull out a particular
> >>>>> one, try to elicit its history at the time of blockage. Is it
> >>>>> being moved, or mid-split? I suppose it makes sense that bigger
> >>>>> regions would make the situation 'worse'. I can take a look at it
> >>>>> too.
> >>>>>
> >>>>> St.Ack
> >>>>>
> >>>>>
> >>>>>> We are constantly loading data to this cluster of 10 nodes.
> >>>>>> These pauses can happen as frequently as every minute, but
> >>>>>> sometimes are not seen for 15+ minutes. Basically, watching the
> >>>>>> region server list with request counts is the only evidence of
> >>>>>> what is going on. All reads and writes totally stop, and if
> >>>>>> there is ever any activity it is on the node hosting the .META.
> >>>>>> table, with a request count of region count + 1. This problem
> >>>>>> seems to be worse with a larger region size. We tried a 1GB
> >>>>>> region size and saw the pauses more than we saw actual activity
> >>>>>> (and stopped using a larger region size because of it). We went
> >>>>>> back to the default region size and it was better, but we had
> >>>>>> too many regions, so now we are up to 512M for a region size and
> >>>>>> we are seeing it more again.
> >>>>>>
> >>>>>> Does anyone know what this is? We have dug into all of the logs
> >>>>>> to find some sort of pause but are not able to find anything. Is
> >>>>>> this a WAL (HLog) roll? Is this a region split or compaction? Of
> >>>>>> course our biggest fear is a GC pause on the master, but we do
> >>>>>> not have Java GC logging turned on for the master to tell. What
> >>>>>> could possibly stop the entire cluster from working for seconds
> >>>>>> at a time, very frequently?
> >>>>>>
> >>>>>> Thanks in advance for any ideas of what could be causing this.
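> >>>>> P.S. On the region sizes above: the knob being tuned is the
> >>>>> site-wide hbase.hregion.max.filesize property, but it can also be
> >>>>> set per table, which makes it cheaper to experiment. A sketch
> >>>>> (table and family names are placeholders):
> >>>>>
> >>>>> import org.apache.hadoop.conf.Configuration;
> >>>>> import org.apache.hadoop.hbase.HBaseConfiguration;
> >>>>> import org.apache.hadoop.hbase.HColumnDescriptor;
> >>>>> import org.apache.hadoop.hbase.HTableDescriptor;
> >>>>> import org.apache.hadoop.hbase.client.HBaseAdmin;
> >>>>>
> >>>>> /** Creates a table with an explicit max region size. */
> >>>>> public class CreateProbeTable {
> >>>>>   public static void main(String[] args) throws Exception {
> >>>>>     Configuration conf = HBaseConfiguration.create();
> >>>>>     HTableDescriptor desc = new HTableDescriptor("probe");
> >>>>>     // Same meaning as hbase.hregion.max.filesize, per table: 512MB.
> >>>>>     desc.setMaxFileSize(512L * 1024 * 1024);
> >>>>>     desc.addFamily(new HColumnDescriptor("f"));
> >>>>>     new HBaseAdmin(conf).createTable(desc);
> >>>>>   }
> >>>>> }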