Thank you for the lead! We will definitely look closer at the OS logs.
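
For anyone else following along: one way we plan to watch for the TCP
resets at the OS level is to poll the kernel's counters in
/proc/net/snmp (Linux). A minimal sketch; the 5-second interval is
arbitrary:

  import time

  def tcp_counters():
      # /proc/net/snmp has a 'Tcp:' header line followed by a 'Tcp:'
      # value line; EstabResets counts established connections torn
      # down by a RST, OutRsts counts RSTs this host sent.
      with open('/proc/net/snmp') as f:
          tcp = [line.split() for line in f if line.startswith('Tcp:')]
      stats = dict(zip(tcp[0][1:], (int(v) for v in tcp[1][1:])))
      return stats['EstabResets'], stats['OutRsts']

  prev = tcp_counters()
  while True:
      time.sleep(5)
      cur = tcp_counters()
      print('EstabResets +%d, OutRsts +%d in last 5s'
            % (cur[0] - prev[0], cur[1] - prev[1]))
      prev = cur

Correlating spikes in those counters with the pause windows should tell
us whether the resets and the stalls actually line up.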

On Thu, Jan 13, 2011 at 6:59 AM, Tatsuya Kawano <tatsuya6...@gmail.com> wrote:

>
> Hi Wayne,
>
> > We are seeing some TCP Resets on all nodes at the same time, and
> > sometimes quite a lot of them.
>
>
> Have you checked this article from Andrei and Cosmin? They had a busy
> firewall that caused a network blackout.
>
> http://hstack.org/hbase-performance-testing/
>
> Maybe it's not your case, but it's worth checking just to be sure.
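>
> (If the firewall theory fits, the usual culprit is the connection
> tracking table filling up, after which new connections get dropped or
> reset. A minimal check on Linux, assuming the nf_conntrack proc paths;
> older kernels expose ip_conntrack_* instead:
>
>   count = int(open('/proc/sys/net/netfilter/nf_conntrack_count').read())
>   limit = int(open('/proc/sys/net/netfilter/nf_conntrack_max').read())
>   print('conntrack: %d of %d entries in use' % (count, limit))
>
> A count sitting near the limit would explain cluster-wide blackouts.)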
>
> Thanks,
>
> --
> Tatsuya Kawano (Mr.)
> Tokyo, Japan
>
>
> On Jan 13, 2011, at 4:52 AM, Wayne <wav...@gmail.com> wrote:
>
> > We are seeing some TCP Resets on all nodes at the same time, and
> > sometimes quite a lot of them. We have yet to correlate the pauses
> > with the TCP resets, but I am starting to wonder if this is partly a
> > network problem. Does Gigabit Ethernet break down on high-volume
> > nodes? Do high-volume nodes use 10G or InfiniBand?
> >
> >
> > On Wed, Jan 12, 2011 at 1:52 PM, Stack <st...@duboce.net> wrote:
> >
> >> Jon asks that you describe your loading in the issue.  Would you mind
> >> doing so?  Ted, stick up in the issue the workload and configs you
> >> are running, if you don't mind.  I'd like to try it over here.
> >> Thanks lads,
> >> St.Ack
> >>
> >>
> >> On Wed, Jan 12, 2011 at 9:03 AM, Wayne <wav...@gmail.com> wrote:
> >>> Added: https://issues.apache.org/jira/browse/HBASE-3438.
> >>>
> >>> On Wed, Jan 12, 2011 at 11:40 AM, Wayne <wav...@gmail.com> wrote:
> >>>
> >>>> We are using 0.89.20100924, r1001068
> >>>>
> >>>> We see it during heavy write load (which is all the time), but
> >>>> yesterday we had read load as well as write load and saw both reads
> >>>> and writes stop for 10+ seconds. The region size is the biggest clue
> >>>> we have found from our tests: when we set up a new cluster with a
> >>>> 1GB max region size and start loading heavily, we see this a lot,
> >>>> and for long time frames. Maybe the bigger file gets hung up more
> >>>> easily during a split? Your description below also fits, in that
> >>>> early on the load is not balanced, so it is easier to stop
> >>>> everything on one node. I will file a JIRA. I will also try to dig
> >>>> deeper into the logs during the pauses to find a node that might be
> >>>> stuck in a split.
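> >>>>
> >>>> Something like this is what I plan to run over the regionserver
> >>>> logs around a pause window. The patterns are only guesses at the
> >>>> split/open messages, which likely differ by version:
> >>>>
> >>>>   import re, sys
> >>>>
> >>>>   # Print log lines mentioning splits or regions opening/closing
> >>>>   # so they can be lined up against the observed pause windows.
> >>>>   # NOTE: these patterns are assumptions; adjust to your logs.
> >>>>   pat = re.compile(r'split|onlin|closing|opening', re.IGNORECASE)
> >>>>   for path in sys.argv[1:]:
> >>>>       with open(path) as f:
> >>>>           for line in f:
> >>>>               if pat.search(line):
> >>>>                   print(path + ': ' + line.rstrip())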
> >>>>
> >>>>
> >>>>
> >>>> On Wed, Jan 12, 2011 at 11:17 AM, Stack <st...@duboce.net> wrote:
> >>>>
> >>>>> On Tue, Jan 11, 2011 at 2:34 PM, Wayne <wav...@gmail.com> wrote:
> >>>>>> We have very frequent cluster-wide pauses that stop all reads
> >>>>>> and writes for seconds.
> >>>>>
> >>>>> All reads and all writes?
> >>>>>
> >>>>> I've seen the pause too for writes.  It's something I've always meant
> >>>>> to look into.  Friso postulates one cause.  Another that we've talked
> >>>>> of is a region taking a while to come back online after a split or a
> >>>>> rebalance, for whatever reason.  Client loading might be 'randomly'
> >>>>> spraying over lots of random regions, but they all get stuck waiting
> >>>>> on one particular region to come back online.
> >>>>>
> >>>>> I suppose reads could be blocked for the same reason if all are
> >>>>> trying to read from the offlined region.
> >>>>>
> >>>>> What version of hbase are you using?  Splits should be faster in 0.90
> >>>>> now that the split daughters come up on the same region server.
> >>>>>
> >>>>> Sorry I don't have a better answer for you.  Need to dig in.
> >>>>>
> >>>>> File a JIRA.  If you want to help out some, stick some data up in it.
> >>>>> One suggestion would be to enable logging of when we look up region
> >>>>> locations in the client and then note when requests go to zero.  Can
> >>>>> you figure out what region the clients are waiting on (if they are
> >>>>> waiting on any)?  If you can pull out a particular one, try to elicit
> >>>>> its history at the time of blockage.  Is it being moved or mid-split?
> >>>>> I suppose it makes sense that bigger regions would make the situation
> >>>>> 'worse'.  I can take a look at it too.
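> >>>>>
> >>>>> (For the client-side lookup logging, something along these lines
> >>>>> in the client's log4j.properties should do it; how fine-grained
> >>>>> the logger name can be varies by version:
> >>>>>
> >>>>>   log4j.logger.org.apache.hadoop.hbase.client=DEBUG
> >>>>>
> >>>>> then watch which region location lookups coincide with requests
> >>>>> dropping to zero.)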
> >>>>>
> >>>>> St.Ack
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>> We are constantly loading data to this cluster of 10 nodes.
> >>>>>> These pauses can happen as frequently as every minute, but
> >>>>>> sometimes are not seen for 15+ minutes. Basically, watching the
> >>>>>> region server list with request counts is the only evidence of
> >>>>>> what is going on. All reads and writes totally stop, and if there
> >>>>>> is ever any activity it is on the node hosting the .META. table,
> >>>>>> with a request count of region count + 1. This problem seems to
> >>>>>> be worse with a larger region size. We tried a 1GB region size
> >>>>>> and saw this more than we saw actual activity (and stopped using
> >>>>>> a larger region size because of it). We went back to the default
> >>>>>> region size and it was better, but we had too many regions, so
> >>>>>> now we are up to 512M for a region size and we are seeing it more
> >>>>>> again.
> >>>>>>
> >>>>>> Does anyone know what this is? We have dug into all of the logs
> >>>>>> to find some sort of pause but are not able to find anything. Is
> >>>>>> this a WAL (HLog) roll? Is this a region split or compaction? Of
> >>>>>> course our biggest fear is a GC pause on the master, but we do
> >>>>>> not have Java GC logging turned on for the master to tell. What
> >>>>>> could possibly stop the entire cluster from working for seconds
> >>>>>> at a time, very frequently?
> >>>>>>
> >>>>>> Thanks in advance for any ideas of what could be causing this.
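> >>>>>>
> >>>>>> (To rule the GC theory in or out, we could turn on GC logging for
> >>>>>> the master and regionservers, e.g. via hbase-env.sh; the log path
> >>>>>> here is just an example:
> >>>>>>
> >>>>>>   export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails \
> >>>>>>       -XX:+PrintGCTimeStamps -Xloggc:/var/log/hbase/gc.log"
> >>>>>>
> >>>>>> and then look for long collections around the stall windows.)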
> >>>>>>
> >>>>>
> >>>>
> >>>>
> >>>
> >>
>
>
