Re: SocketTimeoutException caused by GC?

2011-01-27 Thread Ted Dunning
To amplify on this, the collections before tenuring are copying collections and thus immune to fragmentation. CMS, however, stands for Concurrent Mark Sweep and mark-sweep collectors are subject to fragmentation. When fragmentation causes a failure to allocate, then you get a full copying GC. If
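The fragmentation failure described above can be illustrated with a toy model. This is a minimal sketch, not how CMS is implemented: `FreeListHeap`, its first-fit `allocate`, and `compact` are hypothetical names invented for illustration, and the heap is just a list of cells. It shows only the core idea that a non-compacting (mark-sweep style) heap can have enough free space in total and still fail an allocation, while sliding live objects together makes the same request succeed.

```python
from typing import List, Optional, Tuple


class FreeListHeap:
    """Toy non-compacting heap: each cell is None (free) or holds an object id.

    Hypothetical model for illustration only; real collectors are far more complex.
    """

    def __init__(self, size: int) -> None:
        self.cells: List[Optional[int]] = [None] * size

    def allocate(self, obj_id: int, length: int) -> bool:
        """First-fit allocation: needs `length` contiguous free cells."""
        run = 0
        for i, cell in enumerate(self.cells):
            run = run + 1 if cell is None else 0
            if run == length:
                start = i - length + 1
                for j in range(start, i + 1):
                    self.cells[j] = obj_id
                return True
        return False  # no contiguous run is large enough

    def free(self, obj_id: int) -> None:
        """Sweep: release the cells of a dead object without moving anything."""
        self.cells = [None if c == obj_id else c for c in self.cells]

    def free_cells(self) -> int:
        return sum(1 for c in self.cells if c is None)

    def compact(self) -> None:
        """Slide live objects to the front, as a compacting collection would."""
        live = [c for c in self.cells if c is not None]
        self.cells = live + [None] * (len(self.cells) - len(live))


def demo() -> Tuple[int, bool, bool]:
    heap = FreeListHeap(16)
    for obj_id in range(8):          # fill the heap with eight 2-cell objects
        heap.allocate(obj_id, 2)
    for obj_id in (0, 2, 4, 6):      # every other object dies
        heap.free(obj_id)
    free_before = heap.free_cells()  # 8 cells free, but only in runs of 2
    fit_before = heap.allocate(100, 4)
    heap.compact()                   # stand-in for the full compacting GC
    fit_after = heap.allocate(100, 4)
    return free_before, fit_before, fit_after
```

Calling `demo()` returns `(8, False, True)`: eight cells are free, the 4-cell request fails because the free space is scattered, and the same request succeeds once the live objects have been compacted. In a real JVM that compaction step is the long stop-the-world pause the thread is worried about.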

Re: SocketTimeoutException caused by GC?

2011-01-27 Thread Stack
On Thu, Jan 27, 2011 at 11:38 AM, Wayne wrote: > Should we schedule a rolling restart every 24 hours? You could do this (see J-D's note). You could try a different JVM (Todd recently tripped over this beauty, which would seem to say that the fixes added in u20 to address fragmentation made things worse: ht

Re: SocketTimeoutException caused by GC?

2011-01-27 Thread Jean-Daniel Cryans
Doing it from the master is what we aim to do, but it's a lot more involved. I know that jgray has a few ideas on the subject. J-D On Thu, Jan 27, 2011 at 4:25 PM, Ted Dunning wrote: > Why doesn't the master do this?  Why not just set it up so that you can tell > the master that the maximum numb

Re: SocketTimeoutException caused by GC?

2011-01-27 Thread Ted Dunning
Why doesn't the master do this? Why not just set it up so that you can tell the master that the maximum number of regions for the soon-to-go region server is 0? On Thu, Jan 27, 2011 at 3:53 PM, Jean-Daniel Cryans wrote: > Writing this gives me an idea... I think one "easy" way we could > achieve

Re: SocketTimeoutException caused by GC?

2011-01-27 Thread Jean-Daniel Cryans
Not as far as I know, unless you disabled splits from the beginning like some ppl do. J-D On Thu, Jan 27, 2011 at 4:22 PM, Ted Yu wrote: > Is there a way to disable splitting (on a particular region server) ? > > On Thu, Jan 27, 2011 at 4:20 PM, Jean-Daniel Cryans > wrote: > >> Mmm yes for the

Re: SocketTimeoutException caused by GC?

2011-01-27 Thread Ted Yu
Is there a way to disable splitting (on a particular region server) ? On Thu, Jan 27, 2011 at 4:20 PM, Jean-Daniel Cryans wrote: > Mmm yes for the sake of not having a single region that moved, but it > wouldn't be so bad... it just means that those regions will be closed > when the RS closes. >

Re: SocketTimeoutException caused by GC?

2011-01-27 Thread Jean-Daniel Cryans
Mmm yes for the sake of not having a single region that moved, but it wouldn't be so bad... it just means that those regions will be closed when the RS closes. Also it's possible to have splits during that time, again it's not dramatic as long as the script doesn't freak out because a region is go

Re: SocketTimeoutException caused by GC?

2011-01-27 Thread Ted Yu
Should steps 1 and 2 below be exchanged ? Regards On Thu, Jan 27, 2011 at 3:53 PM, Jean-Daniel Cryans wrote: > To mitigate heap fragmentation, you could consider adding more nodes > to the cluster :) > > Regarding rolling restarts, currently there's one major issue: > https://issues.apache.org/j

Re: SocketTimeoutException caused by GC?

2011-01-27 Thread Jean-Daniel Cryans
To mitigate heap fragmentation, you could consider adding more nodes to the cluster :) Regarding rolling restarts, currently there's one major issue: https://issues.apache.org/jira/browse/HBASE-3441 How it currently works is a bit dumb: when you cleanly close a region server, it will first close a

Re: SocketTimeoutException caused by GC?

2011-01-27 Thread Wayne
I assumed GC was *trying* to roll. It shows the last 30min of logs with control characters at the end. We are not all writes. In terms of writes we can wait and the zookeeper timeout can go way up, but we also need to support real-time reads (end user based) and that is why the zookeeper timeout i

Re: SocketTimeoutException caused by GC?

2011-01-27 Thread Stack
On Thu, Jan 27, 2011 at 6:48 AM, Wayne wrote: > We have got .90 up and running well, but again after 24 hours of loading a > node went down. Under it all I assume it is a GC issue, but the GC logging > rolls every < 60 minutes so I can never see logs from 5 hours ago (working > on getting Scribe u

Re: SocketTimeoutException caused by GC?

2011-01-27 Thread Wayne
I only see the bad datanode error on the one node right before zookeeper brought it down. On Thu, Jan 27, 2011 at 10:53 AM, Ted Yu wrote: > About bad datanode error, I found 164 occurrences in 7-node dev cluster > hbase 0.90 region server logs. > In our 14 node staging cluster running hbase 0.20

Re: SocketTimeoutException caused by GC?

2011-01-27 Thread Ted Yu
About bad datanode error, I found 164 occurrences in 7-node dev cluster hbase 0.90 region server logs. In our 14 node staging cluster running hbase 0.20.6, I found none. Both use cdh3b2 hadoop. On Thu, Jan 27, 2011 at 6:48 AM, Wayne wrote: > We have got .90 up and running well, but again after

SocketTimeoutException caused by GC?

2011-01-27 Thread Wayne
We have got .90 up and running well, but again after 24 hours of loading a node went down. Under it all I assume it is a GC issue, but the GC logging rolls every < 60 minutes so I can never see logs from 5 hours ago (working on getting Scribe up to solve that). Most of our issues are a node being m