@Stack, we tried your suggestion for getting off the ground with an extra RS. We added 1 more identical RS, and after balancing, killed the extra one. The cluster remained stable for the night, but this morning all 3 of our RSs had OOMs.
In the logs we find many entries like
https://gist.github.com/eadb953fcadbeb302143
Followed by the RSs aborting due to OOMs. Could this maybe be subject to HBASE-4222?

Thanks for your help!

On Fri, Dec 16, 2011 at 3:31 PM, Homer Strong <[email protected]> wrote:
> Thanks for the response! To add to our problem's description: it
> doesn't seem like an absolute number of regions that triggers the
> memory overuse, we've seen it happen now with a wide range of region
> counts.
>
>> Just opening regions, it does this?
> Yes.
>
>> No load?
> Very low load, no requests.
>
>> No swapping?
> Swapping is disabled.
>
>
>> Bring up more xlarge instances and see if gets you off the ground?
>> Then work on getting your number of regions down in number?
> We'll try this and get back in a couple minutes!
>
>
>
> On Fri, Dec 16, 2011 at 3:21 PM, Stack <[email protected]> wrote:
>> On Fri, Dec 16, 2011 at 1:57 PM, Homer Strong <[email protected]> wrote:
>>> Whenever a RS is assigned a large (> 500-600) number of regions, the
>>> heap usage grows without bound. Then the RS constantly GCs and must be
>>> killed.
>>>
>>
>> Just opening regions, it does this?
>>
>> No load?
>>
>> No swapping?
>>
>> What JVM and what args for JVM?
>>
>>
>>> This is with 2000 regions over 3 RSs, with 10 GB heap. RSs have EC2
>>> xlarges. Master is on its own large. Datanodes and namenodes are
>>> adjacent to RSs and master, respectively.
>>>
>>> Looks like a memory leak? Any suggestions would be appreciated.
>>>
>>
>> Bring up more xlarge instances and see if gets you off the ground?
>> Then work on getting your number of regions down in number?
>>
>> St.Ack
