I'd advise setting the upper limit for WALs back down to 32 rather than the 96 you have. Let's figure out why old logs are not being cleared even at 32. With 96, a crash means the log splitting process has more logs to work through (~96 rather than ~32). The split takes longer to run, and so the regions take longer to come back online.
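For reference, a sketch of that setting, assuming the limit on your cluster was raised through the hbase.regionserver.maxlogs property in hbase-site.xml (that is an assumption on my part; check where the 96 actually comes from in your config):

```xml
<!-- hbase-site.xml on each regionserver.
     Assumes the WAL limit is controlled by hbase.regionserver.maxlogs. -->
<property>
  <name>hbase.regionserver.maxlogs</name>
  <value>32</value>
</property>
```

As far as I know the value is read when the regionserver starts, so a restart would be needed for it to take effect.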
Is this the state of things across all regionservers, or just one or two? As J-D asks, your loading profile and the number of regions per regionserver would be of interest. Next up would be your putting a regionserver log somewhere we could pull it from and take a look. We'd check the edit sequence numbers to figure out why we're not letting logs go.

Thanks Kevin,
St.Ack

On Tue, Dec 15, 2009 at 10:34 AM, Kevin Peterson <kevin...@gmail.com> wrote:

> We're running a 13 node HBase cluster. We had some problems a week ago with
> it being overloaded and errors related to not being able to find a block on
> HDFS, but adding four more nodes and increasing max heap from 3GB to 4.5GB
> on all nodes fixed any problems.
>
> Looking at the logs now, though, we see that HLogs are not getting removed:
>
> 2009-12-15 01:45:48,426 INFO org.apache.hadoop.hbase.regionserver.HLog:
> Roll
> /hbase/.logs/mi-prod-app33,60020,1260495617070/hlog.dat.1260867136036,
> entries=210524, calcsize=63757422, filesize=41073798. New hlog
> /hbase/.logs/mi-prod-app33,60020,1260495617070/hlog.dat.1260870348421
> 2009-12-15 01:45:48,427 INFO org.apache.hadoop.hbase.regionserver.HLog: Too
> many hlogs: logs=130, maxlogs=96; forcing flush of region with oldest
> edits:
> articles-article-id,f15489ea-38a4-4127-9179-1b2dc5f3b5d4,1260083783909
> 2009-12-15 01:57:14,188 INFO org.apache.hadoop.hbase.regionserver.HRegion:
> Starting compaction on region
>
> articles,\x00\x00\x01\x25\x8C\x0F\xCB\x18\xB5U\xF7\xC6\x5DoH\xB8\x98\xEBH,E\x7C\x07\x14,1260830133341
> 2009-12-15 01:57:17,519 INFO org.apache.hadoop.hbase.regionserver.HLog:
> Roll
> /hbase/.logs/mi-prod-app33,60020,1260495617070/hlog.dat.1260870348421,
> entries=92795, calcsize=63908073, filesize=54042783.
> New hlog
> /hbase/.logs/mi-prod-app33,60020,1260495617070/hlog.dat.1260871037510
> 2009-12-15 01:57:17,519 INFO org.apache.hadoop.hbase.regionserver.HLog: Too
> many hlogs: logs=131, maxlogs=96; forcing flush of region with oldest
> edits:
> articles-article-id,f1cd1b02-3d1b-453c-b44f-94ec5a1e3a46,1260007536878
>
> From reading the log message, I interpret this as saying that every time it
> rolls an hlog, if there are more than maxlogs logs, it will flush one
> region. I'm assuming that a log could have edits for multiple regions, so
> this seems to mean that if we have 100 regions and maxlogs set to 96, if it
> flushes one region each time it rolls a log, it will create 100 logs before
> it flushes all regions and is able to delete the log, so it will reach
> steady state at 196 hlogs. Is this correct?
>
> We're concerned because when we had problems last week, we saw lots of log
> messages related to "Too many hlogs" and had assumed they were related to
> the problems. Is this anything to worry about?
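
The "Too many hlogs" lines quoted above carry the numbers needed to compare one regionserver against another. A minimal sketch in Python, assuming each message sits on a single line in the actual regionserver log (the wrapping above looks like the mail client's doing) and that the format matches the excerpt. The function name summarize and the script name hlog_summary.py are made up for illustration.

```python
import re
import sys

# Message format taken from the log excerpt quoted above.
TOO_MANY = re.compile(r"Too many hlogs: logs=(\d+), maxlogs=(\d+)")


def summarize(lines):
    """Return (warning count, highest logs= value, last maxlogs= value),
    or None if no "Too many hlogs" line was found."""
    count = 0
    peak = 0
    maxlogs = None
    for line in lines:
        match = TOO_MANY.search(line)
        if match is None:
            continue
        count += 1
        peak = max(peak, int(match.group(1)))
        maxlogs = int(match.group(2))
    if count == 0:
        return None
    return count, peak, maxlogs


if __name__ == "__main__":
    # Usage: python hlog_summary.py <regionserver log> [more logs ...]
    for path in sys.argv[1:]:
        with open(path, errors="replace") as log_file:
            print(path, summarize(log_file))
```

Running it over one log from each regionserver would show whether every node is sitting above maxlogs or only a few of them.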