[ https://issues.apache.org/jira/browse/HBASE-4095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13081438#comment-13081438 ]
ramkrishna.s.vasudevan commented on HBASE-4095:
-----------------------------------------------

Patch looks good.

> Hlog may not be rolled in a long time if checkLowReplication's request of LogRoll is blocked
> ---------------------------------------------------------------------------------------------
>
>                 Key: HBASE-4095
>                 URL: https://issues.apache.org/jira/browse/HBASE-4095
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver
>    Affects Versions: 0.90.3
>            Reporter: Jieshan Bean
>            Assignee: Jieshan Bean
>             Fix For: 0.90.4
>
>         Attachments: HBASE-4095-90-v2.patch, HBASE-4095-90.patch, HBASE-4095-trunk-v2.patch, HBASE-4095-trunk.patch, HBase-4095-V4-Branch.patch, HBase-4095-V5-Branch.patch, HBase-4095-V5-trunk.patch, HBase-4095-V6-branch.patch, HBase-4095-V6-trunk.patch, HlogFileIsVeryLarge.gif, RelatedLogs2011-07-28.txt, TestResultForPatch-V4.rar, hbase-root-regionserver-193-195-5-111.rar, surefire-report-V5-trunk.html, surefire-report-branch.html
>
>
> Some large HLog files (larger than 10 GB) appeared in our environment, and I found out why they grew so huge:
> 1. The replica count fell below the expected number, so checkLowReplication is called on every sync.
> 2. checkLowReplication requests a log roll and sets logRollRequested to true:
> {noformat}
> private void checkLowReplication() {
>   // if the number of replicas in HDFS has fallen below the initial
>   // value, then roll logs.
>   try {
>     int numCurrentReplicas = getLogReplication();
>     if (numCurrentReplicas != 0 &&
>         numCurrentReplicas < this.initialReplication) {
>       LOG.warn("HDFS pipeline error detected. " +
>           "Found " + numCurrentReplicas + " replicas but expecting " +
>           this.initialReplication + " replicas. " +
>           " Requesting close of hlog.");
>       requestLogRoll();
>       logRollRequested = true;
>     }
>   } catch (Exception e) {
>     LOG.warn("Unable to invoke DFSOutputStream.getNumCurrentReplicas" + e +
>         " still proceeding ahead...");
>   }
> }
> {noformat}
> 3. requestLogRoll() only submits the roll request; it may not execute in time, because the roller must first acquire the unfair cacheFlushLock, and that lock may be held by the cache-flush threads.
> 4. logRollRequested stays true until the roll actually executes, so in the meantime every log-roll request made from sync() is skipped.
> Here are the logs from when the problem happened (note the file size of hlog "193-195-5-111%3A20020.1309937386639" in the last line):
> 2011-07-06 15:28:59,284 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: HDFS pipeline error detected. Found 2 replicas but expecting 3 replicas. Requesting close of hlog.
> 2011-07-06 15:29:46,714 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937339119, entries=32434, filesize=239589754. New hlog /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937386639
> 2011-07-06 15:29:56,929 WARN org.apache.hadoop.hbase.regionserver.wal.HLog: HDFS pipeline error detected. Found 2 replicas but expecting 3 replicas. Requesting close of hlog.
> 2011-07-06 15:29:56,933 INFO org.apache.hadoop.hbase.regionserver.Store: Renaming flushed file at hdfs://193.195.5.112:9000/hbase/Htable_UFDR_034/a3780cf0c909d8cf8f8ed618b290cc95/.tmp/4656903854447026847 to hdfs://193.195.5.112:9000/hbase/Htable_UFDR_034/a3780cf0c909d8cf8f8ed618b290cc95/value/8603005630220380983
> 2011-07-06 15:29:57,391 INFO org.apache.hadoop.hbase.regionserver.Store: Added hdfs://193.195.5.112:9000/hbase/Htable_UFDR_034/a3780cf0c909d8cf8f8ed618b290cc95/value/8603005630220380983, entries=445880, sequenceid=248900, memsize=207.5m, filesize=130.1m
> 2011-07-06 15:29:57,478 INFO org.apache.hadoop.hbase.regionserver.HRegion: Finished memstore flush of ~207.5m for region Htable_UFDR_034,07664,1309936974158.a3780cf0c909d8cf8f8ed618b290cc95. in 10839ms, sequenceid=248900, compaction requested=false
> 2011-07-06 15:28:59,236 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309926531955, entries=216459, filesize=2370387468. New hlog /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937339119
> 2011-07-06 15:29:46,714 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937339119, entries=32434, filesize=239589754. New hlog /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937386639
> 2011-07-06 16:29:58,775 DEBUG org.apache.hadoop.hbase.regionserver.LogRoller: Hlog roll period 3600000ms elapsed
> 2011-07-06 16:29:58,775 DEBUG org.apache.hadoop.hbase.regionserver.LogRoller: Hlog roll period 3600000ms elapsed
> 2011-07-06 16:30:01,978 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309937386639, entries=1135576, filesize=19220372830. New hlog /hbase/.logs/193-195-5-111,20020,1309922880081/193-195-5-111%3A20020.1309940998890
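To make the failure mode in steps 1-4 of the quoted description concrete, here is a minimal, self-contained Java sketch of that pattern. All names in it (LostRollRequest, the cacheFlushLock stand-in, the print statements) are illustrative, not HBase's actual code: a flush thread holds the lock the roller needs, the pending roll cannot run, and every later low-replication check sees the flag still set and skips its request.

{noformat}
import java.util.concurrent.locks.ReentrantLock;

/**
 * Minimal sketch of the suppression pattern from steps 1-4 above.
 * All identifiers here are illustrative stand-ins, not HBase's.
 */
public class LostRollRequest {

  // Stand-in for cacheFlushLock: the roller must hold it to roll.
  private final ReentrantLock cacheFlushLock = new ReentrantLock();

  // Once set, further roll requests from sync() are skipped until
  // a roll actually runs and clears it.
  private volatile boolean logRollRequested = false;

  /** Called on every sync while replication is below the initial count. */
  void checkLowReplication() {
    if (!logRollRequested) {
      logRollRequested = true;
      System.out.println("roll requested");
    } else {
      System.out.println("roll already pending -- request skipped");
    }
  }

  /** Roller thread: blocks on the lock before it can roll and clear the flag. */
  void rollWriter() {
    cacheFlushLock.lock();          // may wait behind a long cache flush
    try {
      System.out.println("rolling log");
      logRollRequested = false;     // reset only once the roll has run
    } finally {
      cacheFlushLock.unlock();
    }
  }

  public static void main(String[] args) throws InterruptedException {
    LostRollRequest wal = new LostRollRequest();

    // A "cache flush" grabs the lock and holds it for a while.
    Thread flush = new Thread(() -> {
      wal.cacheFlushLock.lock();
      try {
        Thread.sleep(2000);
      } catch (InterruptedException ignored) {
      } finally {
        wal.cacheFlushLock.unlock();
      }
    });
    flush.start();
    Thread.sleep(100);              // let the flush take the lock first

    wal.checkLowReplication();      // first check: requests the roll
    Thread roller = new Thread(wal::rollWriter);
    roller.start();                 // roller now blocks on the lock

    // Every sync during the flush sees logRollRequested == true and skips.
    for (int i = 0; i < 3; i++) {
      Thread.sleep(300);
      wal.checkLowReplication();
    }
    roller.join();
    flush.join();
  }
}
{noformat}

Run as-is, this prints one "roll requested", then several "request skipped" lines, and "rolling log" only once the simulated flush releases the lock. That blocked window is exactly where the real HLog kept growing: the last quoted log line shows filesize=19220372830, roughly 17.9 GB, before the hourly period roll finally fired.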
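The attached patches themselves are not quoted in this comment, so the following is only a hedged sketch of one plausible mitigation, not necessarily what the HBase-4095-V6 patches implement: bound how long a pending request may suppress new ones, so a roll stuck behind the flush lock cannot silence low-replication roll requests indefinitely. The MAX_PENDING_MS name and its value are assumptions made for illustration.

{noformat}
/**
 * Hedged sketch of one possible mitigation -- not necessarily what the
 * attached patches do. A pending request suppresses duplicates only for
 * a bounded time; MAX_PENDING_MS is an illustrative assumption.
 */
public class BoundedRollSuppression {

  private static final long MAX_PENDING_MS = 60_000; // illustrative bound

  private volatile boolean logRollRequested = false;
  private volatile long requestedAtMs = 0L;

  /** Called from sync(); re-requests if a pending roll is stuck too long. */
  void maybeRequestLogRoll() {
    long now = System.currentTimeMillis();
    if (!logRollRequested || now - requestedAtMs > MAX_PENDING_MS) {
      logRollRequested = true;
      requestedAtMs = now;
      requestLogRoll();
    }
  }

  /** Called by the roller once a roll has actually completed. */
  void onLogRolled() {
    logRollRequested = false;
  }

  // Stand-in for notifying the LogRoller thread; illustrative only.
  private void requestLogRoll() {
    System.out.println("log roll requested");
  }

  public static void main(String[] args) {
    BoundedRollSuppression wal = new BoundedRollSuppression();
    wal.maybeRequestLogRoll(); // first call requests a roll
    wal.maybeRequestLogRoll(); // suppressed: a roll is already pending
    wal.onLogRolled();         // roll completed; flag cleared
    wal.maybeRequestLogRoll(); // requests again
  }
}
{noformat}

Clearing the flag only in onLogRolled() keeps the original de-duplication behavior; the time bound merely caps how long a delayed or lost roll can keep new requests suppressed.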