OK all. Problem seems to be fixed. There was one more empty map file in the .META. region. Deleting it solved the problem and everything appears to be coming back normally.
Thanks all! Danny On Wed, May 21, 2008 at 10:53 AM, stack <[EMAIL PROTECTED]> wrote: > What does a listing of /hbase/ .META./1028785192/info/info/ show Daniel? > You are still getting EOFException trying to open .META. region? (Its > kinda dumb that HDFS doesn't say which file the EOF is coming out of). > > St.Ack > > > > > > Daniel Leffel wrote: > >> Thanks for the response, Jim. >> >> I've tried this and I'm getting the same behavior. Anything else I should >> be >> looking for? >> >> I also found empty files in the compaction directory which I deleted - >> hope >> that was ok. >> >> >> On Wed, May 21, 2008 at 9:29 AM, Jim Kellerman <[EMAIL PROTECTED]> wrote: >> >> >> >>> What is happening is that the region server is trying to open the meta, >>> and >>> failing because there is an empty file somewhere in the region or in the >>> recovery log. I would advise the following: >>> >>> 1. shut down HBase (as cleanly as possible) >>> 2. find and delete any zero length files >>> 3. there may be empty, but not zero length, MapFiles. MapFiles are >>> stored in a directory that looks like: >>> <hbase-root>/<tablename>/<a number>/<columnname>/mapfiles/<a long >>> number>/ >>> >>> In this directory there are two files: 'data' and 'index'. An empty >>> MapFile (one that has been opened and closed with no data written to >>> it) 'data' will have size 110 (bytes) and 'info' will have size 137. >>> >>> If you find one of these, you should delete the directory that contains >>> them (in the example above <a long number>). >>> >>> >>> --- >>> Jim Kellerman, Senior Engineer; Powerset >>> >>> >>> >>> >>>> -----Original Message----- >>>> From: Daniel Leffel [mailto:[EMAIL PROTECTED] >>>> Sent: Wednesday, May 21, 2008 8:13 AM >>>> To: hbase-user@hadoop.apache.org >>>> Subject: Re: Problem caused by HBASE-617 - Blocks missing >>>> from hlog.dat >>>> >>>> After letting the hbase attempt to assign regions all night >>>> long, I awoke to an unassigned meta region. Here is the master log: >>>> >>>> Desc: {name: .META., families: {info:={name: info, max versions: 1, >>>> compression: NONE, in memory: false, max length: 2147483647, >>>> bloom filter: >>>> none}}} >>>> 2008-05-21 11:10:29,560 DEBUG >>>> org.apache.hadoop.hbase.HMaster: Main processing loop: >>>> ProcessRegionClose of .META.,,1, true >>>> 2008-05-21 11:10:29,560 INFO org.apache.hadoop.hbase.HMaster: >>>> region closed: >>>> .META.,,1 >>>> 2008-05-21 11:10:29,560 INFO org.apache.hadoop.hbase.HMaster: reassign >>>> region: .META.,,1 >>>> 2008-05-21 11:10:29,729 INFO org.apache.hadoop.hbase.HMaster: >>>> assigning region .META.,,1 to server 10.252.242.159:60020 >>>> 2008-05-21 11:10:32,740 DEBUG >>>> org.apache.hadoop.hbase.HMaster: Received >>>> MSG_REPORT_PROCESS_OPEN : .META.,,1 from 10.252.242.159:60020 >>>> 2008-05-21 11:10:32,740 DEBUG >>>> org.apache.hadoop.hbase.HMaster: Received MSG_REPORT_CLOSE : >>>> .META.,,1 from 10.252.242.159:60020 >>>> 2008-05-21 11:10:32,740 INFO org.apache.hadoop.hbase.HMaster: >>>> 10.252.242.159:60020 no longer serving regionname: .META.,,1, >>>> startKey: <>, >>>> endKey: <>, encodedName: 1028785192, tableDesc: {name: >>>> .META., families: >>>> {info:={name: info, max versions: 1, compression: NONE, in >>>> memory: false, max length: 2147483647, bloom filter: none}}} >>>> 2008-05-21 11:10:32,740 DEBUG >>>> org.apache.hadoop.hbase.HMaster: Main processing loop: >>>> ProcessRegionClose of .META.,,1, true >>>> 2008-05-21 11:10:32,741 INFO org.apache.hadoop.hbase.HMaster: >>>> region closed: >>>> .META.,,1 >>>> 2008-05-21 11:10:32,741 INFO org.apache.hadoop.hbase.HMaster: reassign >>>> region: .META.,,1 >>>> 2008-05-21 11:10:34,585 INFO org.apache.hadoop.hbase.HMaster: >>>> assigning region .META.,,1 to server 10.254.30.79:60020 >>>> 2008-05-21 11:10:37,595 DEBUG >>>> org.apache.hadoop.hbase.HMaster: Received >>>> MSG_REPORT_PROCESS_OPEN : .META.,,1 from 10.254.30.79:60020 >>>> 2008-05-21 11:10:37,596 DEBUG >>>> org.apache.hadoop.hbase.HMaster: Received MSG_REPORT_CLOSE : >>>> .META.,,1 from 10.254.30.79:60020 >>>> 2008-05-21 11:10:37,596 INFO org.apache.hadoop.hbase.HMaster: >>>> 10.254.30.79:60020 no longer serving regionname: .META.,,1, >>>> startKey: <>, >>>> endKey: <>, encodedName: 1028785192, tableDesc: {name: >>>> .META., families: >>>> {info:={name: info, max versions: 1, compression: NONE, in >>>> memory: false, max length: 2147483647, bloom filter: none}}} >>>> 2008-05-21 11:10:37,596 DEBUG >>>> org.apache.hadoop.hbase.HMaster: Main processing loop: >>>> ProcessRegionClose of .META.,,1, true >>>> >>>> >>>> and here is the regin server side: >>>> 2008-05-21 11:12:38,027 INFO org.apache.hadoop.hbase.HRegionServer: >>>> MSG_REGION_OPEN : .META.,,1 >>>> 2008-05-21 11:12:38,027 DEBUG >>>> org.apache.hadoop.hbase.HRegion: Opening region .META.,,1/1028785192 >>>> 2008-05-21 11:12:38,035 DEBUG org.apache.hadoop.hbase.HStore: >>>> loaded >>>> /hbase/.META./1028785192/info/info/2509658022189995817, >>>> isReference=false >>>> 2008-05-21 11:12:38,036 DEBUG org.apache.hadoop.hbase.HStore: >>>> loaded >>>> /hbase/.META./1028785192/info/info/8183182393002383771, >>>> isReference=false >>>> 2008-05-21 11:12:38,049 DEBUG org.apache.hadoop.hbase.HStore: Loaded 2 >>>> file(s) in hstore 1028785192/info, max sequence id 917793587 >>>> 2008-05-21 11:12:38,116 ERROR >>>> org.apache.hadoop.hbase.HRegionServer: error opening region >>>> .META.,,1 java.io.EOFException >>>> at java.io.DataInputStream.readFully(DataInputStream.java:180) >>>> at java.io.DataInputStream.readFully(DataInputStream.java:152) >>>> at >>>> org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1434) >>>> at >>>> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.j >>>> ava:1411) >>>> at >>>> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.j >>>> ava:1400) >>>> at >>>> org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.j >>>> ava:1395) >>>> at >>>> org.apache.hadoop.io.MapFile$Reader.<init>(MapFile.java:263) >>>> at >>>> org.apache.hadoop.io.MapFile$Reader.<init>(MapFile.java:242) >>>> at >>>> org.apache.hadoop.hbase.HStoreFile$HbaseMapFile$HbaseReader.<i >>>> >>>> >>> nit>(HStoreFile.java:554) >>> >>> >>>> at >>>> org.apache.hadoop.hbase.HStoreFile$BloomFilterMapFile$Reader.< >>>> >>>> >>> init>(HStoreFile.java:609) >>> >>> >>>> at >>>> org.apache.hadoop.hbase.HStoreFile.getReader(HStoreFile.java:382) >>>> at org.apache.hadoop.hbase.HStore.<init>(HStore.java:849) >>>> at org.apache.hadoop.hbase.HRegion.<init>(HRegion.java:431) >>>> at >>>> org.apache.hadoop.hbase.HRegionServer.openRegion(HRegionServer >>>> .java:1258) >>>> at >>>> org.apache.hadoop.hbase.HRegionServer$Worker.run(HRegionServer >>>> .java:1204) >>>> at java.lang.Thread.run(Thread.java:619) >>>> >>>> >>>> >>>> On Tue, May 20, 2008 at 4:19 PM, stack <[EMAIL PROTECTED]> wrote: >>>> >>>> >>>> >>>>> Daniel Leffel wrote: >>>>> >>>>> >>>>> >>>>>> After experiencing a region server that would not exit >>>>>> >>>>>> >>>>> (HBASE-617), I >>>> >>>> >>>>> tried to bring back up hbase (after first having shut down and >>>>>> bringing back up DFS). >>>>>> >>>>>> There are around 370 regions. The first 250 were assigned >>>>>> >>>>>> >>>>> to region >>>> >>>> >>>>> servers within 5 minutes of startup. The rest of the >>>>>> >>>>>> >>>>> regions took the >>>> >>>> >>>>> better part of the day to become assigned to a region >>>>>> >>>>>> >>>>> server. A quick >>>> >>>> >>>>> inspection of the regionserver logs were showing messages like the >>>>>> following: >>>>>> >>>>>> 2008-05-20 18:33:46,964 DEBUG org.apache.hadoop.hbase.HMaster: >>>>>> Received MSG_REPORT_PROCESS_OPEN : >>>>>> >>>>>> >>>>> categories,2864153,1211005494348 >>>> >>>> >>>>> from 10.254.26.31:60020 >>>>>> >>>>>> >>>>>> >>>>>> >>>>> These messages are sent over to the master by the regionserver as a >>>>> kind of ping saying "I'm still alive and working on >>>>> >>>>> >>>> whatever it was you gave me". >>>> >>>> >>>>> Can you tell what was happening by looking in regionserver logs? >>>>> >>>>> Was it that all regions had been given to a single >>>>> >>>>> >>>> regionserver and it >>>> >>>> >>>>> was busy replaying edits before bringing the regions online >>>>> >>>>> >>>> (There is >>>> >>>> >>>>> a single worker thread per regionserver. If lots of edits >>>>> >>>>> >>>> to replay, >>>> >>>> >>>>> can take seconds to minutes to bring on a region). >>>>> >>>>> Did the regions come online gradually or all in a lump? >>>>> >>>>> >>>>> >>>>>> After waiting for all the regions to be assigned (and an >>>>>> >>>>>> >>>>> absence of >>>> >>>> >>>>> the above message appearing in the log), I started a MapReduce job >>>>>> that iterates over all regions. Immediately, the above mentioned >>>>>> region began to show up in the logs again with the above >>>>>> >>>>>> >>>>> message and >>>> >>>> >>>>> the job failed with an IOException because it couldn't >>>>>> >>>>>> >>>>> locate blocks. >>>> >>>> >>>>> I ran fsck on /hbase and sure enough, blocks are missing from the >>>>>> following file (although it reports a size of 0 as what's >>>>>> >>>>>> >>>>> missing - I >>>> >>>> >>>>> presume it just doesn't know): >>>>>> >>>>>> /hbase/log_10.254.30.79_1211300015031_60020/hlog.dat.000 >>>>>> >>>>>> >>>>>> >>>>>> >>>>> The above looks like the innocuous messages described in >>>>> https://issues.apache.org/jira/browse/HBASE-509. >>>>> >>>>> St.Ack >>>>> >>>>> >>>>> >>>>> >>>> No virus found in this incoming message. >>>> Checked by AVG. >>>> Version: 8.0.100 / Virus Database: 269.23.21/1456 - Release >>>> Date: 5/20/2008 6:45 AM >>>> >>>> >>>> >>> No virus found in this outgoing message. >>> Checked by AVG. >>> Version: 8.0.100 / Virus Database: 269.23.21/1456 - Release Date: >>> 5/20/2008 >>> 6:45 AM >>> >>> >>> >> >> >> > >