Thanks for the understanding. Can you log a JIRA and put your ideas below in it?
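Eran's suggestion below — that each RegionServer periodically verify it is still the rightful owner of every region it holds, and relinquish any region it is not — could look roughly like the sketch that follows. This is not HBase code: `authoritative` stands in for whatever the cluster treats as the source of truth for assignments (.META./ZooKeeper), and all class and variable names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of a periodic ownership self-check on a RegionServer.
public class OwnershipCheck {
    // Authoritative view: encoded region name -> server the master assigned it to.
    static final Map<String, String> authoritative = new ConcurrentHashMap<>();

    // Regions this server holds but no longer rightfully owns; it should close these.
    static List<String> staleRegions(String self, Iterable<String> openRegions) {
        List<String> stale = new ArrayList<>();
        for (String region : openRegions) {
            String owner = authoritative.get(region);
            if (!self.equals(owner)) {
                stale.add(region); // reassigned elsewhere: relinquish it
            }
        }
        return stale;
    }

    public static void main(String[] args) {
        // The master has (re)assigned the region to hadoop1-s02, but
        // hadoop1-s05 still has it open -- the double-assignment case.
        authoritative.put("584dac5cc70d8682f71c4675a843c309", "hadoop1-s02");
        List<String> stale = staleRegions("hadoop1-s05",
                List.of("584dac5cc70d8682f71c4675a843c309"));
        System.out.println("hadoop1-s05 should relinquish: " + stale);
    }
}
```

Running such a check on a timer would bound the window in which two servers can serve (and compact) the same region, instead of relying on assignment-time coordination alone.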
On Jul 4, 2011, at 12:42 AM, Eran Kutner <e...@gigya.com> wrote:

> Thanks for the explanation Ted,
>
> I will try to apply HBASE-3789 and hope for the best, but my understanding is
> that it doesn't really solve the problem, it only reduces the probability of
> it happening, at least in one particular scenario. I would hope for a more
> robust solution.
> My concern is that the region allocation process seems to rely too much on
> timing considerations and doesn't seem to take enough measures to guarantee
> conflicts do not occur. I understand that in a distributed environment, when
> you don't get a timely response from a remote machine, you can't know for
> sure whether it did or did not receive the request; however, there are things
> that can be done to mitigate this and reduce the conflict time significantly.
> For example, when I run hbck it knows that some regions are multiply
> assigned; the master could do the same and try to resolve the conflict.
> Another approach would be to handle late responses: even if the response from
> the remote machine arrives after it was assumed to be dead, the master should
> have enough information to know it had created a conflict by assigning the
> region to another server. An even better solution, I think, is for the RS to
> periodically test that it is indeed the rightful owner of every region it
> holds, and relinquish control over the region if it's not.
> Obviously a state where two RSs hold the same region is pathological and can
> lead to data loss, as demonstrated in my case. The system should be able to
> actively protect itself against such a scenario. It probably doesn't need
> saying, but there is really nothing worse for a data storage system than
> data loss.
>
> In my case the problem didn't happen in the initial phase but after
> disabling and enabling a table with about 12K regions.
>
> -eran
>
> On Sun, Jul 3, 2011 at 23:49, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> Let me try to answer some of your questions.
>> The two paragraphs below follow my reasoning, which is in reverse
>> order of the actual call sequence.
>>
>> For #4 below, the log indicates that the following was executed:
>>
>>   private void assign(final RegionState state, final boolean setOfflineInZK,
>>       final boolean forceNewPlan) {
>>     for (int i = 0; i < this.maximumAssignmentAttempts; i++) {
>>       if (setOfflineInZK && !setOfflineInZooKeeper(state)) return;
>>
>> The above was due to the timeout you noted in #2, which would have caused
>> TimeoutMonitor.chore() to run this code (line 1787):
>>
>>   for (Map.Entry<HRegionInfo, Boolean> e: assigns.entrySet()) {
>>     assign(e.getKey(), false, e.getValue());
>>   }
>>
>> This means there is a lack of coordination between
>> AssignmentManager.TimeoutMonitor and OpenedRegionHandler.
>>
>> The reason I mention HBASE-3789 is that it is marked as an incompatible
>> change and is in TRUNK already. Applying HBASE-3789 to the 0.90 branch
>> would change the behavior (timing) of region assignment.
>>
>> I think it makes sense to evaluate the effect of HBASE-3789 in 0.90.4.
>>
>> BTW, were the incorrect region assignments observed for a table with
>> multiple initial regions? If so, I have HBASE-4010 in TRUNK, which speeds
>> up initial region assignment by about 50%.
>>
>> Cheers
>>
>> On Sun, Jul 3, 2011 at 12:02 PM, Eran Kutner <e...@gigya.com> wrote:
>>
>>> Ted,
>>> So if I understand correctly, the theory is that because of the issue
>>> fixed in HBASE-3789 the master took too long to detect that the region
>>> was successfully opened by the first server, so it force-closed it and
>>> transitioned it to a second server. But there are a few things about this
>>> scenario I don't understand, probably because I don't know enough about
>>> the inner workings of the region transition process, and would appreciate
>>> it if you can help me understand:
>>> 1. The RS opened the region at 16:37:49.
>>> 2. The master started handling the opened event at 16:39:54 - this delay
>>> can probably be explained by HBASE-3789.
>>> 3. At 16:39:54 the master log says: Opened region gs_raw_events,..... on
>>> hadoop1-s05.farm-ny.gigya.com
>>> 4. Then at 16:40:00 the master log says: master:60000-0x13004a31d7804c4
>>> Creating (or updating) unassigned node for
>>> 584dac5cc70d8682f71c4675a843c309 with OFFLINE state - why did it decide
>>> to take the region offline after learning it was successfully opened?
>>> 5. Then it tries to reopen the region on hadoop1-s05, which indicates in
>>> its log that the open request failed because the region was already open
>>> - why didn't the master use that information to learn that the region was
>>> already open?
>>> 6. At 16:43:57 the master decides the region transition timed out and
>>> starts forcing the transition - HBASE-3789 again?
>>> 7. Now the master forces the transition of the region to hadoop1-s02, but
>>> there is no sign of that on hadoop1-s05 - why doesn't the old RS
>>> (hadoop1-s05) detect that it is no longer the region's owner and
>>> relinquish control of the region?
>>>
>>> Thanks.
>>>
>>> -eran
>>>
>>> On Sun, Jul 3, 2011 at 20:09, Ted Yu <yuzhih...@gmail.com> wrote:
>>>
>>>> HBASE-3789 should have sped up region assignment.
>>>> The patch for 0.90 is attached to that JIRA.
>>>>
>>>> You may prudently apply that patch.
>>>>
>>>> Regards
>>>>
>>>> On Sun, Jul 3, 2011 at 10:01 AM, Eran Kutner <e...@gigya.com> wrote:
>>>>
>>>>> Thanks Ted, but, as stated before, I'm already using 0.90.3, so either
>>>>> it's not fixed or it's not the same thing.
>>>>>
>>>>> -eran
>>>>>
>>>>> On Sun, Jul 3, 2011 at 17:27, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>>
>>>>>> Eran:
>>>>>> I was thinking of this:
>>>>>> HBASE-3789 Cleanup the locking contention in the master
>>>>>>
>>>>>> though it doesn't directly handle the 'PENDING_OPEN for too long'
>>>>>> case.
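Ted's analysis above — that the TimeoutMonitor can force a new assignment while the first server's OPENED event is still waiting to be handled — can be illustrated with a toy model. This is not HBase code; all names here are illustrative, and the model only captures the ordering problem: the RS opens the region as soon as it is asked, but the master notices only when it later processes the OPENED event.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// Toy model of the double-assignment race discussed in this thread.
public class DoubleAssignRace {
    static final Map<String, Set<String>> holders = new HashMap<>(); // region -> servers holding it
    static final Queue<String[]> openedEvents = new ArrayDeque<>();  // delayed {region, server} OPENED acks

    static void assign(String region, String server) {
        // The RS opens immediately; the master will process OPENED only later.
        holders.computeIfAbsent(region, k -> new HashSet<>()).add(server);
        openedEvents.add(new String[] {region, server});
    }

    static void timeoutMonitorFires(String region, String newServer) {
        // "PENDING_OPEN for too long": force OFFLINE and reassign, without
        // checking whether the first server already opened the region.
        assign(region, newServer);
    }

    public static void main(String[] args) {
        assign("region-584dac", "hadoop1-s05");              // first assignment
        timeoutMonitorFires("region-584dac", "hadoop1-s02"); // fires before the OPENED ack drains
        while (!openedEvents.isEmpty()) openedEvents.poll(); // master catches up -- too late
        System.out.println(holders.get("region-584dac"));    // both servers now hold the region
    }
}
```

In the toy model, as in the logs below, nothing ever tells the first server to close the region, so both servers serve (and eventually compact) the same files.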
>>>>>> >>>>>> https://issues.apache.org/jira/browse/HBASE-3741 is in 0.90.3 and >>>>> actually >>>>>> close to the symptom you described. >>>>>> >>>>>> On Sun, Jul 3, 2011 at 12:00 AM, Eran Kutner <e...@gigya.com> >> wrote: >>>>>> >>>>>>> It does seem that both servers opened the same region around the >>> same >>>>>> time. >>>>>>> The region was offline because I disabled the table so I can >> change >>>> its >>>>>>> TTL. >>>>>>> >>>>>>> Here is the log from haddop1-s05 : >>>>>>> 2011-06-29 16:37:12,576 INFO >>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: Received >>> request >>>> to >>>>>>> open >>>>>>> region: >>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. >>>>>>> 2011-06-29 16:37:12,680 DEBUG >>>>>>> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: >>>>>> Processing >>>>>>> open of >>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. 
>>>>>>> 2011-06-29 16:37:12,680 DEBUG >>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign: >>>>>>> regionserver:60020-0x33004a38816050b Attempting to transition >> node >>>>>>> 584dac5cc70d8682f71c4675a843c309 from M_ZK_REGION_OFFLINE to >>>>>>> RS_ZK_REGION_OPENING >>>>>>> 2011-06-29 16:37:12,711 DEBUG >>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign: >>>>>>> regionserver:60020-0x33004a38816050b Successfully transitioned >> node >>>>>>> 584dac5cc70d8682f71c4675a843c309 from M_ZK_REGION_OFFLINE to >>>>>>> RS_ZK_REGION_OPENING >>>>>>> 2011-06-29 16:37:12,711 DEBUG >>>>>> org.apache.hadoop.hbase.regionserver.HRegion: >>>>>>> Opening region: REGION => {NAME => >>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> 'gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.', >>>>>>> STARTKEY => 'GSLoad_1308518553_168_WEB204', ENDKEY => >>>>>>> 'GSLoad_1308518810_1249_WEB204', ENCODED => >>>>>>> 584dac5cc70d8682f71c4675a843c309, TABLE => {{NAME => >>> 'gs_raw_events', >>>>>>> FAMILIES => [{NAME => 'events', BLOOMFILTER => 'NONE', >>>>> REPLICATION_SCOPE >>>>>> => >>>>>>> '1', VERSIONS => '3', COMPRESSION => 'LZO', TTL => '604800', >>>> BLOCKSIZE >>>>> => >>>>>>> '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}} >>>>>>> 2011-06-29 16:37:12,711 DEBUG >>>>>> org.apache.hadoop.hbase.regionserver.HRegion: >>>>>>> Instantiated >>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. 
>>>>>>> 2011-06-29 16:37:12,847 DEBUG >>>>> org.apache.hadoop.hbase.regionserver.Store: >>>>>>> loaded >>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> hdfs://hadoop1-m1:8020/hbase/gs_raw_events/584dac5cc70d8682f71c4675a843c309/events/1971818821800304360, >>>>>>> isReference=false, isBulkLoadResult=false, seqid=1162228062, >>>>>>> majorCompaction=false >>>>>>> 2011-06-29 16:37:12,848 INFO >>>>>> org.apache.hadoop.hbase.regionserver.HRegion: >>>>>>> Onlined >>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.; >>>>>>> next sequenceid=1162228063 >>>>>>> 2011-06-29 16:37:12,849 DEBUG >>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign: >>>>>>> regionserver:60020-0x33004a38816050b Attempting to transition >> node >>>>>>> 584dac5cc70d8682f71c4675a843c309 from RS_ZK_REGION_OPENING to >>>>>>> RS_ZK_REGION_OPENING >>>>>>> 2011-06-29 16:37:12,875 DEBUG >>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign: >>>>>>> regionserver:60020-0x33004a38816050b Successfully transitioned >> node >>>>>>> 584dac5cc70d8682f71c4675a843c309 from RS_ZK_REGION_OPENING to >>>>>>> RS_ZK_REGION_OPENING >>>>>>> 2011-06-29 16:37:12,951 INFO >>>>> org.apache.hadoop.hbase.catalog.MetaEditor: >>>>>>> Updated row >>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. 
>>>>>>> in region .META.,,1 with server= >>> hadoop1-s05.farm-ny.gigya.com:60020, >>>>>>> startcode=1307349217076 >>>>>>> 2011-06-29 16:37:12,951 DEBUG >>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign: >>>>>>> regionserver:60020-0x33004a38816050b Attempting to transition >> node >>>>>>> 584dac5cc70d8682f71c4675a843c309 from RS_ZK_REGION_OPENING to >>>>>>> RS_ZK_REGION_OPENED >>>>>>> 2011-06-29 16:37:12,964 DEBUG >>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign: >>>>>>> regionserver:60020-0x33004a38816050b Successfully transitioned >> node >>>>>>> 584dac5cc70d8682f71c4675a843c309 from RS_ZK_REGION_OPENING to >>>>>>> RS_ZK_REGION_OPENED >>>>>>> 2011-06-29 16:37:12,964 DEBUG >>>>>>> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: >>>> Opened >>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. >>>>>>> 2011-06-29 16:40:00,878 INFO >>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: Received >>> request >>>> to >>>>>>> open >>>>>>> region: >>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. >>>>>>> 2011-06-29 16:40:00,878 DEBUG >>>>>>> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: >>>>>> Processing >>>>>>> open of >>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. >>>>>>> 2011-06-29 16:40:01,079 WARN >>>>>>> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: >>>>> Attempted >>>>>>> open of >>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. 
>>>>>>> but already online on this server >>>>>>> 2011-06-29 16:43:50,395 DEBUG >>>>>>> org.apache.hadoop.hbase.regionserver.CompactSplitThread: >> Compaction >>>>>> (major) >>>>>>> requested for >>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. >>>>>>> because User-triggered major compaction; priority=1, compaction >>> queue >>>>>>> size=1248 >>>>>>> 2011-06-29 20:19:49,906 INFO >>>>>> org.apache.hadoop.hbase.regionserver.HRegion: >>>>>>> Starting major compaction on region >>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. >>>>>>> 2011-06-29 20:19:49,906 INFO >>>>> org.apache.hadoop.hbase.regionserver.Store: >>>>>>> Started compaction of 1 file(s) in cf=events into >>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> hdfs://hadoop1-m1:8020/hbase/gs_raw_events/584dac5cc70d8682f71c4675a843c309/.tmp, >>>>>>> seqid=1162228062, totalSize=98.3m >>>>>>> 2011-06-29 20:19:49,906 DEBUG >>>>> org.apache.hadoop.hbase.regionserver.Store: >>>>>>> Compacting >>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> hdfs://hadoop1-m1:8020/hbase/gs_raw_events/584dac5cc70d8682f71c4675a843c309/events/1971818821800304360, >>>>>>> keycount=6882816, bloomtype=NONE, size=98.3m >>>>>>> 2011-06-29 20:19:59,920 INFO >>>>>> org.apache.hadoop.hbase.regionserver.HRegion: >>>>>>> completed compaction on region >>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. >>>>>>> after 10sec >>>>>>> >>>>>>> And here is the one from hadoop1-s02: >>>>>>> 2011-06-29 16:43:57,935 INFO >>>>>>> org.apache.hadoop.hbase.regionserver.HRegionServer: Received >>> request >>>> to >>>>>>> open >>>>>>> region: >>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. 
>>>>>>> 2011-06-29 16:43:58,990 DEBUG >>>>>>> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: >>>>>> Processing >>>>>>> open of >>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. >>>>>>> 2011-06-29 16:43:58,990 DEBUG >>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign: >>>>>>> regionserver:60020-0x23004a31d8904de Attempting to transition >> node >>>>>>> 584dac5cc70d8682f71c4675a843c309 from M_ZK_REGION_OFFLINE to >>>>>>> RS_ZK_REGION_OPENING >>>>>>> 2011-06-29 16:43:59,002 DEBUG >>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign: >>>>>>> regionserver:60020-0x23004a31d8904de Successfully transitioned >> node >>>>>>> 584dac5cc70d8682f71c4675a843c309 from M_ZK_REGION_OFFLINE to >>>>>>> RS_ZK_REGION_OPENING >>>>>>> 2011-06-29 16:43:59,002 DEBUG >>>>>> org.apache.hadoop.hbase.regionserver.HRegion: >>>>>>> Opening region: REGION => {NAME => >>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> 'gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.', >>>>>>> STARTKEY => 'GSLoad_1308518553_168_WEB204', ENDKEY => >>>>>>> 'GSLoad_1308518810_1249_WEB204', ENCODED => >>>>>>> 584dac5cc70d8682f71c4675a843c309, TABLE => {{NAME => >>> 'gs_raw_events', >>>>>>> FAMILIES => [{NAME => 'events', BLOOMFILTER => 'NONE', >>>>> REPLICATION_SCOPE >>>>>> => >>>>>>> '1', VERSIONS => '3', COMPRESSION => 'LZO', TTL => '604800', >>>> BLOCKSIZE >>>>> => >>>>>>> '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}} >>>>>>> 2011-06-29 16:43:59,003 DEBUG >>>>>> org.apache.hadoop.hbase.regionserver.HRegion: >>>>>>> Instantiated >>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. 
>>>>>>> 2011-06-29 16:43:59,204 DEBUG >>>>> org.apache.hadoop.hbase.regionserver.Store: >>>>>>> loaded >>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> hdfs://hadoop1-m1:8020/hbase/gs_raw_events/584dac5cc70d8682f71c4675a843c309/events/1971818821800304360, >>>>>>> isReference=false, isBulkLoadResult=false, seqid=1162228062, >>>>>>> majorCompaction=false >>>>>>> 2011-06-29 16:43:59,205 INFO >>>>>> org.apache.hadoop.hbase.regionserver.HRegion: >>>>>>> Onlined >>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.; >>>>>>> next sequenceid=1162228063 >>>>>>> 2011-06-29 16:43:59,205 DEBUG >>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign: >>>>>>> regionserver:60020-0x23004a31d8904de Attempting to transition >> node >>>>>>> 584dac5cc70d8682f71c4675a843c309 from RS_ZK_REGION_OPENING to >>>>>>> RS_ZK_REGION_OPENING >>>>>>> 2011-06-29 16:43:59,212 DEBUG >>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign: >>>>>>> regionserver:60020-0x23004a31d8904de Successfully transitioned >> node >>>>>>> 584dac5cc70d8682f71c4675a843c309 from RS_ZK_REGION_OPENING to >>>>>>> RS_ZK_REGION_OPENING >>>>>>> 2011-06-29 16:43:59,214 INFO >>>>> org.apache.hadoop.hbase.catalog.MetaEditor: >>>>>>> Updated row >>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. 
>>>>>>> in region .META.,,1 with server= >>> hadoop1-s02.farm-ny.gigya.com:60020, >>>>>>> startcode=1306919627544 >>>>>>> 2011-06-29 16:43:59,214 DEBUG >>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign: >>>>>>> regionserver:60020-0x23004a31d8904de Attempting to transition >> node >>>>>>> 584dac5cc70d8682f71c4675a843c309 from RS_ZK_REGION_OPENING to >>>>>>> RS_ZK_REGION_OPENED >>>>>>> 2011-06-29 16:43:59,224 DEBUG >>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign: >>>>>>> regionserver:60020-0x23004a31d8904de Successfully transitioned >> node >>>>>>> 584dac5cc70d8682f71c4675a843c309 from RS_ZK_REGION_OPENING to >>>>>>> RS_ZK_REGION_OPENED >>>>>>> 2011-06-29 16:43:59,224 DEBUG >>>>>>> org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: >>>> Opened >>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. >>>>>>> java.io.IOException: Got error in response to OP_READ_BLOCK >> self=/ >>>>>>> 10.1.104.2:33356, remote=/10.1.104.2:50010 for file >>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> /hbase/gs_raw_events/584dac5cc70d8682f71c4675a843c309/events/1971818821800304360 >>>>>>> for block 3674866614142268536_674205 >>>>>>> >>>>>>> >>>>>>> Ted, can you please point me to J-D's bug fix you mentioned? Are >>> you >>>>>>> positive it's the same scenario - data loss is a very serious >>> problem >>>>> for >>>>>> a >>>>>>> DB. >>>>>>> I'd really like to apply that patch ASAP, because when I run hbck >> I >>>> get >>>>>>> over >>>>>>> 400 regions which are multiply assigned. >>>>>>> Last question, I understand the region's data is lost but is >> there >>> a >>>>> way >>>>>> to >>>>>>> at least make the table consistent again by some how removing the >>>> lost >>>>>>> region? >>>>>>> >>>>>>> Thanks. >>>>>>> >>>>>>> -eran >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Sat, Jul 2, 2011 at 01:46, Ted Yu <yuzhih...@gmail.com> >> wrote: >>>>>>> >>>>>>>>>> 2011-06-29 16:43:57,880 INFO >>>>>>>> org.apache.hadoop.hbase. 
>>>>>>>> master.AssignmentManager: Region has been >>>>>>>> PENDING_OPEN for too long, reassigning >>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> region=gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. >>>>>>>> >>>>>>>> The double assignment should have been fixed by J-D's recent >>>> checkin. >>>>>>>> >>>>>>>> On Fri, Jul 1, 2011 at 3:14 PM, Stack <st...@duboce.net> >> wrote: >>>>>>>> >>>>>>>>> Is >>>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. >>>>>>>>> the region that was having the issue? If so, if you looked >> in >>>>>>>>> hadoop1-s05's logs, was this region opened around 2011-06-29 >>>>>> 16:43:57? >>>>>>>>> Was it also opened hadoop1-s02 not long after? Did you say >>> what >>>>>>>>> version of hbase you are on? >>>>>>>>> >>>>>>>>> St.Ack >>>>>>>>> >>>>>>>>> On Fri, Jul 1, 2011 at 5:08 AM, Eran Kutner <e...@gigya.com> >>>>> wrote: >>>>>>>>>> Hi Stack, >>>>>>>>>> I'm not sure what the log means. I do see references to two >>>>>> different >>>>>>>>>> servers, but that would probably happen if there was normal >>>>>>> transition >>>>>>>> I >>>>>>>>>> assume. I'm using version 0.90.3 >>>>>>>>>> Here are the relevant lines from the master logs: >>>>>>>>>> >>>>>>>>>> 2011-06-19 21:39:37,164 INFO >>>>>>>>> org.apache.hadoop.hbase.master.ServerManager: >>>>>>>>>> Received REGION_SPLIT: >>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533691659.9000a5d8df9502efc90d2c23567e4658.: >>>>>>>>>> Daughters; >>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309., >>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> gs_raw_events,GSLoad_1308518810_1249_WEB204,1308533970928.46f876a4e97be04edb35eb8f8959d482. 
>>>>>>>>>> from hadoop1-s05.farm-ny.gigya.com,60020,1307349217076 >>>>>>>>>> 2011-06-19 21:43:12,983 INFO >>>>>>>> org.apache.hadoop.hbase.catalog.MetaEditor: >>>>>>>>>> Deleted daughter reference >>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309., >>>>>>>>>> qualifier=splitA, from parent >>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533691659.9000a5d8df9502efc90d2c23567e4658. >>>>>>>>>> 2011-06-29 16:29:36,143 DEBUG >>>>>>>>>> org.apache.hadoop.hbase.master.AssignmentManager: Starting >>>>>>> unassignment >>>>>>>>> of >>>>>>>>>> region >>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. >>>>>>>>>> (offlining) >>>>>>>>>> 2011-06-29 16:29:36,146 DEBUG >>>>>>>>>> org.apache.hadoop.hbase.master.AssignmentManager: Sent >> CLOSE >>> to >>>>>>>>> serverName= >>>>>>>>>> hadoop1-s05.farm-ny.gigya.com,60020,1307349217076, >>>>>> load=(requests=0, >>>>>>>>>> regions=1654, usedHeap=1870, maxHeap=12483) for region >>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. 
>>>>>>>>>> 2011-06-29 16:29:38,327 DEBUG >>>>>>>>>> org.apache.hadoop.hbase.master.AssignmentManager: Handling >>> new >>>>>>>> unassigned >>>>>>>>>> node: /hbase/unassigned/584dac5cc70d8682f71c4675a843c309 >>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> (region=gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309., >>>>>>>>>> server=hadoop1-s05.farm-ny.gigya.com,60020,1307349217076, >>>>>>>>>> state=RS_ZK_REGION_CLOSED) >>>>>>>>>> 2011-06-29 16:29:38,327 DEBUG >>>>>>>>>> org.apache.hadoop.hbase.master.AssignmentManager: Handling >>>>>>>>>> transition=RS_ZK_REGION_CLOSED, >>>>>>>>>> server=hadoop1-s05.farm-ny.gigya.com,60020,1307349217076, >>>>>>>>>> region=584dac5cc70d8682f71c4675a843c309 >>>>>>>>>> 2011-06-29 16:30:53,742 DEBUG >>>>>>>>>> org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: >>>>>> Handling >>>>>>>>> CLOSED >>>>>>>>>> event for 584dac5cc70d8682f71c4675a843c309 >>>>>>>>>> 2011-06-29 16:30:53,742 DEBUG >>>>>>>>>> org.apache.hadoop.hbase.master.AssignmentManager: Table >> being >>>>>>> disabled >>>>>>>> so >>>>>>>>>> deleting ZK node and removing from regions in transition, >>>>> skipping >>>>>>>>>> assignment of region >>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. 
>>>>>>>>>> 2011-06-29 16:30:53,742 DEBUG >>>>>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign: >>>>>>>>>> master:60000-0x13004a31d7804c4 Deleting existing unassigned >>>> node >>>>>> for >>>>>>>>>> 584dac5cc70d8682f71c4675a843c309 that is in expected state >>>>>>>>>> RS_ZK_REGION_CLOSED >>>>>>>>>> 2011-06-29 16:30:53,801 DEBUG >>>>>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign: >>>>>>>>>> master:60000-0x13004a31d7804c4 Successfully deleted >>> unassigned >>>>> node >>>>>>> for >>>>>>>>>> region 584dac5cc70d8682f71c4675a843c309 in expected state >>>>>>>>>> RS_ZK_REGION_CLOSED >>>>>>>>>> 2011-06-29 16:34:01,453 INFO >>>>>>>> org.apache.hadoop.hbase.catalog.MetaEditor: >>>>>>>>>> Updated region >>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. >>>>>>>>>> in META >>>>>>>>>> 2011-06-29 16:37:12,247 DEBUG >>>>>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign: >>>>>>>>>> master:60000-0x13004a31d7804c4 Creating (or updating) >>>> unassigned >>>>>> node >>>>>>>> for >>>>>>>>>> 584dac5cc70d8682f71c4675a843c309 with OFFLINE state >>>>>>>>>> 2011-06-29 16:37:12,576 DEBUG >>>>>>>>>> org.apache.hadoop.hbase.master.AssignmentManager: No >> previous >>>>>>>> transition >>>>>>>>>> plan was found (or we are ignoring an existing plan) for >>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. 
>>>>>>>>>> so generated a random one; >>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> hri=gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309., >>>>>>>>>> src=, dest=hadoop1-s05.farm-ny.gigya.com >>> ,60020,1307349217076; >>>> 5 >>>>>>>>> (online=5, >>>>>>>>>> exclude=null) available servers >>>>>>>>>> 2011-06-29 16:37:12,576 DEBUG >>>>>>>>>> org.apache.hadoop.hbase.master.AssignmentManager: Assigning >>>>> region >>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. >>>>>>>>>> to hadoop1-s05.farm-ny.gigya.com,60020,1307349217076 >>>>>>>>>> 2011-06-29 16:37:13,102 DEBUG >>>>>>>>>> org.apache.hadoop.hbase.master.AssignmentManager: Handling >>>>>>>>>> transition=RS_ZK_REGION_OPENED, >>>>>>>>>> server=hadoop1-s05.farm-ny.gigya.com,60020,1307349217076, >>>>>>>>>> region=584dac5cc70d8682f71c4675a843c309 >>>>>>>>>> 2011-06-29 16:39:54,075 DEBUG >>>>>>>>>> org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: >>>>>> Handling >>>>>>>>> OPENED >>>>>>>>>> event for 584dac5cc70d8682f71c4675a843c309; deleting >>> unassigned >>>>>> node >>>>>>>>>> 2011-06-29 16:39:54,075 DEBUG >>>>>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign: >>>>>>>>>> master:60000-0x13004a31d7804c4 Deleting existing unassigned >>>> node >>>>>> for >>>>>>>>>> 584dac5cc70d8682f71c4675a843c309 that is in expected state >>>>>>>>>> RS_ZK_REGION_OPENED >>>>>>>>>> 2011-06-29 16:39:54,192 DEBUG >>>>>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign: >>>>>>>>>> master:60000-0x13004a31d7804c4 Successfully deleted >>> unassigned >>>>> node >>>>>>> for >>>>>>>>>> region 584dac5cc70d8682f71c4675a843c309 in expected state >>>>>>>>>> RS_ZK_REGION_OPENED >>>>>>>>>> 2011-06-29 16:39:54,326 DEBUG >>>>>>>>>> org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: >>>>> Opened >>>>>>>> region >>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> 
gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. >>>>>>>>>> on hadoop1-s05.farm-ny.gigya.com,60020,1307349217076 >>>>>>>>>> 2011-06-29 16:40:00,598 DEBUG >>>>>>>> org.apache.hadoop.hbase.zookeeper.ZKAssign: >>>>>>>>>> master:60000-0x13004a31d7804c4 Creating (or updating) >>>> unassigned >>>>>> node >>>>>>>> for >>>>>>>>>> 584dac5cc70d8682f71c4675a843c309 with OFFLINE state >>>>>>>>>> 2011-06-29 16:40:00,877 DEBUG >>>>>>>>>> org.apache.hadoop.hbase.master.AssignmentManager: No >> previous >>>>>>>> transition >>>>>>>>>> plan was found (or we are ignoring an existing plan) for >>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. >>>>>>>>>> so generated a random one; >>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> hri=gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309., >>>>>>>>>> src=, dest=hadoop1-s05.farm-ny.gigya.com >>> ,60020,1307349217076; >>>> 5 >>>>>>>>> (online=5, >>>>>>>>>> exclude=null) available servers >>>>>>>>>> 2011-06-29 16:40:00,877 DEBUG >>>>>>>>>> org.apache.hadoop.hbase.master.AssignmentManager: Assigning >>>>> region >>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. >>>>>>>>>> to hadoop1-s05.farm-ny.gigya.com,60020,1307349217076 >>>>>>>>>> 2011-06-29 16:43:57,879 INFO >>>>>>>>>> org.apache.hadoop.hbase.master.AssignmentManager: Regions >> in >>>>>>> transition >>>>>>>>>> timed out: >>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. 
>>> state=PENDING_OPEN, ts=1309380052723
>>> 2011-06-29 16:43:57,880 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_OPEN for too long, reassigning region=gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.
>>> 2011-06-29 16:43:57,936 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; was=gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. state=PENDING_OPEN, ts=1309380052723
>>> 2011-06-29 16:43:57,936 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: No previous transition plan was found (or we are ignoring an existing plan) for gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. so generated a random one; hri=gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309., src=, dest=hadoop1-s02.farm-ny.gigya.com,60020,1306919627544; 5 (online=5, exclude=null) available servers
>>> 2011-06-29 16:43:57,936 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. to hadoop1-s02.farm-ny.gigya.com,60020,1306919627544
>>> 2011-06-29 16:43:59,022 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=hadoop1-s02.farm-ny.gigya.com,60020,1306919627544, region=584dac5cc70d8682f71c4675a843c309
>>> 2011-06-29 16:43:59,221 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=hadoop1-s02.farm-ny.gigya.com,60020,1306919627544, region=584dac5cc70d8682f71c4675a843c309
>>> 2011-06-29 16:43:59,226 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENED, server=hadoop1-s02.farm-ny.gigya.com,60020,1306919627544, region=584dac5cc70d8682f71c4675a843c309
>>> 2011-06-29 16:43:59,274 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Handling OPENED event for 584dac5cc70d8682f71c4675a843c309; deleting unassigned node
>>> 2011-06-29 16:43:59,274 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x13004a31d7804c4 Deleting existing unassigned node for 584dac5cc70d8682f71c4675a843c309 that is in expected state RS_ZK_REGION_OPENED
>>> 2011-06-29 16:43:59,296 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x13004a31d7804c4 Successfully deleted unassigned node for region 584dac5cc70d8682f71c4675a843c309 in expected state RS_ZK_REGION_OPENED
>>> 2011-06-29 16:43:59,375 WARN org.apache.hadoop.hbase.master.AssignmentManager: Overwriting 584dac5cc70d8682f71c4675a843c309 on serverName=hadoop1-s05.farm-ny.gigya.com,60020,1307349217076, load=(requests=0, regions=1273, usedHeap=2676, maxHeap=12483)
>>> 2011-06-29 16:43:59,375 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. on hadoop1-s02.farm-ny.gigya.com,60020,1306919627544
>>>
>>> Thanks.
>>>
>>> -eran
>>>
>>> On Fri, Jul 1, 2011 at 09:05, Stack <st...@duboce.net> wrote:
>>>
>>>> So, Eran, it seems as though two RegionServers were carrying the region? One deleted a file (compaction on its side)? Can you figure out whether two servers did indeed hold the same region? (Check the master logs for this region's assignments.)
>>>>
>>>> What version of HBase?
>>>>
>>>> St.Ack
>>>>
>>>> On Thu, Jun 30, 2011 at 3:58 AM, Eran Kutner <e...@gigya.com> wrote:
>>>>> Hi,
>>>>> I have a cluster of 5 nodes with one large table that currently has around 12000 regions. Everything was working fine for a relatively long time, until now.
>>>>> Yesterday I significantly reduced the TTL on the table and initiated a major compaction. This should have reduced the table size to about 20% of its original size.
>>>>> Today I'm getting errors about inaccessible files on HDFS, for example:
>>>>>
>>>>> java.io.IOException: Got error in response to OP_READ_BLOCK self=/10.1.104.2:58047, remote=/10.1.104.2:50010 for file /hbase/gs_raw_events/584dac5cc70d8682f71c4675a843c309/events/1971818821800304360 for block 3674866614142268536_674205
>>>>>         at org.apache.hadoop.hdfs.DFSClient$BlockReader.newBlockReader(DFSClient.java:1487)
>>>>>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1811)
>>>>>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1948)
>>>>>         at java.io.DataInputStream.read(DataInputStream.java:132)
>>>>>         at org.apache.hadoop.hbase.io.hfile.BoundedRangeFileInputStream.read(BoundedRangeFileInputStream.java:105)
>>>>>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>>>>>         at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
>>>>>         at org.apache.hadoop.io.compress.BlockDecompressorStream.rawReadInt(BlockDecompressorStream.java:128)
>>>>>         at org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:68)
>>>>>         at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:75)
>>>>>         at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
>>>>>         at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
>>>>>         at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:102)
>>>>>         at org.apache.hadoop.hbase.io.hfile.HFile$Reader.decompress(HFile.java:1094)
>>>>>         at org.apache.hadoop.hbase.io.hfile.HFile$Reader.readBlock(HFile.java:1036)
>>>>>         at org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.seekTo(HFile.java:1433)
>>>>>         at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:139)
>>>>>         at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:96)
>>>>>         at org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:77)
>>>>>         at org.apache.hadoop.hbase.regionserver.Store.getScanner(Store.java:1341)
>>>>>         at org.apache.hadoop.hbase.regionserver.HRegion$RegionScanner.<init>(HRegion.java:2269)
>>>>>         at org.apache.hadoop.hbase.regionserver.HRegion.instantiateInternalScanner(HRegion.java:1126)
>>>>>         at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1118)
>>>>>         at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1102)
>>>>>         at org.apache.hadoop.hbase.regionserver.HRegionServer.openScanner(HRegionServer.java:1781)
>>>>>         at sun.reflect.GeneratedMethodAccessor46.invoke(Unknown Source)
>>>>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>         at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>         at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
>>>>>         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)
>>>>>
>>>>> I checked, and the file indeed doesn't exist on HDFS. Here are the name node logs for this block; apparently it was deleted:
>>>>> 2011-06-19 21:39:36,651 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.allocateBlock: /hbase/gs_raw_events/584dac5cc70d8682f71c4675a843c309/.tmp/2096863423111131624. blk_3674866614142268536_674205
>>>>> 2011-06-19 21:40:11,954 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.1.104.2:50010 is added to blk_3674866614142268536_674205 size 67108864
>>>>> 2011-06-19 21:40:11,954 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.1.104.3:50010 is added to blk_3674866614142268536_674205 size 67108864
>>>>> 2011-06-19 21:40:11,955 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.1.104.5:50010 is added to blk_3674866614142268536_674205 size 67108864
>>>>> 2011-06-29 20:20:01,662 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 10.1.104.2:50010 to delete blk_3674866614142268536_674205
>>>>> 2011-06-29 20:20:13,671 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 10.1.104.5:50010 to delete blk_-4056387895369608597_675174 blk_-5017882805850873821_672281 blk_702373987100607684_672288 blk_-5357157478043290010_668506 blk_7118175133735412789_674903 blk_-3569812563715986384_675231 blk_8296855057240604851_669285 blk_-6483679172530609101_674268 blk_8738539715363739108_673682 blk_1744841904626813502_675238 blk_-6035315106100051103_674266 blk_-1789501623010070237_674908 blk_1944054629336265129_673689 blk_3674866614142268536_674205 blk_7930425446738143892_647410 blk_-3007186753042268449_669296 blk_-5482302621772778061_647416 blk_-3765735404924932181_672004 blk_7476090998956811081_675169 blk_7862291659285127712_646890 blk_-2666244746343584727_672013 blk_6039172613960915602_674206 blk_-8470884397893086564_646899 blk_4558230221166712802_668510
>>>>> 2011-06-29 20:20:46,698 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 10.1.104.3:50010 to delete blk_-7851606440036350812_671552 blk_9214649160203453845_647566 blk_702373987100607684_672288 blk_5958099369749234073_668143 blk_-5172218034084903173_673109 blk_-2934555181472719276_646476 blk_-1409986679370073931_672552 blk_-2786034325506235869_669086 blk_3674866614142268536_674205 blk_510158930393283118_673225 blk_916244738216205237_677068 blk_-4317027806407316617_670379 blk_8555705688850972639_673485 blk_-3765735404924932181_672004 blk_-5482302621772778061_647416 blk_-2461801145731752623_674605 blk_-8737702908048998927_672549 blk_-8470884397893086564_646899 blk_4558230221166712802_668510 blk_-4056387895369608597_675174 blk_-8675430610673886073_647695 blk_-6642870230256028318_668211 blk_-3890408516362176771_677483 blk_-3569812563715986384_675231 blk_-5007142629771321873_674548 blk_-3345355191863431669_667066 blk_8296855057240604851_669285 blk_-6595462308187757470_672420 blk_-2583945228783203947_674607 blk_-346988625120916345_677063 blk_4449525876338684218_674496 blk_2617172363857549730_668201 blk_8738539715363739108_673682 blk_-208904675456598428_679286 blk_-497549341281882641_646477 blk_-6035315106100051103_674266 blk_-2356539038067297411_672388 blk_-3881703084497103249_668137 blk_2214397881104950315_646643 blk_-5907671443455357710_673223 blk_-2431880309956605679_669204 blk_6039172613960915602_674206 blk_5053643911633142711_669194 blk_-2636977729205236686_674664
>>>>>
>>>>> I assume the file loss is somehow related to this change and the major compaction that followed, because the same scan that is failing now was working fine yesterday, and that is the only change that happened on the cluster. Any suggestions on what to do now?
>>>>>
>>>>> Thanks.
>>>>>
>>>>> -eran
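
[Editor's note for archive readers: Stack's suggestion above, "check the master logs for this region's assignments", can be scripted. The sketch below greps "Assigning region" events for the region hash from this thread and lists the distinct destination RegionServers; more than one distinct server is the double-assignment red flag. The first sample log line is hypothetical (the thread only implies an earlier assignment to hadoop1-s05 via the "Overwriting" warning), and the real master-log path varies by installation.]

```shell
# Hypothetical master-log excerpt; in practice this text would come from
# something like:  grep 'Assigning region' /var/log/hbase/hbase-master.log
# (the first line below is invented for illustration; the second is from the thread)
LOG_LINES='2011-06-29 16:41:02,118 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. to hadoop1-s05.farm-ny.gigya.com,60020,1307349217076
2011-06-29 16:43:57,936 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. to hadoop1-s02.farm-ny.gigya.com,60020,1306919627544'

REGION=584dac5cc70d8682f71c4675a843c309

# List the distinct servers this region was assigned to. More than one
# line of output means the master handed the same region out twice.
SERVERS=$(printf '%s\n' "$LOG_LINES" \
  | grep "Assigning region" \
  | grep "$REGION" \
  | sed 's/.* to //' \
  | sort -u)
echo "$SERVERS"
```

The same pipeline works against a real log file by replacing the `printf` with `cat` on the master log; a follow-up grep for "Overwriting" on the same hash confirms whether the master itself noticed the conflict.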