Thanks for the understanding. 

Can you log a JIRA and put your ideas below into it?



On Jul 4, 2011, at 12:42 AM, Eran Kutner <e...@gigya.com> wrote:

> Thanks for the explanation Ted,
> 
> I will try to apply HBASE-3789 and hope for the best, but my understanding is
> that it doesn't really solve the problem; it only reduces the probability of
> it happening, at least in one particular scenario. I would hope for a more
> robust solution.
> My concern is that the region allocation process seems to rely too much on
> timing considerations and doesn't seem to take enough measures to guarantee
> that conflicts do not occur. I understand that in a distributed environment,
> when you don't get a timely response from a remote machine you can't know for
> sure whether it did or did not receive the request; however, there are things
> that can be done to mitigate this and reduce the conflict window
> significantly. For example, when I run hbck it knows that some regions are
> multiply assigned; the master could do the same and try to resolve the
> conflict. Another approach would be to handle late responses: even if the
> response from the remote machine arrives after it was assumed to be dead, the
> master should have enough information to know it has created a conflict by
> assigning the region to another server. An even better solution, I think, is
> for the RS to periodically verify that it is indeed the rightful owner of
> every region it holds, and to relinquish control over the region if it's not.
> Obviously a state where two RSs hold the same region is pathological and can
> lead to data loss, as demonstrated in my case. The system should be able to
> actively protect itself against such a scenario. It probably goes without
> saying, but there is really nothing worse for a data storage system than data
> loss.
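
A minimal sketch of the periodic ownership check suggested above, written
against the 0.90-era client API. The RegionHost hook and its methods are
hypothetical stand-ins for whatever the RS would actually expose; the .META.
layout (row key = region name, owner in info:server) is the real catalog
layout of that era:

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

/**
 * Periodically verifies that this region server is still the owner of
 * record (per .META.) for every region it serves, and asks the server to
 * relinquish any region that .META. says belongs to someone else.
 */
public class RegionOwnershipChore implements Runnable {
  /** Hypothetical hook into the hosting region server. */
  public interface RegionHost {
    List<byte[]> getOnlineRegionNames();   // regions this RS currently serves
    String getServerAddress();             // e.g. "host:60020"
    void relinquish(byte[] regionName);    // close the region locally
  }

  private final Configuration conf;
  private final RegionHost host;

  public RegionOwnershipChore(Configuration conf, RegionHost host) {
    this.conf = conf;
    this.host = host;
  }

  @Override
  public void run() {
    try {
      HTable meta = new HTable(conf, ".META.");  // catalog table in 0.90
      try {
        for (byte[] regionName : host.getOnlineRegionNames()) {
          // The .META. row key is the region name; info:server holds the owner.
          Result r = meta.get(new Get(regionName));
          byte[] owner = r.getValue(Bytes.toBytes("info"), Bytes.toBytes("server"));
          if (owner != null && !host.getServerAddress().equals(Bytes.toString(owner))) {
            host.relinquish(regionName);  // .META. says someone else owns it now
          }
        }
      } finally {
        meta.close();
      }
    } catch (IOException e) {
      // Transient .META. read failure: skip this round, retry next period.
    }
  }
}
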
> 
> In my case the problem didn't happen in the initial phase but after
> disabling and enabling a table with about 12K regions.
> 
> -eran
> 
> 
> 
> On Sun, Jul 3, 2011 at 23:49, Ted Yu <yuzhih...@gmail.com> wrote:
> 
>> Let me try to answer some of your questions.
>> The two paragraphs below follow my reasoning, which is in reverse order of
>> the actual call sequence.
>> 
>> For #4 below, the log indicates that the following was executed:
>> private void assign(final RegionState state, final boolean setOfflineInZK,
>>     final boolean forceNewPlan) {
>>   // Retries the assignment up to maximumAssignmentAttempts times.
>>   for (int i = 0; i < this.maximumAssignmentAttempts; i++) {
>>     // setOfflineInZooKeeper() forces the region's znode back to OFFLINE;
>>     // bail out of the assignment attempt if that fails.
>>     if (setOfflineInZK && !setOfflineInZooKeeper(state)) return;
>> 
>> The above was due to the timeout you noted in #2, which would have caused
>> TimeoutMonitor.chore() to run this code (line 1787):
>> 
>>     // Re-assign every region whose transition the monitor considers timed out.
>>     for (Map.Entry<HRegionInfo, Boolean> e: assigns.entrySet()){
>>       assign(e.getKey(), false, e.getValue());
>>     }
>> 
>> This means there is a lack of coordination between
>> AssignmentManager.TimeoutMonitor and OpenedRegionHandler.
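
To make the missing coordination concrete: the timeout fires after the RS has
already moved the znode to RS_ZK_REGION_OPENED but before OpenedRegionHandler
has processed the event, so the monitor re-offlines a region that is in fact
open. A guard along these lines in the timeout path would close that window
(a sketch only; ZkReader and shouldForceReassign are hypothetical helpers,
not the actual 0.90 code):

/** Sketch of a guard for AssignmentManager.TimeoutMonitor; not the real 0.90 code. */
final class TimeoutGuard {
  /** Hypothetical wrapper around a ZKAssign read of the unassigned znode. */
  interface ZkReader {
    String readUnassignedNodeState(String encodedRegionName);
  }

  /**
   * Before forcing a PENDING_OPEN region back to OFFLINE, re-read the
   * region's unassigned znode and skip regions the RS already OPENED.
   */
  static boolean shouldForceReassign(ZkReader zk, String encodedRegionName) {
    String nodeState = zk.readUnassignedNodeState(encodedRegionName);
    if ("RS_ZK_REGION_OPENED".equals(nodeState)) {
      // The open already succeeded; let OpenedRegionHandler process the event
      // instead of re-offlining the node and double-assigning the region.
      return false;
    }
    return true;
  }
}
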
>> 
>> The reason I mention HBASE-3789 is that it is marked as an incompatible
>> change and is in TRUNK already.
>> The application of HBASE-3789 to the 0.90 branch would change the behavior
>> (timing) of region assignment.
>> 
>> I think it makes sense to evaluate the effect of HBASE-3789 in 0.90.4.
>> 
>> BTW, were the incorrect region assignments observed for a table with
>> multiple initial regions?
>> If so, I have HBASE-4010 in TRUNK, which speeds up initial region assignment
>> by about 50%.
>> 
>> Cheers
>> 
>> On Sun, Jul 3, 2011 at 12:02 PM, Eran Kutner <e...@gigya.com> wrote:
>> 
>>> Ted,
>>> So if I understand correctly, the theory is that because of the issue
>>> fixed in HBASE-3789 the master took too long to detect that the region was
>>> successfully opened by the first server, so it force-closed it and
>>> transitioned it to a second server. But there are a few things about this
>>> scenario I don't understand, probably because I don't know enough about the
>>> inner workings of the region transition process, and I would appreciate it
>>> if you could help me understand:
>>> 1. The RS opened the region at 16:37:49.
>>> 2. The master started handling the opened event at 16:39:54 - this delay
>>> can probably be explained by HBASE-3789.
>>> 3. At 16:39:54 the master log says: Opened region gs_raw_events,..... on
>>> hadoop1-s05.farm-ny.gigya.com
>>> 4. Then at 16:40:00 the master log says: master:60000-0x13004a31d7804c4
>>> Creating (or updating) unassigned node for 584dac5cc70d8682f71c4675a843c309
>>> with OFFLINE state - why did it decide to take the region offline after
>>> learning it was successfully opened?
>>> 5. Then it tries to reopen the region on hadoop1-s05, which indicates in
>>> its log that the open request failed because the region was already open -
>>> why didn't the master use that information to learn that the region was
>>> already open?
>>> 6. At 16:43:57 the master decides the region transition timed out and
>>> starts forcing the transition - HBASE-3789 again?
>>> 7. Now the master forces the transition of the region to hadoop1-s02, but
>>> there is no sign of that on hadoop1-s05 - why doesn't the old RS
>>> (hadoop1-s05) detect that it is no longer the rightful owner and relinquish
>>> control of the region?
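
For readers following along: the handshake these questions revolve around is
a small state machine over the region's unassigned znode. Below is a
simplified model of the legal transitions during an open (a subset of the
real 0.90 event types, for illustration only, not the actual ZKAssign code):

import java.util.EnumSet;
import java.util.Set;

/** Simplified unassigned-znode states seen during a region open in 0.90. */
enum RegionZkState { M_ZK_REGION_OFFLINE, RS_ZK_REGION_OPENING, RS_ZK_REGION_OPENED }

final class RegionTransitions {
  /** Returns the states a znode may legally move to from the given state. */
  static Set<RegionZkState> legalNextStates(RegionZkState from) {
    switch (from) {
      case M_ZK_REGION_OFFLINE:
        // The master created the node OFFLINE; only the assigned RS may open.
        return EnumSet.of(RegionZkState.RS_ZK_REGION_OPENING);
      case RS_ZK_REGION_OPENING:
        // The RS "ticks" OPENING -> OPENING while working (visible in the
        // logs above), then moves to OPENED once the region is online.
        return EnumSet.of(RegionZkState.RS_ZK_REGION_OPENING,
                          RegionZkState.RS_ZK_REGION_OPENED);
      case RS_ZK_REGION_OPENED:
        // Terminal for the RS; the master's OpenedRegionHandler deletes the node.
        return EnumSet.noneOf(RegionZkState.class);
    }
    throw new AssertionError(from);
  }
}
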
>>> 
>>> Thanks.
>>> 
>>> -eran
>>> 
>>> 
>>> 
>>> On Sun, Jul 3, 2011 at 20:09, Ted Yu <yuzhih...@gmail.com> wrote:
>>> 
>>>> HBASE-3789 should have sped up region assignment.
>>>> The patch for 0.90 is attached to that JIRA.
>>>> 
>>>> You may want to apply that patch cautiously.
>>>> 
>>>> Regards
>>>> 
>>>> On Sun, Jul 3, 2011 at 10:01 AM, Eran Kutner <e...@gigya.com> wrote:
>>>> 
>>>>> Thanks Ted, but, as stated before, I'm already using 0.90.3, so either
>>>>> it's not fixed or it's not the same thing.
>>>>> 
>>>>> -eran
>>>>> 
>>>>> 
>>>>> 
>>>>> On Sun, Jul 3, 2011 at 17:27, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>> 
>>>>>> Eran:
>>>>>> I was thinking of this:
>>>>>> HBASE-3789  Cleanup the locking contention in the master
>>>>>> 
>>>>>> though it doesn't directly handle the 'PENDING_OPEN for too long' case.
>>>>>> 
>>>>>> https://issues.apache.org/jira/browse/HBASE-3741 is in 0.90.3 and is
>>>>>> actually close to the symptom you described.
>>>>>> 
>>>>>> On Sun, Jul 3, 2011 at 12:00 AM, Eran Kutner <e...@gigya.com>
>> wrote:
>>>>>> 
>>>>>>> It does seem that both servers opened the same region around the same
>>>>>>> time. The region was offline because I had disabled the table so I
>>>>>>> could change its TTL.
>>>>>>> 
>>>>>>> Here is the log from hadoop1-s05:
>>>>>>> 2011-06-29 16:37:12,576 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open region: gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.
>>>>>>> 2011-06-29 16:37:12,680 DEBUG org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Processing open of gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.
>>>>>>> 2011-06-29 16:37:12,680 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x33004a38816050b Attempting to transition node 584dac5cc70d8682f71c4675a843c309 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING
>>>>>>> 2011-06-29 16:37:12,711 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x33004a38816050b Successfully transitioned node 584dac5cc70d8682f71c4675a843c309 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING
>>>>>>> 2011-06-29 16:37:12,711 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Opening region: REGION => {NAME => 'gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.', STARTKEY => 'GSLoad_1308518553_168_WEB204', ENDKEY => 'GSLoad_1308518810_1249_WEB204', ENCODED => 584dac5cc70d8682f71c4675a843c309, TABLE => {{NAME => 'gs_raw_events', FAMILIES => [{NAME => 'events', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '1', VERSIONS => '3', COMPRESSION => 'LZO', TTL => '604800', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}}
>>>>>>> 2011-06-29 16:37:12,711 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Instantiated gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.
>>>>>>> 2011-06-29 16:37:12,847 DEBUG org.apache.hadoop.hbase.regionserver.Store: loaded hdfs://hadoop1-m1:8020/hbase/gs_raw_events/584dac5cc70d8682f71c4675a843c309/events/1971818821800304360, isReference=false, isBulkLoadResult=false, seqid=1162228062, majorCompaction=false
>>>>>>> 2011-06-29 16:37:12,848 INFO org.apache.hadoop.hbase.regionserver.HRegion: Onlined gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.; next sequenceid=1162228063
>>>>>>> 2011-06-29 16:37:12,849 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x33004a38816050b Attempting to transition node 584dac5cc70d8682f71c4675a843c309 from RS_ZK_REGION_OPENING to RS_ZK_REGION_OPENING
>>>>>>> 2011-06-29 16:37:12,875 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x33004a38816050b Successfully transitioned node 584dac5cc70d8682f71c4675a843c309 from RS_ZK_REGION_OPENING to RS_ZK_REGION_OPENING
>>>>>>> 2011-06-29 16:37:12,951 INFO org.apache.hadoop.hbase.catalog.MetaEditor: Updated row gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. in region .META.,,1 with server=hadoop1-s05.farm-ny.gigya.com:60020, startcode=1307349217076
>>>>>>> 2011-06-29 16:37:12,951 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x33004a38816050b Attempting to transition node 584dac5cc70d8682f71c4675a843c309 from RS_ZK_REGION_OPENING to RS_ZK_REGION_OPENED
>>>>>>> 2011-06-29 16:37:12,964 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x33004a38816050b Successfully transitioned node 584dac5cc70d8682f71c4675a843c309 from RS_ZK_REGION_OPENING to RS_ZK_REGION_OPENED
>>>>>>> 2011-06-29 16:37:12,964 DEBUG org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Opened gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.
>>>>>>> 2011-06-29 16:40:00,878 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open region: gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.
>>>>>>> 2011-06-29 16:40:00,878 DEBUG org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Processing open of gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.
>>>>>>> 2011-06-29 16:40:01,079 WARN org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Attempted open of gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. but already online on this server
>>>>>>> 2011-06-29 16:43:50,395 DEBUG org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction (major) requested for gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. because User-triggered major compaction; priority=1, compaction queue size=1248
>>>>>>> 2011-06-29 20:19:49,906 INFO org.apache.hadoop.hbase.regionserver.HRegion: Starting major compaction on region gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.
>>>>>>> 2011-06-29 20:19:49,906 INFO org.apache.hadoop.hbase.regionserver.Store: Started compaction of 1 file(s) in cf=events  into hdfs://hadoop1-m1:8020/hbase/gs_raw_events/584dac5cc70d8682f71c4675a843c309/.tmp, seqid=1162228062, totalSize=98.3m
>>>>>>> 2011-06-29 20:19:49,906 DEBUG org.apache.hadoop.hbase.regionserver.Store: Compacting hdfs://hadoop1-m1:8020/hbase/gs_raw_events/584dac5cc70d8682f71c4675a843c309/events/1971818821800304360, keycount=6882816, bloomtype=NONE, size=98.3m
>>>>>>> 2011-06-29 20:19:59,920 INFO org.apache.hadoop.hbase.regionserver.HRegion: completed compaction on region gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. after 10sec
>>>>>>> 
>>>>>>> And here is the one from hadoop1-s02:
>>>>>>> 2011-06-29 16:43:57,935 INFO org.apache.hadoop.hbase.regionserver.HRegionServer: Received request to open region: gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.
>>>>>>> 2011-06-29 16:43:58,990 DEBUG org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Processing open of gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.
>>>>>>> 2011-06-29 16:43:58,990 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x23004a31d8904de Attempting to transition node 584dac5cc70d8682f71c4675a843c309 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING
>>>>>>> 2011-06-29 16:43:59,002 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x23004a31d8904de Successfully transitioned node 584dac5cc70d8682f71c4675a843c309 from M_ZK_REGION_OFFLINE to RS_ZK_REGION_OPENING
>>>>>>> 2011-06-29 16:43:59,002 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Opening region: REGION => {NAME => 'gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.', STARTKEY => 'GSLoad_1308518553_168_WEB204', ENDKEY => 'GSLoad_1308518810_1249_WEB204', ENCODED => 584dac5cc70d8682f71c4675a843c309, TABLE => {{NAME => 'gs_raw_events', FAMILIES => [{NAME => 'events', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '1', VERSIONS => '3', COMPRESSION => 'LZO', TTL => '604800', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}}
>>>>>>> 2011-06-29 16:43:59,003 DEBUG org.apache.hadoop.hbase.regionserver.HRegion: Instantiated gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.
>>>>>>> 2011-06-29 16:43:59,204 DEBUG org.apache.hadoop.hbase.regionserver.Store: loaded hdfs://hadoop1-m1:8020/hbase/gs_raw_events/584dac5cc70d8682f71c4675a843c309/events/1971818821800304360, isReference=false, isBulkLoadResult=false, seqid=1162228062, majorCompaction=false
>>>>>>> 2011-06-29 16:43:59,205 INFO org.apache.hadoop.hbase.regionserver.HRegion: Onlined gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.; next sequenceid=1162228063
>>>>>>> 2011-06-29 16:43:59,205 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x23004a31d8904de Attempting to transition node 584dac5cc70d8682f71c4675a843c309 from RS_ZK_REGION_OPENING to RS_ZK_REGION_OPENING
>>>>>>> 2011-06-29 16:43:59,212 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x23004a31d8904de Successfully transitioned node 584dac5cc70d8682f71c4675a843c309 from RS_ZK_REGION_OPENING to RS_ZK_REGION_OPENING
>>>>>>> 2011-06-29 16:43:59,214 INFO org.apache.hadoop.hbase.catalog.MetaEditor: Updated row gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. in region .META.,,1 with server=hadoop1-s02.farm-ny.gigya.com:60020, startcode=1306919627544
>>>>>>> 2011-06-29 16:43:59,214 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x23004a31d8904de Attempting to transition node 584dac5cc70d8682f71c4675a843c309 from RS_ZK_REGION_OPENING to RS_ZK_REGION_OPENED
>>>>>>> 2011-06-29 16:43:59,224 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: regionserver:60020-0x23004a31d8904de Successfully transitioned node 584dac5cc70d8682f71c4675a843c309 from RS_ZK_REGION_OPENING to RS_ZK_REGION_OPENED
>>>>>>> 2011-06-29 16:43:59,224 DEBUG org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: Opened gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.
>>>>>>> java.io.IOException: Got error in response to OP_READ_BLOCK self=/10.1.104.2:33356, remote=/10.1.104.2:50010 for file /hbase/gs_raw_events/584dac5cc70d8682f71c4675a843c309/events/1971818821800304360 for block 3674866614142268536_674205
>>>>>>> 
>>>>>>> 
>>>>>>> Ted, can you please point me to J-D's bug fix you mentioned? Are you
>>>>>>> positive it's the same scenario? Data loss is a very serious problem
>>>>>>> for a DB.
>>>>>>> I'd really like to apply that patch ASAP, because when I run hbck I get
>>>>>>> over 400 regions which are multiply assigned.
>>>>>>> Last question: I understand the region's data is lost, but is there a
>>>>>>> way to at least make the table consistent again by somehow removing the
>>>>>>> lost region?
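
For what it's worth, one way to clear a dangling region whose files are
already gone is to delete its row from .META. by hand; the row key is the
full region name printed in the logs. A sketch with the standard 0.90 client
API. This is a last-resort, backup-first operation, and clients may still
cache the stale location afterwards:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class RemoveDeadRegionRow {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // The .META. row key is the full region name, as printed in the logs.
    byte[] row = Bytes.toBytes(
        "gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928."
        + "584dac5cc70d8682f71c4675a843c309.");
    HTable meta = new HTable(conf, ".META.");
    try {
      meta.delete(new Delete(row));  // drops all info: columns for the region
    } finally {
      meta.close();
    }
  }
}
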
>>>>>>> 
>>>>>>> Thanks.
>>>>>>> 
>>>>>>> -eran
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Sat, Jul 2, 2011 at 01:46, Ted Yu <yuzhih...@gmail.com>
>> wrote:
>>>>>>> 
>>>>>>>>>> 2011-06-29 16:43:57,880 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_OPEN for too long, reassigning region=gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.
>>>>>>>> 
>>>>>>>> The double assignment should have been fixed by J-D's recent checkin.
>>>>>>>> 
>>>>>>>> On Fri, Jul 1, 2011 at 3:14 PM, Stack <st...@duboce.net>
>> wrote:
>>>>>>>> 
>>>>>>>>> Is
>>>>>>>>> gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.
>>>>>>>>> the region that was having the issue?  If so, if you looked in
>>>>>>>>> hadoop1-s05's logs, was this region opened around 2011-06-29 16:43:57?
>>>>>>>>> Was it also opened on hadoop1-s02 not long after?  Did you say what
>>>>>>>>> version of hbase you are on?
>>>>>>>>> 
>>>>>>>>> St.Ack
>>>>>>>>> 
>>>>>>>>> On Fri, Jul 1, 2011 at 5:08 AM, Eran Kutner <e...@gigya.com>
>>>>> wrote:
>>>>>>>>>> Hi Stack,
>>>>>>>>>> I'm not sure what the log means. I do see references to two
>>>>>>>>>> different servers, but I assume that would also happen during a
>>>>>>>>>> normal transition. I'm using version 0.90.3.
>>>>>>>>>> Here are the relevant lines from the master logs:
>>>>>>>>>> 
>>>>>>>>>> 2011-06-19 21:39:37,164 INFO org.apache.hadoop.hbase.master.ServerManager: Received REGION_SPLIT: gs_raw_events,GSLoad_1308518553_168_WEB204,1308533691659.9000a5d8df9502efc90d2c23567e4658.: Daughters; gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309., gs_raw_events,GSLoad_1308518810_1249_WEB204,1308533970928.46f876a4e97be04edb35eb8f8959d482. from hadoop1-s05.farm-ny.gigya.com,60020,1307349217076
>>>>>>>>>> 2011-06-19 21:43:12,983 INFO org.apache.hadoop.hbase.catalog.MetaEditor: Deleted daughter reference gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309., qualifier=splitA, from parent gs_raw_events,GSLoad_1308518553_168_WEB204,1308533691659.9000a5d8df9502efc90d2c23567e4658.
>>>>>>>>>> 2011-06-29 16:29:36,143 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Starting unassignment of region gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. (offlining)
>>>>>>>>>> 2011-06-29 16:29:36,146 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to serverName=hadoop1-s05.farm-ny.gigya.com,60020,1307349217076, load=(requests=0, regions=1654, usedHeap=1870, maxHeap=12483) for region gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.
>>>>>>>>>> 2011-06-29 16:29:38,327 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling new unassigned node: /hbase/unassigned/584dac5cc70d8682f71c4675a843c309 (region=gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309., server=hadoop1-s05.farm-ny.gigya.com,60020,1307349217076, state=RS_ZK_REGION_CLOSED)
>>>>>>>>>> 2011-06-29 16:29:38,327 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_CLOSED, server=hadoop1-s05.farm-ny.gigya.com,60020,1307349217076, region=584dac5cc70d8682f71c4675a843c309
>>>>>>>>>> 2011-06-29 16:30:53,742 DEBUG org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: Handling CLOSED event for 584dac5cc70d8682f71c4675a843c309
>>>>>>>>>> 2011-06-29 16:30:53,742 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Table being disabled so deleting ZK node and removing from regions in transition, skipping assignment of region gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.
>>>>>>>>>> 2011-06-29 16:30:53,742 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x13004a31d7804c4 Deleting existing unassigned node for 584dac5cc70d8682f71c4675a843c309 that is in expected state RS_ZK_REGION_CLOSED
>>>>>>>>>> 2011-06-29 16:30:53,801 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x13004a31d7804c4 Successfully deleted unassigned node for region 584dac5cc70d8682f71c4675a843c309 in expected state RS_ZK_REGION_CLOSED
>>>>>>>>>> 2011-06-29 16:34:01,453 INFO org.apache.hadoop.hbase.catalog.MetaEditor: Updated region gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. in META
>>>>>>>>>> 2011-06-29 16:37:12,247 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x13004a31d7804c4 Creating (or updating) unassigned node for 584dac5cc70d8682f71c4675a843c309 with OFFLINE state
>>>>>>>>>> 2011-06-29 16:37:12,576 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: No previous transition plan was found (or we are ignoring an existing plan) for gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. so generated a random one; hri=gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309., src=, dest=hadoop1-s05.farm-ny.gigya.com,60020,1307349217076; 5 (online=5, exclude=null) available servers
>>>>>>>>>> 2011-06-29 16:37:12,576 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. to hadoop1-s05.farm-ny.gigya.com,60020,1307349217076
>>>>>>>>>> 2011-06-29 16:37:13,102 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENED, server=hadoop1-s05.farm-ny.gigya.com,60020,1307349217076, region=584dac5cc70d8682f71c4675a843c309
>>>>>>>>>> 2011-06-29 16:39:54,075 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Handling OPENED event for 584dac5cc70d8682f71c4675a843c309; deleting unassigned node
>>>>>>>>>> 2011-06-29 16:39:54,075 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x13004a31d7804c4 Deleting existing unassigned node for 584dac5cc70d8682f71c4675a843c309 that is in expected state RS_ZK_REGION_OPENED
>>>>>>>>>> 2011-06-29 16:39:54,192 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x13004a31d7804c4 Successfully deleted unassigned node for region 584dac5cc70d8682f71c4675a843c309 in expected state RS_ZK_REGION_OPENED
>>>>>>>>>> 2011-06-29 16:39:54,326 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. on hadoop1-s05.farm-ny.gigya.com,60020,1307349217076
>>>>>>>>>> 2011-06-29 16:40:00,598 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x13004a31d7804c4 Creating (or updating) unassigned node for 584dac5cc70d8682f71c4675a843c309 with OFFLINE state
>>>>>>>>>> 2011-06-29 16:40:00,877 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: No previous transition plan was found (or we are ignoring an existing plan) for gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. so generated a random one; hri=gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309., src=, dest=hadoop1-s05.farm-ny.gigya.com,60020,1307349217076; 5 (online=5, exclude=null) available servers
>>>>>>>>>> 2011-06-29 16:40:00,877 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. to hadoop1-s05.farm-ny.gigya.com,60020,1307349217076
>>>>>>>>>> 2011-06-29 16:43:57,879 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. state=PENDING_OPEN, ts=1309380052723
>>>>>>>>>> 2011-06-29 16:43:57,880 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_OPEN for too long, reassigning region=gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.
>>>>>>>>>> 2011-06-29 16:43:57,936 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Forcing OFFLINE; was=gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. state=PENDING_OPEN, ts=1309380052723
>>>>>>>>>> 2011-06-29 16:43:57,936 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: No previous transition plan was found (or we are ignoring an existing plan) for gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. so generated a random one; hri=gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309., src=, dest=hadoop1-s02.farm-ny.gigya.com,60020,1306919627544; 5 (online=5, exclude=null) available servers
>>>>>>>>>> 2011-06-29 16:43:57,936 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Assigning region gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. to hadoop1-s02.farm-ny.gigya.com,60020,1306919627544
>>>>>>>>>> 2011-06-29 16:43:59,022 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=hadoop1-s02.farm-ny.gigya.com,60020,1306919627544, region=584dac5cc70d8682f71c4675a843c309
>>>>>>>>>> 2011-06-29 16:43:59,221 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENING, server=hadoop1-s02.farm-ny.gigya.com,60020,1306919627544, region=584dac5cc70d8682f71c4675a843c309
>>>>>>>>>> 2011-06-29 16:43:59,226 DEBUG org.apache.hadoop.hbase.master.AssignmentManager: Handling transition=RS_ZK_REGION_OPENED, server=hadoop1-s02.farm-ny.gigya.com,60020,1306919627544, region=584dac5cc70d8682f71c4675a843c309
>>>>>>>>>> 2011-06-29 16:43:59,274 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Handling OPENED event for 584dac5cc70d8682f71c4675a843c309; deleting unassigned node
>>>>>>>>>> 2011-06-29 16:43:59,274 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x13004a31d7804c4 Deleting existing unassigned node for 584dac5cc70d8682f71c4675a843c309 that is in expected state RS_ZK_REGION_OPENED
>>>>>>>>>> 2011-06-29 16:43:59,296 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: master:60000-0x13004a31d7804c4 Successfully deleted unassigned node for region 584dac5cc70d8682f71c4675a843c309 in expected state RS_ZK_REGION_OPENED
>>>>>>>>>> 2011-06-29 16:43:59,375 WARN org.apache.hadoop.hbase.master.AssignmentManager: Overwriting 584dac5cc70d8682f71c4675a843c309 on serverName=hadoop1-s05.farm-ny.gigya.com,60020,1307349217076, load=(requests=0, regions=1273, usedHeap=2676, maxHeap=12483)
>>>>>>>>>> 2011-06-29 16:43:59,375 DEBUG org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Opened region gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. on hadoop1-s02.farm-ny.gigya.com,60020,1306919627544
>>>>>>>>>> 
>>>>>>>>>> Thanks.
>>>>>>>>>> 
>>>>>>>>>> -eran
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Fri, Jul 1, 2011 at 09:05, Stack <st...@duboce.net>
>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> So, Eran, it seems as though two RegionServers were carrying the
>>>>>>>>>>> region?  One deleted a file (compaction on its side)?  Can you
>>>>>>>>>>> figure out if indeed two servers had the same region?  (Check the
>>>>>>>>>>> master logs for this region's assignments.)
>>>>>>>>>>> 
>>>>>>>>>>> What version of hbase?
>>>>>>>>>>> 
>>>>>>>>>>> St.Ack
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Thu, Jun 30, 2011 at 3:58 AM, Eran Kutner <
>>> e...@gigya.com>
>>>>>>> wrote:
>>>>>>>>>>>> Hi,
>>>>>>>>>>>> I have a cluster of 5 nodes with one large table that currently
>>>>>>>>>>>> has around 12000 regions. Everything was working fine for a
>>>>>>>>>>>> relatively long time, until now.
>>>>>>>>>>>> Yesterday I significantly reduced the TTL on the table and
>>>>>>>>>>>> initiated a major compaction. This should have reduced the table
>>>>>>>>>>>> size to about 20% of its original size.
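
For reference, the TTL change described above amounts to roughly this
sequence in the 0.90-era Java client (a sketch; the table and family names
come from the logs, but the exact HBaseAdmin.modifyColumn signature is an
assumption and varies between releases):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class ReduceTtl {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    // A family descriptor matching the one in the logs, but with a smaller
    // TTL (in seconds); expired data is dropped at the next major compaction.
    HColumnDescriptor events = new HColumnDescriptor("events");
    events.setTimeToLive(604800);  // pick the new, lower value here

    admin.disableTable("gs_raw_events");                  // schema changes need
    admin.modifyColumn("gs_raw_events", "events", events); // the table offline
    admin.enableTable("gs_raw_events");                   // in 0.90
    admin.majorCompact("gs_raw_events");  // actually removes the expired data
  }
}
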
>>>>>>>>>>>> Today, I'm getting errors of inaccessible files on HDFS, for
>>>>>>>>>>>> example:
>>>>>>>>>>>> java.io.IOException: Got error in response to OP_READ_BLOCK self=/10.1.104.2:58047, remote=/10.1.104.2:50010 for file /hbase/gs_raw_events/584dac5cc70d8682f71c4675a843c309/events/1971818821800304360 for block 3674866614142268536_674205
>>>>>>>>>>>>         at org.apache.hadoop.hdfs.DFSClient$BlockReader.newBlockReader(DFSClient.java:1487)
>>>>>>>>>>>>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1811)
>>>>>>>>>>>>         at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1948)
>>>>>>>>>>>>         at java.io.DataInputStream.read(DataInputStream.java:132)
>>>>>>>>>>>>         at org.apache.hadoop.hbase.io.hfile.BoundedRangeFileInputStream.read(BoundedRangeFileInputStream.java:105)
>>>>>>>>>>>>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>>>>>>>>>>>>         at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
>>>>>>>>>>>>         at org.apache.hadoop.io.compress.BlockDecompressorStream.rawReadInt(BlockDecompressorStream.java:128)
>>>>>>>>>>>>         at org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:68)
>>>>>>>>>>>>         at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:75)
>>>>>>>>>>>>         at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
>>>>>>>>>>>>         at java.io.BufferedInputStream.read(BufferedInputStream.java:317)
>>>>>>>>>>>>         at org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:102)
>>>>>>>>>>>>         at org.apache.hadoop.hbase.io.hfile.HFile$Reader.decompress(HFile.java:1094)
>>>>>>>>>>>>         at org.apache.hadoop.hbase.io.hfile.HFile$Reader.readBlock(HFile.java:1036)
>>>>>>>>>>>>         at org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.seekTo(HFile.java:1433)
>>>>>>>>>>>>         at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:139)
>>>>>>>>>>>>         at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:96)
>>>>>>>>>>>>         at org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:77)
>>>>>>>>>>>>         at org.apache.hadoop.hbase.regionserver.Store.getScanner(Store.java:1341)
>>>>>>>>>>>>         at org.apache.hadoop.hbase.regionserver.HRegion$RegionScanner.<init>(HRegion.java:2269)
>>>>>>>>>>>>         at org.apache.hadoop.hbase.regionserver.HRegion.instantiateInternalScanner(HRegion.java:1126)
>>>>>>>>>>>>         at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1118)
>>>>>>>>>>>>         at org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1102)
>>>>>>>>>>>>         at org.apache.hadoop.hbase.regionserver.HRegionServer.openScanner(HRegionServer.java:1781)
>>>>>>>>>>>>         at sun.reflect.GeneratedMethodAccessor46.invoke(Unknown Source)
>>>>>>>>>>>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>>>>>>>>>         at java.lang.reflect.Method.invoke(Method.java:597)
>>>>>>>>>>>>         at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570)
>>>>>>>>>>>>         at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039)
>>>>>>>>>>>> 
>>>>>>>>>>>> I checked, and the file indeed doesn't exist on HDFS. Here are the
>>>>>>>>>>>> name node logs for this block; apparently it was deleted:
>>>>>>>>>>>> 2011-06-19 21:39:36,651 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.allocateBlock: /hbase/gs_raw_events/584dac5cc70d8682f71c4675a843c309/.tmp/2096863423111131624. blk_3674866614142268536_674205
>>>>>>>>>>>> 2011-06-19 21:40:11,954 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.1.104.2:50010 is added to blk_3674866614142268536_674205 size 67108864
>>>>>>>>>>>> 2011-06-19 21:40:11,954 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.1.104.3:50010 is added to blk_3674866614142268536_674205 size 67108864
>>>>>>>>>>>> 2011-06-19 21:40:11,955 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.1.104.5:50010 is added to blk_3674866614142268536_674205 size 67108864
>>>>>>>>>>>> 2011-06-29 20:20:01,662 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 10.1.104.2:50010 to delete blk_3674866614142268536_674205
>>>>>>>>>>>> 2011-06-29 20:20:13,671 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 10.1.104.5:50010 to delete blk_-4056387895369608597_675174
>>>>>>>>>>>> blk_-5017882805850873821_672281 blk_702373987100607684_672288 blk_-5357157478043290010_668506 blk_7118175133735412789_674903
>>>>>>>>>>>> blk_-3569812563715986384_675231 blk_8296855057240604851_669285 blk_-6483679172530609101_674268 blk_8738539715363739108_673682
>>>>>>>>>>>> blk_1744841904626813502_675238 blk_-6035315106100051103_674266 blk_-1789501623010070237_674908 blk_1944054629336265129_673689
>>>>>>>>>>>> blk_3674866614142268536_674205 blk_7930425446738143892_647410 blk_-3007186753042268449_669296 blk_-5482302621772778061_647416
>>>>>>>>>>>> blk_-3765735404924932181_672004 blk_7476090998956811081_675169 blk_7862291659285127712_646890 blk_-2666244746343584727_672013
>>>>>>>>>>>> blk_6039172613960915602_674206 blk_-8470884397893086564_646899 blk_4558230221166712802_668510
>>>>>>>>>>>> 2011-06-29 20:20:46,698 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 10.1.104.3:50010 to delete blk_-7851606440036350812_671552
>>>>>>>>>>>> blk_9214649160203453845_647566 blk_702373987100607684_672288 blk_5958099369749234073_668143 blk_-5172218034084903173_673109
>>>>>>>>>>>> blk_-2934555181472719276_646476 blk_-1409986679370073931_672552 blk_-2786034325506235869_669086 blk_3674866614142268536_674205
>>>>>>>>>>>> blk_510158930393283118_673225 blk_916244738216205237_677068 blk_-4317027806407316617_670379 blk_8555705688850972639_673485
>>>>>>>>>>>> blk_-3765735404924932181_672004 blk_-5482302621772778061_647416 blk_-2461801145731752623_674605 blk_-8737702908048998927_672549
>>>>>>>>>>>> blk_-8470884397893086564_646899 blk_4558230221166712802_668510 blk_-4056387895369608597_675174 blk_-8675430610673886073_647695
>>>>>>>>>>>> blk_-6642870230256028318_668211 blk_-3890408516362176771_677483 blk_-3569812563715986384_675231 blk_-5007142629771321873_674548
>>>>>>>>>>>> blk_-3345355191863431669_667066 blk_8296855057240604851_669285 blk_-6595462308187757470_672420 blk_-2583945228783203947_674607
>>>>>>>>>>>> blk_-346988625120916345_677063 blk_4449525876338684218_674496 blk_2617172363857549730_668201 blk_8738539715363739108_673682
>>>>>>>>>>>> blk_-208904675456598428_679286 blk_-497549341281882641_646477 blk_-6035315106100051103_674266 blk_-2356539038067297411_672388
>>>>>>>>>>>> blk_-3881703084497103249_668137 blk_2214397881104950315_646643 blk_-5907671443455357710_673223 blk_-2431880309956605679_669204
>>>>>>>>>>>> blk_6039172613960915602_674206 blk_5053643911633142711_669194 blk_-2636977729205236686_674664
>>>>>>>>>>>> 
>>>>>>>>>>>> I assume the file loss is somehow related to this change and the
>>>>>>>>>>>> major compaction that followed, because the same scan that is
>>>>>>>>>>>> failing now was working fine yesterday, and that is the only
>>>>>>>>>>>> change that happened on the cluster. Any suggestions what to do
>>>>>>>>>>>> now?
>>>>>>>>>>>> 
>>>>>>>>>>>> Thanks.
>>>>>>>>>>>> 
>>>>>>>>>>>> -eran
>>>>>>>>>>>> 
