Ted, So if I understand correctly the the theory is that because of the issue fixed in HBASE-3789 the master took too long to detect that the region was successfully opened by the first server so it forced closed it and transitioned to a second server, but there are a few things about this scenario I don't understand, probably because I don't know enough about the inner workings of the region transition process and would appreciate it if you can help me understand: 1. The RS opened the region at 16:37:49. 2. The master started handling the opened event at 16:39:54 - this delay can probably be explained by HBASE-3789 3. At 16:39:54 the master log says: Opened region gs_raw_events,..... on hadoop1-s05.farm-ny.gigya.com 4. Then at 16:40:00 the master log says: master:60000-0x13004a31d7804c4 Creating (or updating) unassigned node for 584dac5cc70d8682f71c4675a843c3 09 with OFFLINE state - why did it decide to take the region offline after learning it was successfully opened? 5. Then it tries to reopen the region on hadoop1-s05, which indicates in its log that the open request failed because the region was already open - why didn't the master use that information to learn that the region was already open? 6. At 16:43:57 the master decides the region transition timed out and starts forcing the transition - HBASE-3789 again? 7. Now the master forces the transition of the region to hadoop1-s02 but there is no sign of that on hadoop1-s05 - why doesn't the old RS (hadoop1-s05) detect that it is no longer the master and relinquishes control of the region?
Thanks. -eran On Sun, Jul 3, 2011 at 20:09, Ted Yu <yuzhih...@gmail.com> wrote: > HBASE-3789 should have sped up region assignment. > The patch for 0.90 is attached to that JIRA. > > You may prudently apply that patch. > > Regards > > On Sun, Jul 3, 2011 at 10:01 AM, Eran Kutner <e...@gigya.com> wrote: > > > Thanks Ted, but, as stated before, I'm already using 0.90.3, so either > it's > > not fixed or it's not the same thing. > > > > -eran > > > > > > > > On Sun, Jul 3, 2011 at 17:27, Ted Yu <yuzhih...@gmail.com> wrote: > > > > > Eran: > > > I was thinking of this: > > > HBASE-3789 Cleanup the locking contention in the master > > > > > > though it doesn't directly handle 'PENDING_OPEN for too long' case. > > > > > > https://issues.apache.org/jira/browse/HBASE-3741 is in 0.90.3 and > > actually > > > close to the symptom you described. > > > > > > On Sun, Jul 3, 2011 at 12:00 AM, Eran Kutner <e...@gigya.com> wrote: > > > > > > > It does seem that both servers opened the same region around the same > > > time. > > > > The region was offline because I disabled the table so I can change > its > > > > TTL. > > > > > > > > Here is the log from haddop1-s05 : > > > > 2011-06-29 16:37:12,576 INFO > > > > org.apache.hadoop.hbase.regionserver.HRegionServer: Received request > to > > > > open > > > > region: > > > > > > > > > > > > > > gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. > > > > 2011-06-29 16:37:12,680 DEBUG > > > > org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: > > > Processing > > > > open of > > > > > > > > > > > > > > gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. > > > > 2011-06-29 16:37:12,680 DEBUG > > org.apache.hadoop.hbase.zookeeper.ZKAssign: > > > > regionserver:60020-0x33004a38816050b Attempting to transition node > > > > 584dac5cc70d8682f71c4675a843c309 from M_ZK_REGION_OFFLINE to > > > > RS_ZK_REGION_OPENING > > > > 2011-06-29 16:37:12,711 DEBUG > > org.apache.hadoop.hbase.zookeeper.ZKAssign: > > > > regionserver:60020-0x33004a38816050b Successfully transitioned node > > > > 584dac5cc70d8682f71c4675a843c309 from M_ZK_REGION_OFFLINE to > > > > RS_ZK_REGION_OPENING > > > > 2011-06-29 16:37:12,711 DEBUG > > > org.apache.hadoop.hbase.regionserver.HRegion: > > > > Opening region: REGION => {NAME => > > > > > > > > > > > > > > 'gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.', > > > > STARTKEY => 'GSLoad_1308518553_168_WEB204', ENDKEY => > > > > 'GSLoad_1308518810_1249_WEB204', ENCODED => > > > > 584dac5cc70d8682f71c4675a843c309, TABLE => {{NAME => 'gs_raw_events', > > > > FAMILIES => [{NAME => 'events', BLOOMFILTER => 'NONE', > > REPLICATION_SCOPE > > > => > > > > '1', VERSIONS => '3', COMPRESSION => 'LZO', TTL => '604800', > BLOCKSIZE > > => > > > > '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}} > > > > 2011-06-29 16:37:12,711 DEBUG > > > org.apache.hadoop.hbase.regionserver.HRegion: > > > > Instantiated > > > > > > > > > > > > > > gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. > > > > 2011-06-29 16:37:12,847 DEBUG > > org.apache.hadoop.hbase.regionserver.Store: > > > > loaded > > > > > > > > > > > > > > hdfs://hadoop1-m1:8020/hbase/gs_raw_events/584dac5cc70d8682f71c4675a843c309/events/1971818821800304360, > > > > isReference=false, isBulkLoadResult=false, seqid=1162228062, > > > > majorCompaction=false > > > > 2011-06-29 16:37:12,848 INFO > > > org.apache.hadoop.hbase.regionserver.HRegion: > > > > Onlined > > > > > > > > > > > > > > gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.; > > > > next sequenceid=1162228063 > > > > 2011-06-29 16:37:12,849 DEBUG > > org.apache.hadoop.hbase.zookeeper.ZKAssign: > > > > regionserver:60020-0x33004a38816050b Attempting to transition node > > > > 584dac5cc70d8682f71c4675a843c309 from RS_ZK_REGION_OPENING to > > > > RS_ZK_REGION_OPENING > > > > 2011-06-29 16:37:12,875 DEBUG > > org.apache.hadoop.hbase.zookeeper.ZKAssign: > > > > regionserver:60020-0x33004a38816050b Successfully transitioned node > > > > 584dac5cc70d8682f71c4675a843c309 from RS_ZK_REGION_OPENING to > > > > RS_ZK_REGION_OPENING > > > > 2011-06-29 16:37:12,951 INFO > > org.apache.hadoop.hbase.catalog.MetaEditor: > > > > Updated row > > > > > > > > > > > > > > gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. > > > > in region .META.,,1 with server=hadoop1-s05.farm-ny.gigya.com:60020, > > > > startcode=1307349217076 > > > > 2011-06-29 16:37:12,951 DEBUG > > org.apache.hadoop.hbase.zookeeper.ZKAssign: > > > > regionserver:60020-0x33004a38816050b Attempting to transition node > > > > 584dac5cc70d8682f71c4675a843c309 from RS_ZK_REGION_OPENING to > > > > RS_ZK_REGION_OPENED > > > > 2011-06-29 16:37:12,964 DEBUG > > org.apache.hadoop.hbase.zookeeper.ZKAssign: > > > > regionserver:60020-0x33004a38816050b Successfully transitioned node > > > > 584dac5cc70d8682f71c4675a843c309 from RS_ZK_REGION_OPENING to > > > > RS_ZK_REGION_OPENED > > > > 2011-06-29 16:37:12,964 DEBUG > > > > org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: > Opened > > > > > > > > > > > > > > gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. > > > > 2011-06-29 16:40:00,878 INFO > > > > org.apache.hadoop.hbase.regionserver.HRegionServer: Received request > to > > > > open > > > > region: > > > > > > > > > > > > > > gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. > > > > 2011-06-29 16:40:00,878 DEBUG > > > > org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: > > > Processing > > > > open of > > > > > > > > > > > > > > gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. > > > > 2011-06-29 16:40:01,079 WARN > > > > org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: > > Attempted > > > > open of > > > > > > > > > > > > > > gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. > > > > but already online on this server > > > > 2011-06-29 16:43:50,395 DEBUG > > > > org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction > > > (major) > > > > requested for > > > > > > > > > > > > > > gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. > > > > because User-triggered major compaction; priority=1, compaction queue > > > > size=1248 > > > > 2011-06-29 20:19:49,906 INFO > > > org.apache.hadoop.hbase.regionserver.HRegion: > > > > Starting major compaction on region > > > > > > > > > > > > > > gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. > > > > 2011-06-29 20:19:49,906 INFO > > org.apache.hadoop.hbase.regionserver.Store: > > > > Started compaction of 1 file(s) in cf=events into > > > > > > > > > > > > > > hdfs://hadoop1-m1:8020/hbase/gs_raw_events/584dac5cc70d8682f71c4675a843c309/.tmp, > > > > seqid=1162228062, totalSize=98.3m > > > > 2011-06-29 20:19:49,906 DEBUG > > org.apache.hadoop.hbase.regionserver.Store: > > > > Compacting > > > > > > > > > > > > > > hdfs://hadoop1-m1:8020/hbase/gs_raw_events/584dac5cc70d8682f71c4675a843c309/events/1971818821800304360, > > > > keycount=6882816, bloomtype=NONE, size=98.3m > > > > 2011-06-29 20:19:59,920 INFO > > > org.apache.hadoop.hbase.regionserver.HRegion: > > > > completed compaction on region > > > > > > > > > > > > > > gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. > > > > after 10sec > > > > > > > > And here is the one from hadoop1-s02: > > > > 2011-06-29 16:43:57,935 INFO > > > > org.apache.hadoop.hbase.regionserver.HRegionServer: Received request > to > > > > open > > > > region: > > > > > > > > > > > > > > gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. > > > > 2011-06-29 16:43:58,990 DEBUG > > > > org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: > > > Processing > > > > open of > > > > > > > > > > > > > > gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. > > > > 2011-06-29 16:43:58,990 DEBUG > > org.apache.hadoop.hbase.zookeeper.ZKAssign: > > > > regionserver:60020-0x23004a31d8904de Attempting to transition node > > > > 584dac5cc70d8682f71c4675a843c309 from M_ZK_REGION_OFFLINE to > > > > RS_ZK_REGION_OPENING > > > > 2011-06-29 16:43:59,002 DEBUG > > org.apache.hadoop.hbase.zookeeper.ZKAssign: > > > > regionserver:60020-0x23004a31d8904de Successfully transitioned node > > > > 584dac5cc70d8682f71c4675a843c309 from M_ZK_REGION_OFFLINE to > > > > RS_ZK_REGION_OPENING > > > > 2011-06-29 16:43:59,002 DEBUG > > > org.apache.hadoop.hbase.regionserver.HRegion: > > > > Opening region: REGION => {NAME => > > > > > > > > > > > > > > 'gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.', > > > > STARTKEY => 'GSLoad_1308518553_168_WEB204', ENDKEY => > > > > 'GSLoad_1308518810_1249_WEB204', ENCODED => > > > > 584dac5cc70d8682f71c4675a843c309, TABLE => {{NAME => 'gs_raw_events', > > > > FAMILIES => [{NAME => 'events', BLOOMFILTER => 'NONE', > > REPLICATION_SCOPE > > > => > > > > '1', VERSIONS => '3', COMPRESSION => 'LZO', TTL => '604800', > BLOCKSIZE > > => > > > > '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}} > > > > 2011-06-29 16:43:59,003 DEBUG > > > org.apache.hadoop.hbase.regionserver.HRegion: > > > > Instantiated > > > > > > > > > > > > > > gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. > > > > 2011-06-29 16:43:59,204 DEBUG > > org.apache.hadoop.hbase.regionserver.Store: > > > > loaded > > > > > > > > > > > > > > hdfs://hadoop1-m1:8020/hbase/gs_raw_events/584dac5cc70d8682f71c4675a843c309/events/1971818821800304360, > > > > isReference=false, isBulkLoadResult=false, seqid=1162228062, > > > > majorCompaction=false > > > > 2011-06-29 16:43:59,205 INFO > > > org.apache.hadoop.hbase.regionserver.HRegion: > > > > Onlined > > > > > > > > > > > > > > gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309.; > > > > next sequenceid=1162228063 > > > > 2011-06-29 16:43:59,205 DEBUG > > org.apache.hadoop.hbase.zookeeper.ZKAssign: > > > > regionserver:60020-0x23004a31d8904de Attempting to transition node > > > > 584dac5cc70d8682f71c4675a843c309 from RS_ZK_REGION_OPENING to > > > > RS_ZK_REGION_OPENING > > > > 2011-06-29 16:43:59,212 DEBUG > > org.apache.hadoop.hbase.zookeeper.ZKAssign: > > > > regionserver:60020-0x23004a31d8904de Successfully transitioned node > > > > 584dac5cc70d8682f71c4675a843c309 from RS_ZK_REGION_OPENING to > > > > RS_ZK_REGION_OPENING > > > > 2011-06-29 16:43:59,214 INFO > > org.apache.hadoop.hbase.catalog.MetaEditor: > > > > Updated row > > > > > > > > > > > > > > gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. > > > > in region .META.,,1 with server=hadoop1-s02.farm-ny.gigya.com:60020, > > > > startcode=1306919627544 > > > > 2011-06-29 16:43:59,214 DEBUG > > org.apache.hadoop.hbase.zookeeper.ZKAssign: > > > > regionserver:60020-0x23004a31d8904de Attempting to transition node > > > > 584dac5cc70d8682f71c4675a843c309 from RS_ZK_REGION_OPENING to > > > > RS_ZK_REGION_OPENED > > > > 2011-06-29 16:43:59,224 DEBUG > > org.apache.hadoop.hbase.zookeeper.ZKAssign: > > > > regionserver:60020-0x23004a31d8904de Successfully transitioned node > > > > 584dac5cc70d8682f71c4675a843c309 from RS_ZK_REGION_OPENING to > > > > RS_ZK_REGION_OPENED > > > > 2011-06-29 16:43:59,224 DEBUG > > > > org.apache.hadoop.hbase.regionserver.handler.OpenRegionHandler: > Opened > > > > > > > > > > > > > > gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. > > > > java.io.IOException: Got error in response to OP_READ_BLOCK self=/ > > > > 10.1.104.2:33356, remote=/10.1.104.2:50010 for file > > > > > > > > > > > > > > /hbase/gs_raw_events/584dac5cc70d8682f71c4675a843c309/events/1971818821800304360 > > > > for block 3674866614142268536_674205 > > > > > > > > > > > > Ted, can you please point me to J-D's bug fix you mentioned? Are you > > > > positive it's the same scenario - data loss is a very serious problem > > for > > > a > > > > DB. > > > > I'd really like to apply that patch ASAP, because when I run hbck I > get > > > > over > > > > 400 regions which are multiply assigned. > > > > Last question, I understand the region's data is lost but is there a > > way > > > to > > > > at least make the table consistent again by some how removing the > lost > > > > region? > > > > > > > > Thanks. > > > > > > > > -eran > > > > > > > > > > > > > > > > On Sat, Jul 2, 2011 at 01:46, Ted Yu <yuzhih...@gmail.com> wrote: > > > > > > > > > >> 2011-06-29 16:43:57,880 INFO > > > > > org.apache.hadoop.hbase. > > > > > master.AssignmentManager: Region has been > > > > > PENDING_OPEN for too long, reassigning > > > > > > > > > > > > > > > > > > > > region=gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. > > > > > > > > > > The double assignment should have been fixed by J-D's recent > checkin. > > > > > > > > > > On Fri, Jul 1, 2011 at 3:14 PM, Stack <st...@duboce.net> wrote: > > > > > > > > > > > Is > > > > > > > > > > > > > > > > > > > > > gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. > > > > > > the region that was having the issue? If so, if you looked in > > > > > > hadoop1-s05's logs, was this region opened around 2011-06-29 > > > 16:43:57? > > > > > > Was it also opened hadoop1-s02 not long after? Did you say what > > > > > > version of hbase you are on? > > > > > > > > > > > > St.Ack > > > > > > > > > > > > On Fri, Jul 1, 2011 at 5:08 AM, Eran Kutner <e...@gigya.com> > > wrote: > > > > > > > Hi Stack, > > > > > > > I'm not sure what the log means. I do see references to two > > > different > > > > > > > servers, but that would probably happen if there was normal > > > > transition > > > > > I > > > > > > > assume. I'm using version 0.90.3 > > > > > > > Here are the relevant lines from the master logs: > > > > > > > > > > > > > > 2011-06-19 21:39:37,164 INFO > > > > > > org.apache.hadoop.hbase.master.ServerManager: > > > > > > > Received REGION_SPLIT: > > > > > > > > > > > > > > > > > > > > > > > > > > > > gs_raw_events,GSLoad_1308518553_168_WEB204,1308533691659.9000a5d8df9502efc90d2c23567e4658.: > > > > > > > Daughters; > > > > > > > > > > > > > > > > > > > > > > > > > > > > gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309., > > > > > > > > > > > > > > > > > > > > > > > > > > > > gs_raw_events,GSLoad_1308518810_1249_WEB204,1308533970928.46f876a4e97be04edb35eb8f8959d482. > > > > > > > from hadoop1-s05.farm-ny.gigya.com,60020,1307349217076 > > > > > > > 2011-06-19 21:43:12,983 INFO > > > > > org.apache.hadoop.hbase.catalog.MetaEditor: > > > > > > > Deleted daughter reference > > > > > > > > > > > > > > > > > > > > > > > > > > > > gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309., > > > > > > > qualifier=splitA, from parent > > > > > > > > > > > > > > > > > > > > > > > > > > > > gs_raw_events,GSLoad_1308518553_168_WEB204,1308533691659.9000a5d8df9502efc90d2c23567e4658. > > > > > > > 2011-06-29 16:29:36,143 DEBUG > > > > > > > org.apache.hadoop.hbase.master.AssignmentManager: Starting > > > > unassignment > > > > > > of > > > > > > > region > > > > > > > > > > > > > > > > > > > > > > > > > > > > gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. > > > > > > > (offlining) > > > > > > > 2011-06-29 16:29:36,146 DEBUG > > > > > > > org.apache.hadoop.hbase.master.AssignmentManager: Sent CLOSE to > > > > > > serverName= > > > > > > > hadoop1-s05.farm-ny.gigya.com,60020,1307349217076, > > > load=(requests=0, > > > > > > > regions=1654, usedHeap=1870, maxHeap=12483) for region > > > > > > > > > > > > > > > > > > > > > > > > > > > > gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. > > > > > > > 2011-06-29 16:29:38,327 DEBUG > > > > > > > org.apache.hadoop.hbase.master.AssignmentManager: Handling new > > > > > unassigned > > > > > > > node: /hbase/unassigned/584dac5cc70d8682f71c4675a843c309 > > > > > > > > > > > > > > > > > > > > > > > > > > > > (region=gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309., > > > > > > > server=hadoop1-s05.farm-ny.gigya.com,60020,1307349217076, > > > > > > > state=RS_ZK_REGION_CLOSED) > > > > > > > 2011-06-29 16:29:38,327 DEBUG > > > > > > > org.apache.hadoop.hbase.master.AssignmentManager: Handling > > > > > > > transition=RS_ZK_REGION_CLOSED, > > > > > > > server=hadoop1-s05.farm-ny.gigya.com,60020,1307349217076, > > > > > > > region=584dac5cc70d8682f71c4675a843c309 > > > > > > > 2011-06-29 16:30:53,742 DEBUG > > > > > > > org.apache.hadoop.hbase.master.handler.ClosedRegionHandler: > > > Handling > > > > > > CLOSED > > > > > > > event for 584dac5cc70d8682f71c4675a843c309 > > > > > > > 2011-06-29 16:30:53,742 DEBUG > > > > > > > org.apache.hadoop.hbase.master.AssignmentManager: Table being > > > > disabled > > > > > so > > > > > > > deleting ZK node and removing from regions in transition, > > skipping > > > > > > > assignment of region > > > > > > > > > > > > > > > > > > > > > > > > > > > > gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. > > > > > > > 2011-06-29 16:30:53,742 DEBUG > > > > > org.apache.hadoop.hbase.zookeeper.ZKAssign: > > > > > > > master:60000-0x13004a31d7804c4 Deleting existing unassigned > node > > > for > > > > > > > 584dac5cc70d8682f71c4675a843c309 that is in expected state > > > > > > > RS_ZK_REGION_CLOSED > > > > > > > 2011-06-29 16:30:53,801 DEBUG > > > > > org.apache.hadoop.hbase.zookeeper.ZKAssign: > > > > > > > master:60000-0x13004a31d7804c4 Successfully deleted unassigned > > node > > > > for > > > > > > > region 584dac5cc70d8682f71c4675a843c309 in expected state > > > > > > > RS_ZK_REGION_CLOSED > > > > > > > 2011-06-29 16:34:01,453 INFO > > > > > org.apache.hadoop.hbase.catalog.MetaEditor: > > > > > > > Updated region > > > > > > > > > > > > > > > > > > > > > > > > > > > > gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. > > > > > > > in META > > > > > > > 2011-06-29 16:37:12,247 DEBUG > > > > > org.apache.hadoop.hbase.zookeeper.ZKAssign: > > > > > > > master:60000-0x13004a31d7804c4 Creating (or updating) > unassigned > > > node > > > > > for > > > > > > > 584dac5cc70d8682f71c4675a843c309 with OFFLINE state > > > > > > > 2011-06-29 16:37:12,576 DEBUG > > > > > > > org.apache.hadoop.hbase.master.AssignmentManager: No previous > > > > > transition > > > > > > > plan was found (or we are ignoring an existing plan) for > > > > > > > > > > > > > > > > > > > > > > > > > > > > gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. > > > > > > > so generated a random one; > > > > > > > > > > > > > > > > > > > > > > > > > > > > hri=gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309., > > > > > > > src=, dest=hadoop1-s05.farm-ny.gigya.com,60020,1307349217076; > 5 > > > > > > (online=5, > > > > > > > exclude=null) available servers > > > > > > > 2011-06-29 16:37:12,576 DEBUG > > > > > > > org.apache.hadoop.hbase.master.AssignmentManager: Assigning > > region > > > > > > > > > > > > > > > > > > > > > > > > > > > > gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. > > > > > > > to hadoop1-s05.farm-ny.gigya.com,60020,1307349217076 > > > > > > > 2011-06-29 16:37:13,102 DEBUG > > > > > > > org.apache.hadoop.hbase.master.AssignmentManager: Handling > > > > > > > transition=RS_ZK_REGION_OPENED, > > > > > > > server=hadoop1-s05.farm-ny.gigya.com,60020,1307349217076, > > > > > > > region=584dac5cc70d8682f71c4675a843c309 > > > > > > > 2011-06-29 16:39:54,075 DEBUG > > > > > > > org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: > > > Handling > > > > > > OPENED > > > > > > > event for 584dac5cc70d8682f71c4675a843c309; deleting unassigned > > > node > > > > > > > 2011-06-29 16:39:54,075 DEBUG > > > > > org.apache.hadoop.hbase.zookeeper.ZKAssign: > > > > > > > master:60000-0x13004a31d7804c4 Deleting existing unassigned > node > > > for > > > > > > > 584dac5cc70d8682f71c4675a843c309 that is in expected state > > > > > > > RS_ZK_REGION_OPENED > > > > > > > 2011-06-29 16:39:54,192 DEBUG > > > > > org.apache.hadoop.hbase.zookeeper.ZKAssign: > > > > > > > master:60000-0x13004a31d7804c4 Successfully deleted unassigned > > node > > > > for > > > > > > > region 584dac5cc70d8682f71c4675a843c309 in expected state > > > > > > > RS_ZK_REGION_OPENED > > > > > > > 2011-06-29 16:39:54,326 DEBUG > > > > > > > org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: > > Opened > > > > > region > > > > > > > > > > > > > > > > > > > > > > > > > > > > gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. > > > > > > > on hadoop1-s05.farm-ny.gigya.com,60020,1307349217076 > > > > > > > 2011-06-29 16:40:00,598 DEBUG > > > > > org.apache.hadoop.hbase.zookeeper.ZKAssign: > > > > > > > master:60000-0x13004a31d7804c4 Creating (or updating) > unassigned > > > node > > > > > for > > > > > > > 584dac5cc70d8682f71c4675a843c309 with OFFLINE state > > > > > > > 2011-06-29 16:40:00,877 DEBUG > > > > > > > org.apache.hadoop.hbase.master.AssignmentManager: No previous > > > > > transition > > > > > > > plan was found (or we are ignoring an existing plan) for > > > > > > > > > > > > > > > > > > > > > > > > > > > > gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. > > > > > > > so generated a random one; > > > > > > > > > > > > > > > > > > > > > > > > > > > > hri=gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309., > > > > > > > src=, dest=hadoop1-s05.farm-ny.gigya.com,60020,1307349217076; > 5 > > > > > > (online=5, > > > > > > > exclude=null) available servers > > > > > > > 2011-06-29 16:40:00,877 DEBUG > > > > > > > org.apache.hadoop.hbase.master.AssignmentManager: Assigning > > region > > > > > > > > > > > > > > > > > > > > > > > > > > > > gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. > > > > > > > to hadoop1-s05.farm-ny.gigya.com,60020,1307349217076 > > > > > > > 2011-06-29 16:43:57,879 INFO > > > > > > > org.apache.hadoop.hbase.master.AssignmentManager: Regions in > > > > transition > > > > > > > timed out: > > > > > > > > > > > > > > > > > > > > > > > > > > > > gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. > > > > > > > state=PENDING_OPEN, ts=1309380052723 > > > > > > > 2011-06-29 16:43:57,880 INFO > > > > > > > org.apache.hadoop.hbase.master.AssignmentManager: Region has > been > > > > > > > PENDING_OPEN for too long, reassigning > > > > > > > > > > > > > > > > > > > > > > > > > > > > region=gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. > > > > > > > 2011-06-29 16:43:57,936 DEBUG > > > > > > > org.apache.hadoop.hbase.master.AssignmentManager: Forcing > > OFFLINE; > > > > > > > > > > > > > > > > > > > > > > > > > > > > was=gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. > > > > > > > state=PENDING_OPEN, ts=1309380052723 > > > > > > > 2011-06-29 16:43:57,936 DEBUG > > > > > > > org.apache.hadoop.hbase.master.AssignmentManager: No previous > > > > > transition > > > > > > > plan was found (or we are ignoring an existing plan) for > > > > > > > > > > > > > > > > > > > > > > > > > > > > gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. > > > > > > > so generated a random one; > > > > > > > > > > > > > > > > > > > > > > > > > > > > hri=gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309., > > > > > > > src=, dest=hadoop1-s02.farm-ny.gigya.com,60020,1306919627544; > 5 > > > > > > (online=5, > > > > > > > exclude=null) available servers > > > > > > > 2011-06-29 16:43:57,936 DEBUG > > > > > > > org.apache.hadoop.hbase.master.AssignmentManager: Assigning > > region > > > > > > > > > > > > > > > > > > > > > > > > > > > > gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. > > > > > > > to hadoop1-s02.farm-ny.gigya.com,60020,1306919627544 > > > > > > > 2011-06-29 16:43:59,022 DEBUG > > > > > > > org.apache.hadoop.hbase.master.AssignmentManager: Handling > > > > > > > transition=RS_ZK_REGION_OPENING, > > > > > > > server=hadoop1-s02.farm-ny.gigya.com,60020,1306919627544, > > > > > > > region=584dac5cc70d8682f71c4675a843c309 > > > > > > > 2011-06-29 16:43:59,221 DEBUG > > > > > > > org.apache.hadoop.hbase.master.AssignmentManager: Handling > > > > > > > transition=RS_ZK_REGION_OPENING, > > > > > > > server=hadoop1-s02.farm-ny.gigya.com,60020,1306919627544, > > > > > > > region=584dac5cc70d8682f71c4675a843c309 > > > > > > > 2011-06-29 16:43:59,226 DEBUG > > > > > > > org.apache.hadoop.hbase.master.AssignmentManager: Handling > > > > > > > transition=RS_ZK_REGION_OPENED, > > > > > > > server=hadoop1-s02.farm-ny.gigya.com,60020,1306919627544, > > > > > > > region=584dac5cc70d8682f71c4675a843c309 > > > > > > > 2011-06-29 16:43:59,274 DEBUG > > > > > > > org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: > > > Handling > > > > > > OPENED > > > > > > > event for 584dac5cc70d8682f71c4675a843c309; deleting unassigned > > > node > > > > > > > 2011-06-29 16:43:59,274 DEBUG > > > > > org.apache.hadoop.hbase.zookeeper.ZKAssign: > > > > > > > master:60000-0x13004a31d7804c4 Deleting existing unassigned > node > > > for > > > > > > > 584dac5cc70d8682f71c4675a843c309 that is in expected state > > > > > > > RS_ZK_REGION_OPENED > > > > > > > 2011-06-29 16:43:59,296 DEBUG > > > > > org.apache.hadoop.hbase.zookeeper.ZKAssign: > > > > > > > master:60000-0x13004a31d7804c4 Successfully deleted unassigned > > node > > > > for > > > > > > > region 584dac5cc70d8682f71c4675a843c309 in expected state > > > > > > > RS_ZK_REGION_OPENED > > > > > > > 2011-06-29 16:43:59,375 WARN > > > > > > > org.apache.hadoop.hbase.master.AssignmentManager: Overwriting > > > > > > > 584dac5cc70d8682f71c4675a843c309 on > > > > > > > serverName=hadoop1-s05.farm-ny.gigya.com,60020,1307349217076, > > > > > > > load=(requests=0, regions=1273, usedHeap=2676, maxHeap=12483) > > > > > > > 2011-06-29 16:43:59,375 DEBUG > > > > > > > org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: > > Opened > > > > > region > > > > > > > > > > > > > > > > > > > > > > > > > > > > gs_raw_events,GSLoad_1308518553_168_WEB204,1308533970928.584dac5cc70d8682f71c4675a843c309. > > > > > > > on hadoop1-s02.farm-ny.gigya.com,60020,1306919627544 > > > > > > > > > > > > > > Thanks. > > > > > > > > > > > > > > -eran > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Fri, Jul 1, 2011 at 09:05, Stack <st...@duboce.net> wrote: > > > > > > > > > > > > > >> So, Eran, it seems as though two RegionServers were carrying > the > > > > > > >> region? One deleted a file (compaction on its side)? Can you > > > > figure > > > > > > >> if indeed two servers had same region? (Check master logs for > > > this > > > > > > >> regions assignments). > > > > > > >> > > > > > > >> What version of hbase? > > > > > > >> > > > > > > >> St.Ack > > > > > > >> > > > > > > >> > > > > > > >> On Thu, Jun 30, 2011 at 3:58 AM, Eran Kutner <e...@gigya.com> > > > > wrote: > > > > > > >> > Hi, > > > > > > >> > I have a cluster of 5 nodes with one large table that > > currently > > > > has > > > > > > >> around > > > > > > >> > 12000 regions. Everything was working fine for relatively > long > > > > time, > > > > > > >> until > > > > > > >> > now. > > > > > > >> > Yesterday I significantly reduced the TTL on the table and > > > > initiated > > > > > > >> major > > > > > > >> > compaction. This should have reduced the table size to about > > 20% > > > > of > > > > > > its > > > > > > >> > original size. > > > > > > >> > Today, I'm getting errors of inaccessible files on HDFS, for > > > > > example: > > > > > > >> > java.io.IOException: Got error in response to OP_READ_BLOCK > > > self=/ > > > > > > >> > 10.1.104.2:58047, remote=/10.1.104.2:50010 for file > > > > > > >> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > /hbase/gs_raw_events/584dac5cc70d8682f71c4675a843c309/events/1971818821800304360 > > > > > > >> > for block 3674866614142268536_674205 > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.hdfs.DFSClient$BlockReader.newBlockReader(DFSClient.java:1487) > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1811) > > > > > > >> > at > > > > > > >> > > > > > > > > > > > > > org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1948) > > > > > > >> > at > > java.io.DataInputStream.read(DataInputStream.java:132) > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.hbase.io.hfile.BoundedRangeFileInputStream.read(BoundedRangeFileInputStream.java:105) > > > > > > >> > at > > > > > > java.io.BufferedInputStream.fill(BufferedInputStream.java:218) > > > > > > >> > at > > > > > > java.io.BufferedInputStream.read(BufferedInputStream.java:237) > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.io.compress.BlockDecompressorStream.rawReadInt(BlockDecompressorStream.java:128) > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:68) > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:75) > > > > > > >> > at > > > > > > java.io.BufferedInputStream.read1(BufferedInputStream.java:256) > > > > > > >> > at > > > > > > java.io.BufferedInputStream.read(BufferedInputStream.java:317) > > > > > > >> > at > > > org.apache.hadoop.io.IOUtils.readFully(IOUtils.java:102) > > > > > > >> > at > > > > > > >> > > > > > > > > > > > > > org.apache.hadoop.hbase.io.hfile.HFile$Reader.decompress(HFile.java:1094) > > > > > > >> > at > > > > > > >> > > > > > > > > > > > > > org.apache.hadoop.hbase.io.hfile.HFile$Reader.readBlock(HFile.java:1036) > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.hbase.io.hfile.HFile$Reader$Scanner.seekTo(HFile.java:1433) > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekAtOrAfter(StoreFileScanner.java:139) > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.hbase.regionserver.StoreFileScanner.seek(StoreFileScanner.java:96) > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.hbase.regionserver.StoreScanner.<init>(StoreScanner.java:77) > > > > > > >> > at > > > > > > >> > > > > > > > > org.apache.hadoop.hbase.regionserver.Store.getScanner(Store.java:1341) > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.hbase.regionserver.HRegion$RegionScanner.<init>(HRegion.java:2269) > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.hbase.regionserver.HRegion.instantiateInternalScanner(HRegion.java:1126) > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1118) > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.hbase.regionserver.HRegion.getScanner(HRegion.java:1102) > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.hbase.regionserver.HRegionServer.openScanner(HRegionServer.java:1781) > > > > > > >> > at > sun.reflect.GeneratedMethodAccessor46.invoke(Unknown > > > > > Source) > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > > > > > > >> > at java.lang.reflect.Method.invoke(Method.java:597) > > > > > > >> > at > > > > > > >> > > > > > org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570) > > > > > > >> > at > > > > > > >> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039) > > > > > > >> > > > > > > > >> > I checked and the file, indeed doesn't exist on HDFS, here > is > > > the > > > > > name > > > > > > >> node > > > > > > >> > logs for this block, apparently because it was deleted: > > > > > > >> > 2011-06-19 21:39:36,651 INFO > > org.apache.hadoop.hdfs.StateChange: > > > > > > BLOCK* > > > > > > >> > NameSystem.allocateBlock: > > > > > > >> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > /hbase/gs_raw_events/584dac5cc70d8682f71c4675a843c309/.tmp/2096863423111131624. > > > > > > >> > blk_3674866614142268536_674205 > > > > > > >> > 2011-06-19 21:40:11,954 INFO > > org.apache.hadoop.hdfs.StateChange: > > > > > > BLOCK* > > > > > > >> > NameSystem.addStoredBlock: blockMap updated: > 10.1.104.2:50010 > > is > > > > > > added > > > > > > >> to > > > > > > >> > blk_3674866614142268536_674205 size 67108864 > > > > > > >> > 2011-06-19 21:40:11,954 INFO > > org.apache.hadoop.hdfs.StateChange: > > > > > > BLOCK* > > > > > > >> > NameSystem.addStoredBlock: blockMap updated: > 10.1.104.3:50010 > > is > > > > > > added > > > > > > >> to > > > > > > >> > blk_3674866614142268536_674205 size 67108864 > > > > > > >> > 2011-06-19 21:40:11,955 INFO > > org.apache.hadoop.hdfs.StateChange: > > > > > > BLOCK* > > > > > > >> > NameSystem.addStoredBlock: blockMap updated: > 10.1.104.5:50010 > > is > > > > > > added > > > > > > >> to > > > > > > >> > blk_3674866614142268536_674205 size 67108864 > > > > > > >> > 2011-06-29 20:20:01,662 INFO > > org.apache.hadoop.hdfs.StateChange: > > > > > > BLOCK* > > > > > > >> ask > > > > > > >> > 10.1.104.2:50010 to delete blk_3674866614142268536_674205 > > > > > > >> > 2011-06-29 20:20:13,671 INFO > > org.apache.hadoop.hdfs.StateChange: > > > > > > BLOCK* > > > > > > >> ask > > > > > > >> > 10.1.104.5:50010 to delete blk_-4056387895369608597_675174 > > > > > > >> > blk_-5017882805850873821_672281 > blk_702373987100607684_672288 > > > > > > >> > blk_-5357157478043290010_668506 > blk_7118175133735412789_674903 > > > > > > >> > blk_-3569812563715986384_675231 > blk_8296855057240604851_669285 > > > > > > >> > blk_-6483679172530609101_674268 > blk_8738539715363739108_673682 > > > > > > >> > blk_1744841904626813502_675238 > blk_-6035315106100051103_674266 > > > > > > >> > blk_-1789501623010070237_674908 > blk_1944054629336265129_673689 > > > > > > >> > blk_3674866614142268536_674205 > blk_7930425446738143892_647410 > > > > > > >> > blk_-3007186753042268449_669296 > > blk_-5482302621772778061_647416 > > > > > > >> > blk_-3765735404924932181_672004 > blk_7476090998956811081_675169 > > > > > > >> > blk_7862291659285127712_646890 > blk_-2666244746343584727_672013 > > > > > > >> > blk_6039172613960915602_674206 > blk_-8470884397893086564_646899 > > > > > > >> > blk_4558230221166712802_668510 > > > > > > >> > 2011-06-29 20:20:46,698 INFO > > org.apache.hadoop.hdfs.StateChange: > > > > > > BLOCK* > > > > > > >> ask > > > > > > >> > 10.1.104.3:50010 to delete blk_-7851606440036350812_671552 > > > > > > >> > blk_9214649160203453845_647566 blk_702373987100607684_672288 > > > > > > >> > blk_5958099369749234073_668143 > blk_-5172218034084903173_673109 > > > > > > >> > blk_-2934555181472719276_646476 > > blk_-1409986679370073931_672552 > > > > > > >> > blk_-2786034325506235869_669086 > blk_3674866614142268536_674205 > > > > > > >> > blk_510158930393283118_673225 blk_916244738216205237_677068 > > > > > > >> > blk_-4317027806407316617_670379 > blk_8555705688850972639_673485 > > > > > > >> > blk_-3765735404924932181_672004 > > blk_-5482302621772778061_647416 > > > > > > >> > blk_-2461801145731752623_674605 > > blk_-8737702908048998927_672549 > > > > > > >> > blk_-8470884397893086564_646899 > blk_4558230221166712802_668510 > > > > > > >> > blk_-4056387895369608597_675174 > > blk_-8675430610673886073_647695 > > > > > > >> > blk_-6642870230256028318_668211 > > blk_-3890408516362176771_677483 > > > > > > >> > blk_-3569812563715986384_675231 > > blk_-5007142629771321873_674548 > > > > > > >> > blk_-3345355191863431669_667066 > blk_8296855057240604851_669285 > > > > > > >> > blk_-6595462308187757470_672420 > > blk_-2583945228783203947_674607 > > > > > > >> > blk_-346988625120916345_677063 > blk_4449525876338684218_674496 > > > > > > >> > blk_2617172363857549730_668201 > blk_8738539715363739108_673682 > > > > > > >> > blk_-208904675456598428_679286 > blk_-497549341281882641_646477 > > > > > > >> > blk_-6035315106100051103_674266 > > blk_-2356539038067297411_672388 > > > > > > >> > blk_-3881703084497103249_668137 > blk_2214397881104950315_646643 > > > > > > >> > blk_-5907671443455357710_673223 > > blk_-2431880309956605679_669204 > > > > > > >> > blk_6039172613960915602_674206 > blk_5053643911633142711_669194 > > > > > > >> > blk_-2636977729205236686_674664 > > > > > > >> > > > > > > > >> > I assume the file loss is somehow related to this change and > > the > > > > > major > > > > > > >> > compaction that followed because the same scan that is > failing > > > now > > > > > was > > > > > > >> > working fine yesterday and that is the only changed that > > > happened > > > > on > > > > > > the > > > > > > >> > cluster. Any suggestions what to do now? > > > > > > >> > > > > > > > >> > Thanks. > > > > > > >> > > > > > > > >> > -eran > > > > > > >> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > >