On Sat, Apr 16, 2011 at 12:27 AM, Gaojinchao <[email protected]> wrote: > Reproducing operarion(Region has been OPEN for too long): > 1. some region server(has 1882 regions ) crashed and wait balance finished.
Thats a lot of regions. You should up your region size. > 2.stop hbase cluster Had server recovery completed I wonder? Had all regions from crashed server been handed out? > 3.start cluster.(it happened) > Are the opening too long messages coming because we were still assigning regions on shutdown. When we came back up, we carried on w/ assignments but the restart probably took a while on your cluster -- given as you have ~2k regions per server -- and so assignment timed out. Also, on startup, if lots of regions, the bulk assign on startup may not complete before the timer cuts back in and you'll see some timeouts on assignment. Thats ok as long as we eventually assign out all regions. Update to 0.90.2 too? There were some fixes around assignment that may help here. St.Ack > My case is different for yours. > Hmaster had received RS_ZK_REGION_OPENED. But processing OpenedRegionHandler > is slow. > It seems no bugs. I try to change parameter > hbase.master.executor.openregion.threads. > > > 2011-04-16 09:48:48,666 DEBUG > org.apache.hadoop.hbase.master.AssignmentManager: Handling > transition=RS_ZK_REGION_OPENED, server=C4C5.site,60020,1302918210943, > region=00621f2d5f3b0f7949bbe638d78e96a5 > 2011-04-16 09:49:20,770 INFO > org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed > out: ufdrA,5198613861538448,1301646236548.00621f2d5f3b0f7949bbe638d78e96a5. > state=OPEN, ts=1302918528615 > 2011-04-16 09:49:29,371 DEBUG > org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Handling OPENED > event for 00621f2d5f3b0f7949bbe638d78e96a5; deleting unassigned node > 2011-04-16 09:49:29,371 DEBUG org.apache.hadoop.hbase.zookeeper.ZKAssign: > master:60000-0x2f5bdad8910000-0x2f5bdad8910000-0x2f5bdad8910000 Deleting > existing unassigned node for 00621f2d5f3b0f7949bbe638d78e96a5 that is in > expected state RS_ZK_REGION_OPENED > > -----邮件原件----- > 发件人: [email protected] [mailto:[email protected]] 代表 Jean-Daniel Cryans > 发送时间: 2011年4月15日 3:20 > 收件人: [email protected] > 主题: Re: a lots of error about Region has been OPEN for too long > > Taking a more deeper look at the code, the assumption I had that the > message was sitting in the queue for too long is erroneous. The reason > is that when the master says "Handling > transition=RS_ZK_REGION_OPENED", it updates the timestamp that we use > to keep track of the timeout which means that unless it stayed more > than 60 seconds in the queue (it took only 3 here it seems), the issue > is elsewhere. > > So logically the other reason would be that the master got the > information too late, since we use the timestamp provided by the > region server. Looking in the region server log, I was able to find: > > 2011-04-05 15:11:12,946 DEBUG > org.apache.hadoop.hbase.zookeeper.ZKAssign: > regionserver:60020-0x22e286f0b9c98f7-0x22e286f0b9c98f7 Successfully > transitioned node 0db7d1f58e4fced0a371aded0ddec281 from > RS_ZK_REGION_OPENING to RS_ZK_REGION_OPENED > > This is 1 minute 7 seconds before the master got the message (I > verified that our clocks are in perfect sync). This means this is > another queue that's too slow (the ones that processes all the events > coming from zookeeper). > > I'll keep digging. > > J-D > > On Thu, Apr 14, 2011 at 10:58 AM, Jean-Daniel Cryans > <[email protected]> wrote: >> (please don't leave unrelated discussions at the tail of your emails) >> >> So I thought I never got that issue but wanted to make sure so I >> grepped my logs and indeed saw I got it, so I what I did is that I >> grepped the name of one of the regions that got the issue and looked >> at what was happening at that time (which you should do in the >> future). I see something like this: >> >> 2011-04-05 15:12:19,037 DEBUG >> org.apache.hadoop.hbase.zookeeper.ZKUtil: >> master:60000-0x42ec2cece810b68 Retrieved 115 byte(s) of data from >> znode /prodjobs/unassigned/0db7d1f58e4fced0a371aded0ddec281 and set >> watcher; region=tsdb,�M<��,1297818092053.0db7d1f58e4fced0a371aded0ddec281., >> server=sv4borg36,60020,1300313562191, state=RS_ZK_REGION_OPENED >> 2011-04-05 15:12:19,037 DEBUG >> org.apache.hadoop.hbase.master.AssignmentManager: Handling >> transition=RS_ZK_REGION_OPENED, server=sv4borg36,60020,1300313562191, >> region=0db7d1f58e4fced0a371aded0ddec281 >> ... >> 2011-04-05 15:12:19,585 INFO >> org.apache.hadoop.hbase.master.AssignmentManager: Regions in >> transition timed out: >> tsdb,\x00\x03\xCBM<}\x08\x00\x00\x01\x00\x00\x8A\x00\x00\x1D\x00\x01\xD1,1297818092053.0db7d1f58e4fced0a371aded0ddec281. >> state=OPEN, ts=1302041472920 >> 2011-04-05 15:12:19,585 ERROR >> org.apache.hadoop.hbase.master.AssignmentManager: Region has been OPEN >> for too long, we don't know where region was opened so can't do >> anything >> ... >> 2011-04-05 15:12:22,504 DEBUG >> org.apache.hadoop.hbase.master.handler.OpenedRegionHandler: Handling >> OPENED event for 0db7d1f58e4fced0a371aded0ddec281; deleting unassigned >> node >> >> >> So if I understand this correctly, the master already got the message >> via ZooKeeper but it stayed in a queue for just long enough that the >> RIT timed out and finally the OpenedRegionHandler was able to process >> it. So in the end nothing looks broken, it just means that the master >> is processing a LOT of regions being opened, while it also took the >> region server a long time to get the region opened. >> >> There are currently a few states that don't get refreshed in ZK, for >> example when a region is sitting in the region server's queue of >> regions to be opened. Very often, when there's a lot of regions to >> open (and worse if the RS has to replay recovered edits), some regions >> in that state will timeout. This needs more thinking. >> >> J-D >> >> 2011/4/13 Gaojinchao <[email protected]>: >>> In hbase version 0.90.1 . >>> >>> Is there any experience ? >>> >>> Hmaster Logs : >>> 2011-04-08 16:33:09,384 ERROR >>> org.apache.hadoop.hbase.master.AssignmentManager: Region has been OPEN for >>> too long, we don't know where region was opened so can't do anything >>> 2011-04-08 16:33:09,384 ERROR >>> org.apache.hadoop.hbase.master.AssignmentManager: Region has been OPEN for >>> too long, we don't know where region was opened so can't do anything >>> 2011-04-08 16:33:09,384 ERROR >>> org.apache.hadoop.hbase.master.AssignmentManager: Region has been OPEN for >>> too long, we don't know where region was opened so can't do anything >>> 2011-04-08 16:33:09,384 ERROR >>> org.apache.hadoop.hbase.master.AssignmentManager: Region has been OPEN for >>> too long, we don't know where region was opened so can't do anything >>> 2011-04-08 16:33:09,384 ERROR >>> org.apache.hadoop.hbase.master.AssignmentManager: Region has been OPEN for >>> too long, we don't know where region was opened so can't do anything >>> 2011-04-08 16:33:09,384 ERROR >>> org.apache.hadoop.hbase.master.AssignmentManager: Region has been OPEN for >>> too long, we don't know where region was opened so can't do anything >>> 2011-04-08 16:33:09,384 ERROR >>> org.apache.hadoop.hbase.master.AssignmentManager: Region has been OPEN for >>> too long, we don't know where region was opened so can't do anything >>> 2011-04-08 16:33:09,384 ERROR >>> org.apache.hadoop.hbase.master.AssignmentManager: Region has been OPEN for >>> too long, we don't know where region was opened so can't do anything >>> 2011-04-08 16:33:09,384 ERROR >>> org.apache.hadoop.hbase.master.AssignmentManager: Region has been OPEN for >>> too long, we don't know where region was opened so can't do anything >>> 2011-04-08 16:33:09,384 ERROR >>> org.apache.hadoop.hbase.master.AssignmentManager: Region has been OPEN for >>> too long, we don't know where region was opened so can't do anything >>> 2011-04-08 16:33:09,384 ERROR >>> org.apache.hadoop.hbase.master.AssignmentManager: Region has been OPEN for >>> too long, we don't know where region was opened so can't do anything >>> 2011-04-08 16:33:09,384 ERROR >>> org.apache.hadoop.hbase.master.AssignmentManager: Region has been OPEN for >>> too long, we don't know where region was opened so can't do anything >>> 2011-04-08 16:33:09,384 ERROR >>> org.apache.hadoop.hbase.master.AssignmentManager: Region has been OPEN for >>> too long, we don't know where region was opened so can't do anything >>> 2011-04-08 16:33:09,384 ERROR >>> org.apache.hadoop.hbase.master.AssignmentManager: Region has been OPEN for >>> too long, we don't know where region was opened so can't do anything >>> 2011-04-08 16:33:09,385 ERROR >>> org.apache.hadoop.hbase.master.AssignmentManager: Region has been OPEN for >>> too long, we don't know where region was opened so can't do anything >>> 2011-04-08 16:33:09,385 ERROR >>> org.apache.hadoop.hbase.master.AssignmentManager: Region has been OPEN for >>> too long, we don't know where region was opened so can't do anything >>> 2011-04-08 16:33:09,385 ERROR >>> org.apache.hadoop.hbase.master.AssignmentManager: Region has been OPEN for >>> too long, we don't know where region was opened so can't do anything >>> 2011-04-08 16:33:09,385 ERROR >>> org.apache.hadoop.hbase.master.AssignmentManager: Region has been OPEN for >>> too long, we don't know where region was opened so can't do anything >>> 2011-04-08 16:33:09,385 ERROR >>> org.apache.hadoop.hbase.master.AssignmentManager: Region has been OPEN for >>> too long, we don't know where region was opened so can't do anything >>> 2011-04-08 16:33:09,385 ERROR >>> org.apache.hadoop.hbase.master.AssignmentManager: Region has been OPEN for >>> too long, we don't know where region was opened so can't do anything >>> 2011-04-08 16:33:09,385 ERROR >>> org.apache.hadoop.hbase.master.AssignmentManager: Region has been OPEN for >>> too long, we don't know where region was opened so can't do anything >>> 2011-04-08 16:33:09,385 ERROR >>> org.apache.hadoop.hbase.master.AssignmentManager: Region has been OPEN for >>> too long, we don't know where region was opened so can't do anything >>> 2011-04-08 16:33:09,385 ERROR >>> org.apache.hadoop.hbase.master.AssignmentManager: Region has been OPEN for >>> too long, we don't know where region was opened so can't do anything >>> 2011-04-08 16:33:09,385 ERROR >>> org.apache.hadoop.hbase.master.AssignmentManager: Region has been OPEN for >>> too long, we don't know where region was opened so can't do anything >>> 2011-04-08 16:33:09,385 ERROR >>> org.apache.hadoop.hbase.master.AssignmentManager: Region has been OPEN for >>> too long, we don't know where region was opened so can't do anything >>> 2011-04-08 16:33:09,385 ERROR >>> org.apache.hadoop.hbase.master.AssignmentManager: Region has been OPEN for >>> too long, we don't know where region was opened so can't do anything >>> 2011-04-08 16:33:09,385 ERROR >>> org.apache.hadoop.hbase.master.AssignmentManager: Region has been OPEN for >>> too long, we don't know where region was opened so can't do anything >>> 2011-04-08 16:33:09,385 ERROR >>> org.apache.hadoop.hbase.master.AssignmentManager: Region has been OPEN for >>> too long, we don't know where region was opened so can't do anything >>> 2011-04-08 16:33:09,385 ERROR >>> org.apache.hadoop.hbase.master.AssignmentManager: Region has been OPEN for >>> too long, we don't know where region was opened so can't do anything >>> 2011-04-08 16:33:09,385 ERROR >>> org.apache.hadoop.hbase.master.AssignmentManager: Region has been OPEN for >>> too long, we don't know where region was opened so can't do anything >>> 2011-04-08 16:33:09,385 ERROR >>> org.apache.hadoop.hbase.master.AssignmentManager: Region has been OPEN for >>> too long, we don't know where region was opened so can't do anything >>> 2011-04-08 16:33:09,385 ERROR >>> org.apache.hadoop.hbase.master.AssignmentManager: Region has been OPEN for >>> too long, we don't know where region was opened so can't do anything >>> 2011-04-08 16:33:09,385 ERROR >>> org.apache.hadoop.hbase.master.AssignmentManager: Region has been OPEN for >>> too long, we don't know where region was opened so can't do anything >>> 2011-04-08 16:33:09,385 ERROR >>> org.apache.hadoop.hbase.master.AssignmentManager: Region has been OPEN for >>> too long, we don't know where region was opened so can't do anything >>> 2011-04-08 16:33:09,385 ERROR >>> org.apache.hadoop.hbase.master.AssignmentManager: Region has been OPEN for >>> too long, we don't know where region was opened so can't do anything >>> >> >
