Just FYI, there are known bugs in 0.20.0; I strongly urge you to get onto 0.20.3 ASAP, either on your own or as soon as CDH includes it.
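
(For the archives, since this is the part that bites people on 0.20.0: the shell's truncate is just a disable/drop/recreate of the same table, and 0.20.0 could wedge a region partway through that sequence. If it gets stuck again before you upgrade, the manual equivalent from the hbase shell is roughly the below -- a sketch only, the 'cf' column family name is made up, substitute your real schema:

  hbase(main):001:0> disable 'retargeting'
  hbase(main):002:0> drop 'retargeting'
  hbase(main):003:0> create 'retargeting', {NAME => 'cf', COMPRESSION => 'NONE', IN_MEMORY => 'false'}

COMPRESSION => 'NONE' and IN_MEMORY => 'false' are the defaults anyway; they're only spelled out here because the original table had LZO and IN_MEMORY set while the LZO libraries weren't installed.)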
On Wed, Jan 27, 2010 at 2:41 PM, James Baldassari <ja...@dataxu.com> wrote:
> Thank you for the suggestions. I think we have managed to resolve the problem. Due to our tight deadline on this project we weren't able to change one parameter, retest, and then change another, so I'm not sure exactly which change(s) fixed the problem.
>
> First we shut down the master and all region servers and then manually removed the /hbase root through hadoop/HDFS. One of my colleagues increased some timeout values (I think they were ZooKeeper timeouts). Another change was that I recreated the table without LZO compression and without setting the IN_MEMORY flag. I learned that we did not have the LZO libraries installed, and the table had been created originally with compression set to LZO, so I imagine that would cause problems. I didn't see any errors about it in the logs, however. Maybe this explains why we lost data during our initial testing after shutting down HBase. Perhaps it was unable to write the data to HDFS because the LZO libraries were not available?
>
> Anyway, everything seems to be ok for now. We can restart HBase without data loss or errors, and we can truncate the table without any problems. If any other issues crop up we plan on upgrading to 0.20.3, but our preference is to stay with the Cloudera distro if we can. We're doing additional testing tonight with a larger dataset, so I'll keep an eye on it and post back if we learn anything new.
>
> Thanks again for your help.
>
> -James
>
>
> On Wed, 2010-01-27 at 13:54 -0600, Stack wrote:
>> On Tue, Jan 26, 2010 at 9:03 PM, James Baldassari <ja...@dataxu.com> wrote:
>> >
>> > After running a map/reduce job which inserted around 180,000 rows into HBase, HBase appeared to be fine. We could do a count on our table, and no errors were reported. We then tried to truncate the table in preparation for another test but were unable to do so because the region became stuck in a transition state.
>>
>> Yes. In older hbase, truncate of small tables was flakey. It's better in 0.20.3 (I wrote our brothers over at Cloudera about updating the version they bundle, especially since 0.20.3 just went out).
>>
>> > I restarted each region server individually, but it did not fix the problem. I tried the disable_region and close_region commands from the hbase shell, but that didn't work either. After doing all of that, a status 'detailed' showed this:
>> >
>> > 1 regionsInTransition
>> > name=retargeting,,1264546222144, unassigned=false, pendingOpen=false, open=false, closing=true, pendingClose=false, closed=false, offlined=false
>> >
>> > Then I restarted the master and all region servers, and it looked like this:
>> >
>> > 1 regionsInTransition
>> > name=retargeting,,1264546222144, unassigned=false, pendingOpen=true, open=false, closing=false, pendingClose=false, closed=false, offlined=false
>>
>> Even after a master restart? The above is a dump of a master internal datastructure that is kept in-memory. Strange that it would pick up the same exact state on restart (as Ryan says, a restart of the master alone is usually a radical but sufficient fix).
>>
>> I was going to say that you try onlining the individual region in the shell, but I don't think that'll work either, not unless you update to 0.20.3-era hbase.
>>
>> > I noticed messages in some of the region server logs indicating that their zookeeper sessions had expired. I'm not sure if this has anything to do with the problem.
>>
>> It could. The regionservers will restart if their session w/ zk expires. What's your hbase schema like? How are you doing your upload?
>>
>> > I should mention that this scenario is quite repeatable, and the last few times it has happened we had to shut down HBase and manually remove the /hbase root from HDFS, then start HBase and recreate the table.
>>
>> For sure you've upped file descriptors and xceiver params as per the Getting Started?
>>
>> > I was also wondering whether it was normal for there to be only one region with 180,000+ rows. Shouldn't this region be split into several regions and distributed among the region servers? I'm new to HBase, so maybe my understanding of how it's supposed to work is wrong.
>>
>> Get the region's size on the filesystem: ./bin/hadoop fs -dus /hbase/table/regionname. A region splits when it's above a size threshold, 256M usually.
>>
>> St.Ack
>>
>> >
>> > Thanks,
>> > James
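
P.S. A couple of concrete snippets for the items above, with illustrative paths and numbers -- check the Getting Started for the current recommendations:

  # How big is the region on HDFS? Layout is /hbase/<table>/<region dir>;
  # a region only splits once it grows past hbase.hregion.max.filesize
  # (256MB by default in 0.20), so 180k smallish rows staying in one
  # region is expected behaviour.
  ./bin/hadoop fs -dus /hbase/retargeting/<regionname>

  # Getting Started prerequisites worth re-checking on every node:
  ulimit -n   # open-file limit; raise it (e.g. nofile 32768 in limits.conf)
  # ...and dfs.datanode.max.xcievers in hdfs-site.xml (note the spelling),
  # typically bumped to 2048 or more for HBase.

For the zookeeper session expirations, the knob on the HBase side is zookeeper.session.timeout in hbase-site.xml, though whether raising it helps depends on why the sessions are expiring in the first place.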