Just FYI, there are known bugs in 0.20.0; I strongly urge you to get onto 0.20.3 ASAP, either on your own or as soon as CDH includes it.
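
(For the archives, since this is the part that bites people on 0.20.0: the shell's truncate is just a disable/drop/recreate of the same table, and 0.20.0 could wedge a region partway through that sequence. If it gets stuck again before you upgrade, the manual equivalent from the hbase shell is roughly the below -- a sketch only, the 'cf' column family name is made up, substitute your real schema:

  hbase(main):001:0> disable 'retargeting'
  hbase(main):002:0> drop 'retargeting'
  hbase(main):003:0> create 'retargeting', {NAME => 'cf', COMPRESSION => 'NONE', IN_MEMORY => 'false'}

COMPRESSION => 'NONE' and IN_MEMORY => 'false' are the defaults anyway; they're only spelled out here because the original table had LZO and IN_MEMORY set while the LZO libraries weren't installed.)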
On Wed, Jan 27, 2010 at 2:41 PM, James Baldassari <ja...@dataxu.com> wrote:
> Thank you for the suggestions. I think we have managed to resolve the problem. Due to our tight deadline on this project we weren't able to change one parameter, retest, and then change another, so I'm not sure exactly which change(s) fixed the problem.
>
> First we shut down the master and all region servers and then manually removed the /hbase root through hadoop/HDFS. One of my colleagues increased some timeout values (I think they were ZooKeeper timeouts). Another change was that I recreated the table without LZO compression and without setting the IN_MEMORY flag. I learned that we did not have the LZO libraries installed, and the table had been created originally with compression set to LZO, so I imagine that would cause problems. I didn't see any errors about it in the logs, however. Maybe this explains why we lost data during our initial testing after shutting down HBase. Perhaps it was unable to write the data to HDFS because the LZO libraries were not available?
>
> Anyway, everything seems to be ok for now. We can restart HBase without data loss or errors, and we can truncate the table without any problems. If any other issues crop up we plan on upgrading to 0.20.3, but our preference is to stay with the Cloudera distro if we can. We're doing additional testing tonight with a larger dataset, so I'll keep an eye on it and post back if we learn anything new.
>
> Thanks again for your help.
>
> -James
>
>
> On Wed, 2010-01-27 at 13:54 -0600, Stack wrote:
>> On Tue, Jan 26, 2010 at 9:03 PM, James Baldassari <ja...@dataxu.com> wrote:
>> >
>> > After running a map/reduce job which inserted around 180,000 rows into HBase, HBase appeared to be fine. We could do a count on our table, and no errors were reported. We then tried to truncate the table in preparation for another test but were unable to do so because the region became stuck in a transition state.
>>
>> Yes. In older hbase, truncate of small tables was flakey. It's better in 0.20.3 (I wrote our brothers over at Cloudera about updating the version they bundle, especially since 0.20.3 just went out).
>>
>> > I restarted each region server individually, but it did not fix the problem. I tried the disable_region and close_region commands from the hbase shell, but that didn't work either. After doing all of that, a status 'detailed' showed this:
>> >
>> > 1 regionsInTransition
>> > name=retargeting,,1264546222144, unassigned=false, pendingOpen=false, open=false, closing=true, pendingClose=false, closed=false, offlined=false
>> >
>> > Then I restarted the master and all region servers, and it looked like this:
>> >
>> > 1 regionsInTransition
>> > name=retargeting,,1264546222144, unassigned=false, pendingOpen=true, open=false, closing=false, pendingClose=false, closed=false, offlined=false
>>
>> Even after a master restart? The above is a dump of a master internal datastructure that is kept in-memory. Strange that it would pick up the same exact state on restart (as Ryan says, a restart of the master alone is usually a radical but sufficient fix).
>>
>> I was going to say that you try onlining the individual region in the shell, but I don't think that'll work either, not unless you update to 0.20.3-era hbase.
>>
>> > I noticed messages in some of the region server logs indicating that their zookeeper sessions had expired. I'm not sure if this has anything to do with the problem.
>>
>> It could. The regionservers will restart if their session w/ zk expires. What's your hbase schema like? How are you doing your upload?
>>
>> > I should mention that this scenario is quite repeatable, and the last few times it has happened we had to shut down HBase and manually remove the /hbase root from HDFS, then start HBase and recreate the table.
>>
>> For sure you've upped file descriptors and xceiver params as per the Getting Started?
>>
>> > I was also wondering whether it was normal for there to be only one region with 180,000+ rows. Shouldn't this region be split into several regions and distributed among the region servers? I'm new to HBase, so maybe my understanding of how it's supposed to work is wrong.
>>
>> Get the region's size on the filesystem: ./bin/hadoop fs -dus /hbase/table/regionname. A region splits when it's above a size threshold, 256M usually.
>>
>> St.Ack
>>
>> >
>> > Thanks,
>> > James
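
P.S. A couple of concrete snippets for the items above, with illustrative paths and numbers -- check the Getting Started for the current recommendations:

  # How big is the region on HDFS? Layout is /hbase/<table>/<region dir>;
  # a region only splits once it grows past hbase.hregion.max.filesize
  # (256MB by default in 0.20), so 180k smallish rows staying in one
  # region is expected behaviour.
  ./bin/hadoop fs -dus /hbase/retargeting/<regionname>

  # Getting Started prerequisites worth re-checking on every node:
  ulimit -n   # open-file limit; raise it (e.g. nofile 32768 in limits.conf)
  # ...and dfs.datanode.max.xcievers in hdfs-site.xml (note the spelling),
  # typically bumped to 2048 or more for HBase.

For the zookeeper session expirations, the knob on the HBase side is zookeeper.session.timeout in hbase-site.xml, though whether raising it helps depends on why the sessions are expiring in the first place.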