The client does not try to upload the same row again and again. The HBase client retries a few times internally, but if the exception escapes to the client application, it is logged and the application moves on. The client application's log (store.log) actually shows some successes mixed in among the failures.
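For context, the retry behaviour described here (a bounded number of internal attempts, after which the exception escapes to the application, which logs it and continues) can be sketched roughly as below. This is an illustrative model, not HBase's actual client code; the function names, the retry count of 5, and the backoff are assumptions:

```python
import time

def put_with_retries(do_put, row, max_retries=5, backoff_secs=0.01):
    """Attempt a single-row put, retrying a bounded number of times.

    `do_put` stands in for whatever actually commits the row. After
    max_retries failed attempts, the last exception escapes to the
    caller, which can log it and move on to the next row.
    """
    last_exc = None
    for attempt in range(max_retries):
        try:
            return do_put(row)
        except Exception as exc:  # e.g. a WrongRegionException surfacing
            last_exc = exc
            time.sleep(backoff_secs * (2 ** attempt))  # simple backoff
    raise last_exc

def upload(rows, do_put, log, **retry_kw):
    """Upload rows one at a time; log failures and keep going."""
    for row in rows:
        try:
            put_with_retries(do_put, row, **retry_kw)
        except Exception as exc:
            log.append((row, exc))  # logged, then the loop continues
```

With this shape, a persistently failing row shows up as a burst of retries in the log, followed by successes for subsequent rows, which matches what store.log shows.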
My reading of the log file is not the same as yours. It looks to me as if each row is tried 5 times, throwing WREs each time, before moving on to another row. All the errors do seem to concern the same region though (pagefetch,http://fun.twilightwap.com/rate.asp?joke_id=183&rating=0 wap2 20080102055026,1202660655358, startKey='http://fun.twilightwap.com/rate.asp?joke_id=183&rating=0 wap2 20080102055026', getEndKey()='http://fun.twilightwap.com/rate.asp?joke_id=183&rating=0 wap2 20080102055026').

I tried stopping the client application and restarting it at the point where it failed, with no success. I tried restarting the region server and master server too, also without success.

- Marc

P.S. Should this discussion be happening in JIRA, or here, or both?

On Mon, 2008-02-11 at 11:27 -0800, stack wrote:
> Marc Harris wrote:
> > Logs sent via yousendit.com.
> >
> Thanks for the logs. I took a quick look. Upload seems to be going
> along fine until we start getting the WrongRegionException. In issue
> HBASE-428, you say your client is single-threaded. Is it thick-headed
> too (smile) in that it unrelentingly keeps trying the same row over and
> over? (The log seems to have prob. w/ the same row over and over again.)
>
> Guessing as to what is up, either the client cache of regions is messed
> up or the .META. table has become corrupt somehow -- it doesn't have a
> list of all regions (perhaps it didn't get a split update or some such).
>
> If the former, I wonder what would happen if you took your load off,
> killed the client, then resumed at the problematic row? If things
> started to work again, that would seem to point at a client-side issue.
>
> > Maybe "re-architect" was not an accurate representation of what I am
> > doing. We currently do not have a solution that allows us to add rows
> > to our system in arbitrary order and then analyze them, either in
> > order or using map-reduce.
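For what it's worth, the "client cache of regions" mentioned in the quoted message is essentially a sorted map from region start keys to region locations: the client routes a row to the cached region with the greatest start key <= the row key, and the server rejects the write if that region no longer actually covers the key. The toy model below (invented names, not the real HBase client code) illustrates how a cache left stale by a split can keep routing rows at a region that no longer contains them:

```python
import bisect

class WrongRegionException(Exception):
    pass

class RegionCache:
    """Toy client-side cache of (start_key, end_key, name) per region.

    end_key == "" means the region is unbounded on the right.
    """
    def __init__(self, regions):
        self.regions = sorted(regions)
        self._starts = [r[0] for r in self.regions]

    def locate(self, row_key):
        # Greatest start key <= row_key.
        i = bisect.bisect_right(self._starts, row_key) - 1
        if i < 0:
            raise KeyError(row_key)
        return self.regions[i]

def serve_put(region, row_key):
    """Server-side range check before accepting a write."""
    start, end, name = region
    if row_key < start or (end != "" and row_key >= end):
        raise WrongRegionException(f"{row_key!r} not in region {name}")
```

In this model, if region ['m', '') splits into ['m', 't') and ['t', '') but the client's cached entry predates the split, rows at or beyond 't' keep being sent to the first daughter region and rejected on every attempt, which would look exactly like the repeated WREs for a single region in the log.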
> > A year or so ago we tried an RDBMS, and based on that experience, and
> > some comments from Doug Cutting, decided that an RDBMS had no chance
> > of being able to support this kind of functionality.
> >
> > In terms of performance parameters, the 200 rows/sec that was achieved
> > for the first 500K rows was quite sufficient. I don't have a good
> > answer because after all these rows get loaded there will be numerous
> > map/reduce jobs that execute on them. I would guess that some vague
> > parameters are:
> >
> > - In 3 days, load 100Gb of data representing 10M "units" split over 3
> >   tables, each of which is split over 3 column families. Some fraction
> >   of these "units" will be replacements for existing ones (same key);
> >   some will be new.
> > - Several map-reduce jobs that mostly involve reading the data for
> >   each "unit" and then writing a few small pieces of data (a few
> >   bytes) for each "unit". Probably some more interesting maps too, but
> >   I don't know yet.
> > - At least 2 map-reduce jobs that delete units.
> >
> These numbers look reasonable to me. Let's try and make it work.
>
> > Am I correct when I say that using 4 region servers will just delay
> > the problem by a factor of 4, or have I misunderstood the underlying
> > cause?
> >
> Yes. The factor might be > 4 but effectively, if there is an issue
> using a single server, then the same issue will arise with N nodes.
>
> St.Ack
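As a back-of-the-envelope check on the load quoted above (not a figure from the thread itself): 10M units over 3 days needs a sustained rate of only about 39 units/sec, comfortably under the 200 rows/sec Marc saw for the first 500K rows, and 100GB over the same window is under half a megabyte per second (assuming "100Gb" means gigabytes):

```python
units = 10_000_000
seconds = 3 * 24 * 3600              # 3 days = 259,200 seconds
required_rate = units / seconds      # ~38.6 units/sec sustained

bytes_total = 100 * 1024**3          # assuming "100Gb" means 100 GiB
required_bw = bytes_total / seconds  # bytes/sec, roughly 0.4 MiB/sec
```

So the numbers are modest; the problem is the WrongRegionException stall, not raw throughput.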
