Done, by the way.

On Mon, 2008-02-11 at 13:54 -0800, stack wrote:
> Mind doing a 'select * from .META.;' in the HQL screen on your master?
> Going by the below, the .META. is corrupt: i.e. restart didn't fix it
> and it's the same row that it's complaining about. Stick the resulting
> page into the issue if you don't mind.
>
> Thanks,
> St.Ack
>
> Marc Harris wrote:
> > The client does not try to upload the same row again and again. The
> > hbase client tries a few times internally, but then if the exception
> > gets out to the client application, it is logged and the application
> > moves on. The client application's log (store.log) actually shows some
> > successes in among the failures.
> >
> > My reading of the log file is not the same as yours. It looks to me as
> > if each row is tried 5 times, throwing WREs each time, before moving on
> > to another row. All the errors do seem to be regarding the same region
> > though
> > (pagefetch,http://fun.twilightwap.com/rate.asp?joke_id=183&rating=0 wap2 20080102055026,1202660655358,
> > startKey='http://fun.twilightwap.com/rate.asp?joke_id=183&rating=0 wap2 20080102055026',
> > getEndKey()='http://fun.twilightwap.com/rate.asp?joke_id=183&rating=0 wap2 20080102055026').
> >
> > I tried stopping the client application, and restarting it at the point
> > where it failed, with no success. I tried restarting the region server
> > and master server too, also without success.
> >
> > - Marc
> >
> > P.S. Should this discussion be happening in JIRA or here or both?
> >
> > On Mon, 2008-02-11 at 11:27 -0800, stack wrote:
> >
> >> Marc Harris wrote:
> >>
> >>> Logs sent via yousendit.com.
> >>
> >> Thanks for the logs. I took a quick look. Upload seems to be going
> >> along fine until we start getting the WrongRegionException. In issue
> >> HBASE-428, you say your client is single-threaded. Is it thick-headed
> >> too (smile) in that it unrelentingly keeps trying the same row over and
> >> over? (The log seems to have prob. w/ the same row over and over again.)
> >>
> >> Guessing as to what is up, either the client cache of regions is messed
> >> up or the .META. table has become corrupt somehow -- it doesn't have a
> >> list of all regions (perhaps it didn't get a split update or some such).
> >>
> >> If the former, I wonder what would happen if you took your load off,
> >> killed the client, then resumed at the problematic row? If things
> >> started to work again, that would seem to point at a client-side issue.
> >>
> >>> Maybe "re-architect" was not an accurate representation of what I am
> >>> doing. We currently do not have a solution that allows us to add rows to
> >>> our system in arbitrary order and then analyze them, either in order or
> >>> using map-reduce. A year or so ago we tried an RDBMS, and based on that
> >>> experience, and some comments from Doug Cutting, decided that an RDBMS
> >>> had no chance of being able to support this kind of functionality.
> >>>
> >>> In terms of performance parameters, the 200 rows/sec that was achieved
> >>> for the first 500K rows was quite sufficient. I don't have a good answer
> >>> because after all these rows get loaded there will be numerous
> >>> map/reduce jobs that execute on them. I would guess that some vague
> >>> parameters are:
> >>>
> >>> - In 3 days, load 100Gb of data representing 10M "units" split over 3
> >>> tables, each of which is split over 3 column families. Some fraction of
> >>> these "units" will be replacements for existing ones (same key); some
> >>> will be new.
> >>> - Several map-reduce jobs that mostly involve reading the data for each
> >>> "unit" and then writing a few small pieces of data (a few bytes) for
> >>> each "unit". Probably some more interesting maps too, but I don't know
> >>> yet.
> >>> - At least 2 map-reduce jobs that delete units.
> >>
> >> These numbers look reasonable to me. Let's try and make it work.
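[Editor's note: the load figures above imply only a modest ingest rate. A quick back-of-the-envelope check, assuming "100Gb" means 100 gigabytes (decimal); the variable names are illustrative:]

```python
# Back-of-the-envelope check of the quoted load target:
# 10M "units" and ~100 GB loaded over 3 days.
SECONDS_IN_3_DAYS = 3 * 24 * 3600          # 259,200 s
units_per_sec = 10_000_000 / SECONDS_IN_3_DAYS
mb_per_sec = 100_000_000_000 / SECONDS_IN_3_DAYS / 1_000_000

print(f"{units_per_sec:.1f} units/sec")    # ~38.6 units/sec
print(f"{mb_per_sec:.2f} MB/sec")          # ~0.39 MB/sec
```

Even with each unit fanned out over 3 tables, that is roughly 116 rows/sec, comfortably under the 200 rows/sec Marc reports for the first 500K rows.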
> >>
> >>> Am I correct when I say that using 4 region servers will just delay the
> >>> problem by a factor of 4, or have I misunderstood the underlying cause?
> >>
> >> Yes.
> >>
> >> The factor might be > 4, but effectively, if there is an issue using a
> >> single server, then the same issue will arise with N nodes.
> >>
> >> St.Ack
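[Editor's note: the client behaviour Marc describes -- try each row a few times, log the exception, and move on -- can be sketched generically. The names below (`WrongRegionError`, `upload_with_skip`, `upload_row`) are hypothetical stand-ins, not the HBase client API:]

```python
import logging

class WrongRegionError(Exception):
    """Stand-in for HBase's WrongRegionException (illustrative only)."""

MAX_ATTEMPTS = 5  # matches the ~5 tries per row seen in the log

def upload_with_skip(rows, upload_row, max_attempts=MAX_ATTEMPTS):
    """Try each row up to max_attempts times; on repeated failure,
    log it and move on to the next row (as store.log shows)."""
    failed = []
    for key, value in rows:
        for attempt in range(1, max_attempts + 1):
            try:
                upload_row(key, value)
                break  # success; go to next row
            except WrongRegionError as e:
                logging.warning("attempt %d/%d failed for %r: %s",
                                attempt, max_attempts, key, e)
        else:
            failed.append(key)  # retries exhausted; skip this row
    return failed
```

The diagnostic value of this pattern in the thread: if restarting such a client does not clear the failures (as Marc found), the stale state is server-side -- e.g. a corrupt .META. -- rather than the client's cached region map.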
