Done, by the way.

On Mon, 2008-02-11 at 13:54 -0800, stack wrote:
> Mind doing a 'select * from .META.;' in the HQL screen on your master?
> Going by the below, the .META. is corrupt: i.e. restart didn't fix it
> and it's the same row that it's complaining about. Stick the resulting
> page into the issue if you don't mind.
>
> Thanks,
> St.Ack
>
> Marc Harris wrote:
> > The client does not try to upload the same row again and again. The
> > hbase client tries a few times internally, but then if the exception
> > gets out to the client application, it is logged and the application
> > moves on. The client application's log (store.log) actually shows some
> > successes in among the failures.
> >
> > My reading of the log file is not the same as yours. It looks to me as
> > if each row is tried 5 times, throwing WREs each time, before moving on
> > to another row. All the errors do seem to be regarding the same region
> > though
> > (pagefetch,http://fun.twilightwap.com/rate.asp?joke_id=183&rating=0 wap2 20080102055026,1202660655358,
> > startKey='http://fun.twilightwap.com/rate.asp?joke_id=183&rating=0 wap2 20080102055026',
> > getEndKey()='http://fun.twilightwap.com/rate.asp?joke_id=183&rating=0 wap2 20080102055026').
> >
> > I tried stopping the client application, and restarting it at the point
> > where it failed, with no success. I tried restarting the region server
> > and master server too, also without success.
> >
> > - Marc
> >
> > P.S. Should this discussion be happening in JIRA or here or both?
> >
> > On Mon, 2008-02-11 at 11:27 -0800, stack wrote:
> >
> >> Marc Harris wrote:
> >>
> >>> Logs sent via yousendit.com.
> >>
> >> Thanks for the logs. I took a quick look. Upload seems to be going
> >> along fine until we start getting the WrongRegionException. In issue
> >> HBASE-428, you say your client is single-threaded. Is it thick-headed
> >> too (smile) in that it unrelentingly keeps trying the same row over and
> >> over? (The log seems to have prob. w/ the same row over and over again.)
> >>
> >> Guessing as to what is up, either the client cache of regions is messed
> >> up or the .META. table has become corrupt somehow -- it doesn't have a
> >> list of all regions (perhaps it didn't get a split update or some such).
> >>
> >> If the former, I wonder what would happen if you took your load off,
> >> killed the client, then resumed at the problematic row? If things
> >> started to work again, that would seem to point at a client-side issue.
> >>
> >>> Maybe "re-architect" was not an accurate representation of what I am
> >>> doing. We currently do not have a solution that allows us to add rows to
> >>> our system in arbitrary order and then analyze them, either in order or
> >>> using map-reduce. A year or so ago we tried an RDBMS, and based on that
> >>> experience, and some comments from Doug Cutting, decided that an RDBMS
> >>> had no chance of being able to support this kind of functionality.
> >>>
> >>> In terms of performance parameters, the 200 rows/sec that was achieved
> >>> for the first 500K rows was quite sufficient. I don't have a good answer
> >>> because after all these rows get loaded there will be numerous
> >>> map/reduce jobs that execute on them. I would guess that some vague
> >>> parameters are:
> >>>
> >>> - In 3 days, load 100Gb of data representing 10M "units" split over 3
> >>> tables, each of which is split over 3 column families. Some fraction of
> >>> these "units" will be replacements for existing ones (same key); some
> >>> will be new.
> >>> - Several map-reduce jobs that mostly involve reading the data for each
> >>> "unit" and then writing a few small pieces of data (a few bytes) for
> >>> each "unit". Probably some more interesting maps too, but I don't know
> >>> yet.
> >>> - At least 2 map-reduce jobs that delete units.
> >>
> >> These numbers look reasonable to me. Let's try and make it work.
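[Editor's note: the load figures above imply only a modest ingest rate. A quick back-of-the-envelope check, assuming "100Gb" means 100 gigabytes (decimal); the variable names are illustrative:]

```python
# Back-of-the-envelope check of the quoted load target:
# 10M "units" and ~100 GB loaded over 3 days.
SECONDS_IN_3_DAYS = 3 * 24 * 3600          # 259,200 s
units_per_sec = 10_000_000 / SECONDS_IN_3_DAYS
mb_per_sec = 100_000_000_000 / SECONDS_IN_3_DAYS / 1_000_000

print(f"{units_per_sec:.1f} units/sec")    # ~38.6 units/sec
print(f"{mb_per_sec:.2f} MB/sec")          # ~0.39 MB/sec
```

Even with each unit fanned out over 3 tables, that is roughly 116 rows/sec, comfortably under the 200 rows/sec Marc reports for the first 500K rows.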
> >>
> >>> Am I correct when I say that using 4 region servers will just delay the
> >>> problem by a factor of 4, or have I misunderstood the underlying cause?
> >>
> >> Yes.
> >>
> >> The factor might be > 4, but effectively, if there is an issue using a
> >> single server, then the same issue will arise with N nodes.
> >>
> >> St.Ack
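[Editor's note: the client behaviour Marc describes -- try each row a few times, log the exception, and move on -- can be sketched generically. The names below (`WrongRegionError`, `upload_with_skip`, `upload_row`) are hypothetical stand-ins, not the HBase client API:]

```python
import logging

class WrongRegionError(Exception):
    """Stand-in for HBase's WrongRegionException (illustrative only)."""

MAX_ATTEMPTS = 5  # matches the ~5 tries per row seen in the log

def upload_with_skip(rows, upload_row, max_attempts=MAX_ATTEMPTS):
    """Try each row up to max_attempts times; on repeated failure,
    log it and move on to the next row (as store.log shows)."""
    failed = []
    for key, value in rows:
        for attempt in range(1, max_attempts + 1):
            try:
                upload_row(key, value)
                break  # success; go to next row
            except WrongRegionError as e:
                logging.warning("attempt %d/%d failed for %r: %s",
                                attempt, max_attempts, key, e)
        else:
            failed.append(key)  # retries exhausted; skip this row
    return failed
```

The diagnostic value of this pattern in the thread: if restarting such a client does not clear the failures (as Marc found), the stale state is server-side -- e.g. a corrupt .META. -- rather than the client's cached region map.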
