Hi Stack,
> I've not seen it before. Exception should note the
> file it was trying to read from I'd say at a minimum.
> Looks like failure trying to read in MapFile(SequenceFile)
> content. And you've not seen it since the restart?
> (Would be odd that a problematic file would heal itself).
It is odd that the file problem was "healed" automatically.
I'm not sure what to think exactly. Maybe it was a log
file, and so the damaged portion was simply skipped during
recovery/restart? Or maybe it was not truly a file problem
at all. I concur that the exception should include the file
name, so that better failure analysis is possible.
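To illustrate, something like the sketch below in the read
path would do. The class and method names here are made up
for illustration (not actual HBase code); the point is just
to wrap the low-level failure and carry the file path along:

    import java.io.IOException;

    // Illustrative only: readValue/doRead are hypothetical
    // stand-ins, not actual HBase methods.
    public class ReadWithFileContext {
      static byte[] readValue(String storeFilePath) throws IOException {
        try {
          // stand-in for the MapFile/SequenceFile read
          return doRead(storeFilePath);
        } catch (IOException e) {
          // carry the file name so we know which file to inspect
          IOException wrapped = new IOException(
              "Failed reading " + storeFilePath + ": " + e.getMessage());
          wrapped.initCause(e);
          throw wrapped;
        }
      }

      private static byte[] doRead(String path) throws IOException {
        // placeholder failure so the example runs end to end
        throw new IOException("simulated read failure");
      }

      public static void main(String[] args) {
        try {
          readValue("/data/hbase/example/store/file"); // made-up path
        } catch (IOException e) {
          e.printStackTrace();
        }
      }
    }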
> What about the files you made when crawler had no
> upper-bound on sizes pulled down? Are they still in your
> hbase?
>
> Disabling compression brought on a bunch of splits but
> otherwise, it seems to be working?
What I did was run 'hadoop fs -rmr /data/hbase' and start
over, without compression or blockcache in the schema. :-)
At least right now, data loss like that is only a temporary
inconvenience. That won't be the case much longer.
Also, now I have a file size limit in place on the crawler.
(Re: hbase-writer patch #6.)
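For the record, recreating the table with compression off
and the block cache disabled is only a few lines with the
HBase Java client. Here is a rough sketch; the table and
family names are placeholders, and the exact class and
method names differ between client versions, so take it as
the shape of the thing rather than something to paste in
verbatim:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.io.hfile.Compression;

    // Written against a later client API than the one in this
    // thread; adjust names for the version actually deployed.
    public class CreatePlainTable {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        // "content" is a placeholder family name, not the real schema
        HColumnDescriptor family = new HColumnDescriptor("content");
        family.setCompressionType(Compression.Algorithm.NONE); // no compression
        family.setBlockCacheEnabled(false);                    // block cache off

        HTableDescriptor table = new HTableDescriptor("crawl"); // placeholder name
        table.addFamily(family);
        admin.createTable(table);
      }
    }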
I am still seeing OOME take down region servers. Last night
there were 5 failures in an 8-hour window. With the
exception of the IndexOutOfBounds incident, none of the
failures have needed manual intervention for recovery, but
I suspect that is only by luck. With 2GB of heap the region
servers have at least been kind enough to go down on OOME.
:-) I'm going to start collecting logs and heap dumps and
see if I can find something in common therein.
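For the heap dumps I will most likely just restart the
region servers with the Sun JVM's dump-on-OOME options,
something like

    -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/hbase

added to the JVM options in hbase-env.sh (assuming our
scripts pass extra options through), and then pick over the
.hprof files after each crash. The dump path above is just
an example.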
- Andy