It could also have been the Linux kernel's so-called OOM Killer that killed your RegionServer.  See
http://www.oracle.com/technetwork/articles/servers-storage-dev/oom-killer-1911807.html
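
A rough way to check for this, assuming a Linux host whose kernel messages you can read (for example the output of `dmesg` or a syslog file): when the OOM Killer fires, the kernel typically logs a line of the form "Killed process <pid> (<name>)". The exact wording varies between kernel versions, so the pattern below is an assumption, and `find_oom_kills` is a hypothetical helper name, not part of HBase or any library.

```python
import re

# Assumed pattern: kernels typically log "Killed process <pid> (<name>)"
# when the OOM Killer fires; the exact wording varies by kernel version.
KILLED_RE = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def find_oom_kills(lines, process_name="java"):
    """Return (pid, name) pairs for OOM Killer victims named process_name."""
    hits = []
    for line in lines:
        match = KILLED_RE.search(line)
        if match and match.group(2) == process_name:
            hits.append((int(match.group(1)), match.group(2)))
    return hits
```

You would feed it the kernel log split into lines and compare any reported pid with the RegionServer's pid. An empty result only means no matching line was found in the text you supplied, not that the OOM Killer was not involved.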

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


On Tue, Dec 2, 2014 at 1:37 AM, Liu, Ming (HPIT-GADSC) <ming.l...@hp.com>
wrote:

> Thank you both!
>
> Yes, I can see the '.out' file, which contains clear proof that the process
> was 'killed'. So we can prove this issue now!
> It is also true that we must rely on the JVM itself for proof that the
> kill operation was due to OOM.
> Thank you both, this was a very good lesson.
>
> Thanks,
> Ming
>
> -----Original Message-----
> From: Bharath Vissapragada [mailto:bhara...@cloudera.com]
> Sent: Tuesday, December 02, 2014 2:00 PM
> To: hbase-user
> Subject: Re: how to tell there is a OOM in regionserver
>
> I agree with Otis' response. Adding a few more details: there is a ".out"
> file in the logs/ directory that holds the stdout of each of these daemons,
> and in case of an OOM crash it prints something like this:
>
> # java.lang.OutOfMemoryError: Java heap space
>
> # -XX:OnOutOfMemoryError="kill -9 %p"
>
> #   Executing /bin/sh -c "kill -9 <pid>"...
>
>
>
> On Tue, Dec 2, 2014 at 11:06 AM, Otis Gospodnetic <
> otis.gospodne...@gmail.com> wrote:
>
> > Hi Ming,
> >
> > 1) There typically is an OOM message from the JVM itself
> >
> > 2) I would monitor the server instead of relying on log messages
> > mentioning OOMs.  For example, in SPM <http://sematext.com/spm/> we
> > have "heartbeat alerts" that tell us when we stop hearing from
> > RegionServers and other types of servers.  It also helps when servers
> > simply die for reasons other than OOM.
> >
> > 3) You could (should?) monitor individual memory pools and possibly
> > set alerts or anomaly detection on those.  If you have that, if there
> > was an OOM, you will typically see one of the memory pools approach
> > 100% utilization.  I personally really like this report in SPM because
> > it gives a bit more insight than just "heap size/utilization".  So I'd
> > point the admin to this sort of monitoring report.
> >
> > 4) High GC counts/time, or a jump in those metrics, typically followed
> > by a jump in CPU usage, is what often precedes OOMs.
> >
> > Otis
> >
> >
> > On Tue, Dec 2, 2014 at 12:22 AM, Liu, Ming (HPIT-GADSC)
> > <ming.l...@hp.com>
> > wrote:
> >
> > > Hi, all,
> > >
> > > Recently, one of our HBase 0.98.5 instances ran into an issue: when
> > > running a specific workload, all region servers suddenly shut down at
> > > the same time, but the master keeps running. In the master log I can
> > > see messages like
> > > 2014-12-01 08:28:11,072 DEBUG [main-EventThread] master.ServerManager: Added=n008.cluster,60020,1417413986550 to dead servers, submitted shutdown handler to be executed meta=false
> > > On n008, the regionserver log file contains no ERROR message; the
> > > last log entry looks very much like a ZooKeeper startup message. The
> > > log just stops with that last ZooKeeper startup message, and the
> > > Region Server process was gone when we checked with 'jps'.
> > >
> > > We then increased the heap size of the regionserver, and it works fine:
> > > the RegionServer no longer disappears. So we suspect there was an Out Of
> > > Memory issue that got the region server processes killed. But my
> > > questions are:
> > >
> > > 1.       What log message will indicate there is an OOM? Since the region
> > > server is killed with 'kill -9', I think no message can tell us this.
> > >
> > > 2.       If there is no typical log message about OOM, then how can an
> > > admin make sure that a region server OOM happened? We can only
> > > guess, but cannot be sure. We hope there is a method to tell for
> > > sure that an OOM occurred.
> > >
> > > 3.       Does the ZooKeeper message appear every time with a RegionServer
> > > OOM (if it is an OOM)? Or is it just a random event in our system?
> > >
> > > So in sum, I want to know what the typical clue is by which people can
> > > tell for sure that there is an OOM issue in an HBase region server.
> > >
> > > Thank you,
> > > Ming
> > >
> >
>
>
>
> --
> Bharath Vissapragada
> <http://www.cloudera.com>
>
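
For the '.out' check discussed in the quoted thread above, a minimal Python sketch along these lines could flag the relevant lines. The marker strings are taken from the sample output quoted in the thread; the function name `scan_out_lines` is hypothetical, and other JVM versions or settings may print different text.

```python
# Markers taken from the sample ".out" content quoted in this thread.
MARKERS = (
    "java.lang.OutOfMemoryError",
    "-XX:OnOutOfMemoryError",
    'Executing /bin/sh -c "kill -9',
)


def scan_out_lines(lines):
    """Return (line_number, text) for every line containing an OOM marker."""
    found = []
    for number, line in enumerate(lines, start=1):
        if any(marker in line for marker in MARKERS):
            found.append((number, line.strip()))
    return found
```

It accepts any iterable of lines, so an open file handle for the RegionServer's '.out' file under logs/ can be passed directly; the exact file name depends on the installation.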
