It could also have been the so-called OOM Killer (the Linux kernel's out-of-memory killer) that killed your RS. See
http://www.oracle.com/technetwork/articles/servers-storage-dev/oom-killer-1911807.html
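
To tell the two cases apart, a rough sketch along these lines might help. It is only an illustration under stated assumptions: the kernel message patterns are my assumption (the wording differs between kernel versions), the function names are made up for this example, and the location of the kernel log varies by distribution. The ".out" marker is the java.lang.OutOfMemoryError line quoted further down in this thread. A JVM killed by the kernel gets SIGKILL and cannot log anything itself, so in that case only the kernel log would show it.

```python
import re
import sys

# ASSUMPTION: typical kernel OOM-killer wording; it varies by kernel version.
KERNEL_OOM_PATTERNS = [
    re.compile(r"Out of memory: Kill(?:ed)? process (\d+) \(([^)]+)\)"),
    re.compile(r"Killed process (\d+) \(([^)]+)\)"),
]

# Marker the JVM prints to stdout (the RegionServer ".out" file) on a heap OOM.
JVM_OOM_MARKER = "java.lang.OutOfMemoryError"


def find_kernel_oom_kills(log_text):
    """Return sorted, de-duplicated (pid, process_name) pairs the kernel reported killing."""
    hits = set()
    for line in log_text.splitlines():
        for pattern in KERNEL_OOM_PATTERNS:
            match = pattern.search(line)
            if match:
                hits.add((int(match.group(1)), match.group(2)))
                break  # one hit per line is enough
    return sorted(hits)


def has_jvm_oom(out_text):
    """Return True if the daemon's stdout capture mentions a JVM OutOfMemoryError."""
    return any(JVM_OOM_MARKER in line for line in out_text.splitlines())


def read_text(path):
    """Read a log file leniently; log files may contain odd bytes."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return handle.read()


# Hypothetical usage: python check_oom.py <kernel log or saved dmesg output> <regionserver .out file>
if __name__ == "__main__" and len(sys.argv) == 3:
    kills = find_kernel_oom_kills(read_text(sys.argv[1]))
    jvm_oom = has_jvm_oom(read_text(sys.argv[2]))
    print("kernel OOM-killer victims:", kills)
    print("JVM OutOfMemoryError in .out file:", jvm_oom)
```

If the ".out" file shows the OutOfMemoryError, the heap was exhausted inside the JVM. If only the kernel log shows a kill for the java process, the machine as a whole ran out of memory, and raising the heap size may make that worse rather than better.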

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/

On Tue, Dec 2, 2014 at 1:37 AM, Liu, Ming (HPIT-GADSC) <ming.l...@hp.com> wrote:

> Thank you both!
>
> Yes, I can see there is the '.out' file with clear proof that the process
> was 'killed'. So we can prove this issue now!
> And it is also true that we must rely on the JVM itself for proof that the
> kill operation is due to OOM.
> Thank you both, this is a very good learning.
>
> Thanks,
> Ming
>
> -----Original Message-----
> From: Bharath Vissapragada [mailto:bhara...@cloudera.com]
> Sent: Tuesday, December 02, 2014 2:00 PM
> To: hbase-user
> Subject: Re: how to tell there is a OOM in regionserver
>
> I agree with Otis' response. Adding a few more details: there is a ".out"
> file in the logs/ directory, which is the stdout for each of these daemons,
> and in case of an OOM crash it prints something like this:
>
> # java.lang.OutOfMemoryError: Java heap space
> # -XX:OnOutOfMemoryError="kill -9 %p"
> # Executing /bin/sh -c "kill -9 <pid>"...
>
> On Tue, Dec 2, 2014 at 11:06 AM, Otis Gospodnetic <
> otis.gospodne...@gmail.com> wrote:
>
> > Hi Ming,
> >
> > 1) There typically is an OOM message from the JVM itself.
> >
> > 2) I would monitor the server instead of relying on log messages
> > mentioning OOMs. For example, in SPM <http://sematext.com/spm/> we
> > have "heartbeat alerts" that tell us when we stop hearing from
> > RegionServers and other types of servers. It also helps when servers
> > simply die for reasons other than OOM.
> >
> > 3) You could (should?) monitor individual memory pools and possibly
> > set alerts or anomaly detection on those. If you have that, and there
> > was an OOM, you will typically see one of the memory pools approach
> > 100% utilization. I personally really like this report in SPM because
> > it gives a bit more insight than just "heap size/utilization". So I'd
> > point the admin to this sort of monitoring report.
> >
> > 4) High GC counts/time, or a jump in those metrics, typically followed
> > by a jump in CPU usage, is what often precedes OOMs.
> >
> > Otis
> >
> > On Tue, Dec 2, 2014 at 12:22 AM, Liu, Ming (HPIT-GADSC)
> > <ming.l...@hp.com> wrote:
> >
> > > Hi, all,
> > >
> > > Recently, one of our HBase 0.98.5 instances ran into an issue: when
> > > running some specific workload, all region servers suddenly shut down
> > > at the same time, but the master is still running. When I check the
> > > master log, I can see messages like
> > > 2014-12-01 08:28:11,072 DEBUG [main-EventThread] master.ServerManager:
> > > Added=n008.cluster,60020,1417413986550 to dead servers, submitted shutdown handler to be executed meta=false
> > > And on n008, in the regionserver log file, there is no ERROR message;
> > > the last log entry looks very much like a ZooKeeper startup message.
> > > The log just stopped with that last ZooKeeper startup message, and the
> > > Region Server process was gone when we checked with 'jps'.
> > >
> > > We then increased the heap size of the regionserver, and it works
> > > fine. RegionServers no longer disappear. So we suspect there was an
> > > Out Of Memory issue and the region server processes were killed. But
> > > my questions are:
> > >
> > > 1. What log message will indicate there is an OOM? Since the region
> > > server is killed with 'kill -9', I think there is no message that can
> > > tell this.
> > >
> > > 2. If there is no typical log message about OOM, then how can an
> > > admin make sure a region server OOM happened? We just guess, but
> > > cannot make sure. We hope there is a method to tell for sure that an
> > > OOM occurred.
> > >
> > > 3. Does the ZooKeeper message appear every time with a RegionServer
> > > OOM (if it is an OOM)? Or is it just a random event in our system?
> > >
> > > So in sum, I want to know what the typical clue is that lets people
> > > confirm there is an OOM issue in an HBase region server.
> > >
> > > Thank you,
> > > Ming
>
> --
> Bharath Vissapragada
> <http://www.cloudera.com>