Hi all,
 
I am using Monit to monitor Hadoop processes and restart them automatically
when they fail.
 
From time to time, however, a Hadoop process (e.g., the NameNode) runs with one
PID, say 1111, while its pid file
(/var/run/hadoop-hdfs/hadoop-hdfs-namenode.pid) contains a different value, say
1222.
Monit then assumes that the service is not running and tries to start it with
the configured command "/sbin/service hadoop-hdfs-namenode start".
The problem is that the NameNode is already running (with a PID different from
the one in the pid file).
The service command therefore fails, but it still rewrites the pid file, so the
number in this file just keeps growing with every attempt...
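
The mismatch described above can be checked with a few lines of Python. This is
only a minimal sketch: the helper names (read_pid, is_alive, pid_file_is_stale)
are made up for illustration, and the check only tells whether the PID stored in
the pid file belongs to some live process, not whether that process is really
the NameNode.

```python
import os


def read_pid(pid_file):
    """Return the PID stored in pid_file, or None if it is missing or invalid."""
    try:
        with open(pid_file) as f:
            pid = int(f.read().strip())
    except (OSError, ValueError):
        return None
    # Reject 0 and negative values: os.kill() treats them as process groups.
    return pid if pid > 0 else None


def is_alive(pid):
    """Return True if a process with this PID exists (signal 0 sends nothing)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def pid_file_is_stale(pid_file):
    """Return True if the pid file does not point at a live process."""
    pid = read_pid(pid_file)
    return pid is None or not is_alive(pid)
```

Calling pid_file_is_stale("/var/run/hadoop-hdfs/hadoop-hdfs-namenode.pid") would
report True in the situation above if nothing is running under the PID that the
file contains.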
 
My guess is that Monit, after deciding that the NameNode is not running,
launches it several times in quick succession; the first instance comes up, but
a later attempt overwrites the pid file.
The launch script also does not seem to have any locking to protect the pid
file.
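
If the race guessed at above is really the cause, one way to add the missing
lock would be to wrap the start command so that only one attempt runs at a time.
The following is a hedged sketch, not something taken from the Hadoop init
scripts: run_exclusively and the lock file path in the usage note are
hypothetical names, and it relies on flock(2) via Python's fcntl module, so it
assumes a Unix system.

```python
import fcntl
import subprocess


def run_exclusively(lock_path, command):
    """Run command while holding an exclusive lock on lock_path.

    Returns the command's exit code, or None if another caller already
    holds the lock (in which case the command is not run at all).
    """
    # Append mode creates the lock file if needed without truncating it.
    with open(lock_path, "a") as lock_file:
        try:
            # LOCK_NB makes the call fail immediately instead of waiting.
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return None  # another start attempt is still in progress
        # The lock is released when lock_file is closed on leaving the block.
        return subprocess.call(command)
```

For example, Monit could be pointed at a small script that calls
run_exclusively("/var/lock/namenode-start.lock", ["/sbin/service",
"hadoop-hdfs-namenode", "start"]), where the lock path is only an assumption.
A second start attempt arriving while the first is still running would then be
skipped instead of overwriting the pid file.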
 
Has anyone experienced a similar problem?
For now, my workaround is to stop the process directly (kill -15 pid), since
"service ... stop" does not work either.
 
Best wishes,



 
CHO, Han-Cheol (趙漢哲), Ph.D.
Data Science Lab. / Data scientist
TEL: 03-5155-1160 (department main line)   FAX: 03-5155-3307
27F Shibuya Hikarie, 2-21-1 Shibuya, Shibuya-ku, Tokyo 150-8510, Japan
Email: hancheol....@nhn-playart.com



 
