Hi all,

I am using Monit to monitor Hadoop processes and restart them automatically when they fail. From time to time, however, a Hadoop process (e.g., the NameNode) is running with one PID, say 1111, while its pid file (/var/run/hadoop-hdfs/hadoop-hdfs-namenode.pid) contains a different value, say 1222.

Monit then assumes that the service is not running and tries to start it with the configured command, "/sbin/service hadoop-hdfs-namenode start". The problem is that the NameNode is already running (with a PID different from the one in the pid file). The service command therefore fails, but it still rewrites the pid file, so the number in the file keeps growing with every attempt.

My guess is that Monit, once it decides the NameNode is not running, launches it several times in quick succession: the first instance comes up, but the second one overwrites the pid file. The launch script also does not seem to use any locking to protect the pid file.

Has anyone experienced a similar problem? As a temporary workaround, I stop the process manually (kill -15 pid), since "service ... stop" does not work either.

Best wishes,
趙漢哲 (CHO, Han-Cheol, Ph.D.)
Data Science Lab. / Data scientist
TEL: 03-5155-1160 (department main line)
FAX: 03-5155-3307
Shibuya Hikarie 27F, 2-21-1 Shibuya, Shibuya-ku, Tokyo 150-8510
Email: hancheol....@nhn-playart.com
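
P.S. For anyone trying to reproduce this, the stale pid file condition described above could be checked with something like the following Python sketch. It is only an illustration: the function names (read_pid, pid_is_alive, pid_file_is_stale) are hypothetical and not part of Hadoop or Monit, and it assumes a Linux/POSIX host. It only reports whether the PID recorded in the file belongs to a live process; it does not find the PID of the NameNode that is actually running.

```python
import os


def read_pid(pid_file):
    """Return the PID stored in pid_file, or None if it is missing or invalid."""
    try:
        with open(pid_file) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def pid_is_alive(pid):
    """Return True if a process with this PID currently exists."""
    if pid is None or pid <= 0:
        return False
    try:
        # Signal 0 performs the existence/permission check without
        # actually delivering a signal to the process.
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True


def pid_file_is_stale(pid_file):
    """Return True if pid_file is missing, unreadable, or names a dead PID."""
    return not pid_is_alive(read_pid(pid_file))


if __name__ == "__main__":
    # Path taken from the message above; adjust for other daemons.
    path = "/var/run/hadoop-hdfs/hadoop-hdfs-namenode.pid"
    print(path, "is stale" if pid_file_is_stale(path) else "points to a live process")
```

If the file is reported as stale while a NameNode process is visibly running, that matches the situation described above, where the pid file no longer names the process that is actually running.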