Yes Nielsen, my biggest mistake was assuming that iptables won't be used in CentOS 7. But I am relieved that I found the solution with the help of all of you. I really appreciate it.
Best Regards,
Said.
________________________________
From: Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk>
Sent: Thursday, July 6, 2017 9:11:54 PM
To: slurm-dev
Subject: [slurm-dev] Re: SLURM ERROR! NEED HELP

Firewall problems, like I suggested initially! Nmap is a great tool for
probing open ports!

The iptables service *must not* be configured on CentOS 7; you *must* use
firewalld. See
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-firewall-for-slurm-daemons
for Slurm configurations.

/Ole

On 07/06/2017 01:22 PM, Said Mohamed Said wrote:
> John and others,
>
> Thank you very much for your support. The problem is finally solved.
>
> After installing nmap, I realized that some ports were blocked even
> with the firewall daemon stopped and disabled. It turned out that
> iptables was on and enabled. After stopping iptables, everything
> worked just fine.
>
> Best Regards,
>
> Said.
>
> ------------------------------------------------------------------------
> *From:* John Hearns <hear...@googlemail.com>
> *Sent:* Thursday, July 6, 2017 6:47:48 PM
> *To:* slurm-dev
> *Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
>
> Said, you are not out of ideas.
>
> I would suggest 'nmap' as a good tool to start with. Install nmap on
> your compute node and see which ports are open on the controller node.
>
> Also, do we have a DNS name resolution problem here?
> I always remember Sun Grid Engine as being notoriously sensitive to
> name resolution, and that was my first question when any SGE problem
> was reported.
> So a couple of questions:
>
> On the controller node and on the compute node run this:
> hostname
> hostname -f
>
> Do the cluster controller node or the compute nodes have more than one
> network interface?
> I bet the cluster controller node does! From the compute node, do an
> nslookup or a dig and see what the COMPUTE NODE thinks are the names
> of both of those interfaces.
>
> Also, as Rajul says - how are you making sure that both controller and
> compute nodes have the same slurm.conf file?
> Actually, if the slurm.conf files are different this will be logged
> when the compute node starts up, but let us check everything.
>
> On 6 July 2017 at 11:37, Said Mohamed Said <said.moha...@oist.jp
> <mailto:said.moha...@oist.jp>> wrote:
>
> Even after reinstalling everything from the beginning, the problem is
> still there. Right now I am out of ideas.
>
> Best Regards,
>
> Said.
>
> ------------------------------------------------------------------------
> *From:* Said Mohamed Said
> *Sent:* Thursday, July 6, 2017 2:23:05 PM
> *To:* slurm-dev
> *Subject:* Re: [slurm-dev] Re: SLURM ERROR! NEED HELP
>
> Thank you all for your suggestions. The only thing I can do for now
> is to uninstall and reinstall from the beginning, and I will use the
> most recent version of Slurm on both nodes.
>
> For Felix who asked, the OS is CentOS 7.3 on both machines.
>
> I will let you know if that solves the issue.
>
> ------------------------------------------------------------------------
> *From:* Rajul Kumar <kumar.r...@husky.neu.edu
> <mailto:kumar.r...@husky.neu.edu>>
> *Sent:* Thursday, July 6, 2017 12:41:51 AM
> *To:* slurm-dev
> *Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
>
> Sorry for the typo.
> It's generally when one of the controller or compute can reach the
> other one but it's *not* happening vice-versa.
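For anyone landing on this thread with the same symptom, a minimal sketch
of the fix described above, assuming the default Slurm ports 6817
(slurmctld) and 6818 (slurmd) and the stock CentOS 7 tools:

  # Check whether the legacy iptables service is active alongside firewalld
  systemctl status iptables
  systemctl status firewalld

  # Stop and disable the iptables service so it cannot silently drop Slurm traffic
  systemctl stop iptables
  systemctl disable iptables

  # Open the Slurm daemon ports in firewalld instead
  firewall-cmd --permanent --add-port=6817-6818/tcp
  firewall-cmd --reload

If slurm.conf sets non-default SlurmctldPort or SlurmdPort values, open
those ports instead.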
>
> On Wed, Jul 5, 2017 at 11:38 AM, Rajul Kumar
> <kumar.r...@husky.neu.edu <mailto:kumar.r...@husky.neu.edu>> wrote:
>
> I came across the same problem sometime back. It's generally
> when one of the controller or compute can reach to other one but
> it's happening vice-versa.
>
> Have a look at the following points:
> - controller and compute can ping each other
> - both share the same slurm.conf
> - slurm.conf has the location of both controller and compute
> - slurm services are running on the compute node when the
>   controller says it's down
> - TCP connections are not being dropped
> - Ports that are to be used for communication are accessible,
>   specifically response ports
> - Check the routing rules, if any
> - Clocks are synced across nodes
> - Hope there isn't any version mismatch, but still have a look
>   (nodes are not recognized across major version differences)
>
> Hope this helps.
>
> Best,
> Rajul
>
> On Wed, Jul 5, 2017 at 10:52 AM, John Hearns
> <hear...@googlemail.com <mailto:hear...@googlemail.com>> wrote:
>
> Said,
> a problem like this always has a simple cause. We share
> your frustration, and several people here have offered help.
> So please do not get discouraged. We have all been in your
> situation!
>
> The only way to handle problems like this is:
> a) start at the beginning and read the manuals and webpages
>    closely
> b) start at the lowest level, i.e. here the network, and do NOT
>    assume that any component is working
> c) look at all the log files closely
> d) start daemon processes in a terminal with any 'verbose'
>    flags set
> e) then move on to more low-level diagnostics, such as tcpdump
>    of network adapters and straces/gstacks of the processes
>
> You have been doing steps a, b and c very well.
> I suggest staying with these - I myself am going for Adam
> Huffman's suggestion of the NTP clock times.
> Are you SURE that on all nodes you have run the 'date'
> command and also 'ntpq -p'?
> Are you SURE the master node and the node OBU-N6 are both
> connecting to an NTP server? 'ntpq -p' will tell you that.
>
> And do not lose heart. This is how we all learn.
>
> On 5 July 2017 at 16:23, Said Mohamed Said
> <said.moha...@oist.jp <mailto:said.moha...@oist.jp>> wrote:
>
> sinfo -R gives "NODE IS NOT RESPONDING"
> ping gives successful results from both nodes
>
> I really cannot figure out what is causing the problem.
>
> Regards,
> Said
>
> ------------------------------------------------------------------------
> *From:* Felix Willenborg <felix.willenb...@uni-oldenburg.de
> <mailto:felix.willenb...@uni-oldenburg.de>>
> *Sent:* Wednesday, July 5, 2017 9:07:05 PM
> *To:* slurm-dev
> *Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
>
> When the nodes change to the down state, what is 'sinfo -R'
> saying? Sometimes it gives you a reason for that.
>
> Best,
> Felix
>
> On 05.07.2017 at 13:16, Said Mohamed Said wrote:
>> Thank you Adam, for NTP I did that as well before
>> posting, but it didn't fix the issue.
>>
>> Regards,
>> Said
>>
>> ------------------------------------------------------------------------
>> *From:* Adam Huffman <adam.huff...@gmail.com>
>> <mailto:adam.huff...@gmail.com>
>> *Sent:* Wednesday, July 5, 2017 8:11:03 PM
>> *To:* slurm-dev
>> *Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
>>
>> I've seen something similar when node clocks were skewed.
>>
>> Worth checking that NTP is running and they're all
>> synchronised.
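A practical follow-up to John's step (d) above: one way to run both
daemons in the foreground with verbose logging, assuming a standard Slurm
installation managed by systemd:

  # On the controller: stop the service and run slurmctld in the foreground
  systemctl stop slurmctld
  slurmctld -D -vvv

  # On the compute node: do the same for slurmd
  systemctl stop slurmd
  slurmd -D -vvv

Name resolution, port, and slurm.conf mismatches usually show up
immediately in this output.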
>>
>> On Wed, Jul 5, 2017 at 12:06 PM, Said Mohamed Said
>> <said.moha...@oist.jp> <mailto:said.moha...@oist.jp> wrote:
>> > Thank you all for the suggestions. I turned off the firewall on
>> > both machines but still no luck. I can confirm that no managed
>> > switch is preventing the nodes from communicating. If you check
>> > the log file, there is communication for about 4 minutes and
>> > then the node state goes down.
>> > Any other ideas?
>> > ________________________________
>> > From: Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk>
>> > <mailto:ole.h.niel...@fysik.dtu.dk>
>> > Sent: Wednesday, July 5, 2017 7:07:15 PM
>> > To: slurm-dev
>> > Subject: [slurm-dev] Re: SLURM ERROR! NEED HELP
>> >
>> > On 07/05/2017 11:40 AM, Felix Willenborg wrote:
>> >> In my network I encountered that managed switches were
>> >> preventing necessary network communication between the nodes,
>> >> on which SLURM relies. You should check if you're using managed
>> >> switches to connect nodes to the network and, if so, whether
>> >> they're blocking communication on Slurm ports.
>> >
>> > Managed switches should permit layer 2 traffic just like
>> > unmanaged switches! We only have managed Ethernet switches, and
>> > they work without problems.
>> >
>> > Perhaps you meant that Ethernet switches may perform some
>> > firewall functions by themselves?
>> >
>> > Firewalls must be off between Slurm compute nodes as well as the
>> > controller host. See
>> > https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-firewall-for-slurm-daemons
>> > <https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-firewall-for-slurm-daemons>
>> >
>> > /Ole
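As a closing note on the firewall theme, the nmap probe that eventually
exposed the blocked ports can be reproduced roughly as follows; the
default Slurm ports are assumed, and 'controller' is a hypothetical
stand-in for the real controller hostname:

  # From the compute node, probe the controller's slurmctld port
  nmap -p 6817 controller

  # From the controller, probe the compute node's slurmd port
  nmap -p 6818 OBU-N6

A port reported as 'filtered' rather than 'open' points at a firewall
(firewalld, or a stray iptables service, as in this thread) dropping the
traffic.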