Yes Nielsen, my biggest mistake was assuming that iptables won't be used in CentOS 7. But I am relieved that I found the solution with the help of all of you. I really appreciate it.
Best Regards,
Said.
________________________________
From: Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk>
Sent: Thursday, July 6, 2017 9:11:54 PM
To: slurm-dev
Subject: [slurm-dev] Re: SLURM ERROR! NEED HELP

Firewall problems, like I suggested initially! Nmap is a great tool for
probing open ports!

The iptables service *must not* be configured on CentOS 7; you *must* use
firewalld. See
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-firewall-for-slurm-daemons
for Slurm configurations.

/Ole

On 07/06/2017 01:22 PM, Said Mohamed Said wrote:
> John and others,
>
> Thank you very much for your support. The problem is finally solved.
>
> After installing nmap, I realized that some ports were blocked even
> with the firewall daemon stopped and disabled. It turned out that
> iptables was on and enabled. After stopping iptables, everything
> worked just fine.
>
> Best Regards,
>
> Said.
>
> ------------------------------------------------------------------------
> *From:* John Hearns <hear...@googlemail.com>
> *Sent:* Thursday, July 6, 2017 6:47:48 PM
> *To:* slurm-dev
> *Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
>
> Said, you are not out of ideas.
>
> I would suggest 'nmap' as a good tool to start with. Install nmap on
> your compute node and see which ports are open on the controller node.
>
> Also, do we have a DNS name resolution problem here?
> I always remember Sun Grid Engine as being notoriously sensitive to
> name resolution, and that was my first question when any SGE problem
> was reported.
> So a couple of questions:
>
> On the controller node and on the compute node run this:
> hostname
> hostname -f
>
> Do the cluster controller node or the compute nodes have more than one
> network interface?
> I bet the cluster controller node does! From the compute node, do an
> nslookup or a dig and see what the COMPUTE NODE thinks are the names
> of both of those interfaces.
>
> Also, as Rajul says - how are you making sure that both controller and
> compute nodes have the same slurm.conf file?
> Actually, if the slurm.conf files are different this will be logged
> when the compute node starts up, but let us check everything.
>
> On 6 July 2017 at 11:37, Said Mohamed Said <said.moha...@oist.jp
> <mailto:said.moha...@oist.jp>> wrote:
>
> Even after reinstalling everything from the beginning, the problem is
> still there. Right now I am out of ideas.
>
> Best Regards,
>
> Said.
>
> ------------------------------------------------------------------------
> *From:* Said Mohamed Said
> *Sent:* Thursday, July 6, 2017 2:23:05 PM
> *To:* slurm-dev
> *Subject:* Re: [slurm-dev] Re: SLURM ERROR! NEED HELP
>
> Thank you all for your suggestions. The only thing I can do for now
> is to uninstall and reinstall from the beginning, and I will use the
> most recent version of Slurm on both nodes.
>
> For Felix who asked, the OS is CentOS 7.3 on both machines.
>
> I will let you know if that solves the issue.
>
> ------------------------------------------------------------------------
> *From:* Rajul Kumar <kumar.r...@husky.neu.edu
> <mailto:kumar.r...@husky.neu.edu>>
> *Sent:* Thursday, July 6, 2017 12:41:51 AM
> *To:* slurm-dev
> *Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
>
> Sorry for the typo.
> It's generally when one of the controller or compute can reach the
> other one but it's *not* happening vice-versa.
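For anyone landing on this thread with the same symptom, a minimal sketch
of the fix described above, assuming the default Slurm ports 6817
(slurmctld) and 6818 (slurmd) and the stock CentOS 7 tools:

  # Check whether the legacy iptables service is active alongside firewalld
  systemctl status iptables
  systemctl status firewalld

  # Stop and disable the iptables service so it cannot silently drop Slurm traffic
  systemctl stop iptables
  systemctl disable iptables

  # Open the Slurm daemon ports in firewalld instead
  firewall-cmd --permanent --add-port=6817-6818/tcp
  firewall-cmd --reload

If slurm.conf sets non-default SlurmctldPort or SlurmdPort values, open
those ports instead.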
>
> On Wed, Jul 5, 2017 at 11:38 AM, Rajul Kumar
> <kumar.r...@husky.neu.edu <mailto:kumar.r...@husky.neu.edu>> wrote:
>
> I came across the same problem sometime back. It's generally
> when one of the controller or compute can reach to other one but
> it's happening vice-versa.
>
> Have a look at the following points:
> - controller and compute can ping each other
> - both share the same slurm.conf
> - slurm.conf has the location of both controller and compute
> - slurm services are running on the compute node when the
>   controller says it's down
> - TCP connections are not being dropped
> - Ports that are to be used for communication are accessible,
>   specifically response ports
> - Check the routing rules, if any
> - Clocks are synced across nodes
> - Hope there isn't any version mismatch, but still have a look
>   (nodes are not recognized across major version differences)
>
> Hope this helps.
>
> Best,
> Rajul
>
> On Wed, Jul 5, 2017 at 10:52 AM, John Hearns
> <hear...@googlemail.com <mailto:hear...@googlemail.com>> wrote:
>
> Said,
> a problem like this always has a simple cause. We share
> your frustration, and several people here have offered help.
> So please do not get discouraged. We have all been in your
> situation!
>
> The only way to handle problems like this is:
> a) start at the beginning and read the manuals and webpages
>    closely
> b) start at the lowest level, i.e. here the network, and do NOT
>    assume that any component is working
> c) look at all the log files closely
> d) start daemon processes in a terminal with any 'verbose'
>    flags set
> e) then move on to more low-level diagnostics, such as tcpdump
>    of network adapters and straces/gstacks of the processes
>
> You have been doing steps a, b and c very well.
> I suggest staying with these - I myself am going for Adam
> Huffman's suggestion of the NTP clock times.
> Are you SURE that on all nodes you have run the 'date'
> command and also 'ntpq -p'?
> Are you SURE the master node and the node OBU-N6 are both
> connecting to an NTP server? 'ntpq -p' will tell you that.
>
> And do not lose heart. This is how we all learn.
>
> On 5 July 2017 at 16:23, Said Mohamed Said
> <said.moha...@oist.jp <mailto:said.moha...@oist.jp>> wrote:
>
> sinfo -R gives "NODE IS NOT RESPONDING"
> ping gives successful results from both nodes
>
> I really cannot figure out what is causing the problem.
>
> Regards,
> Said
>
> ------------------------------------------------------------------------
> *From:* Felix Willenborg <felix.willenb...@uni-oldenburg.de
> <mailto:felix.willenb...@uni-oldenburg.de>>
> *Sent:* Wednesday, July 5, 2017 9:07:05 PM
> *To:* slurm-dev
> *Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
>
> When the nodes change to the down state, what is 'sinfo -R'
> saying? Sometimes it gives you a reason for that.
>
> Best,
> Felix
>
> On 05.07.2017 at 13:16, Said Mohamed Said wrote:
>> Thank you Adam, for NTP I did that as well before
>> posting, but it didn't fix the issue.
>>
>> Regards,
>> Said
>>
>> ------------------------------------------------------------------------
>> *From:* Adam Huffman <adam.huff...@gmail.com>
>> <mailto:adam.huff...@gmail.com>
>> *Sent:* Wednesday, July 5, 2017 8:11:03 PM
>> *To:* slurm-dev
>> *Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
>>
>> I've seen something similar when node clocks were skewed.
>>
>> Worth checking that NTP is running and they're all
>> synchronised.
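A practical follow-up to John's step (d) above: one way to run both
daemons in the foreground with verbose logging, assuming a standard Slurm
installation managed by systemd:

  # On the controller: stop the service and run slurmctld in the foreground
  systemctl stop slurmctld
  slurmctld -D -vvv

  # On the compute node: do the same for slurmd
  systemctl stop slurmd
  slurmd -D -vvv

Name resolution, port, and slurm.conf mismatches usually show up
immediately in this output.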
>>
>> On Wed, Jul 5, 2017 at 12:06 PM, Said Mohamed Said
>> <said.moha...@oist.jp> <mailto:said.moha...@oist.jp> wrote:
>> > Thank you all for the suggestions. I turned off the firewall on
>> > both machines but still no luck. I can confirm that no managed
>> > switch is preventing the nodes from communicating. If you check
>> > the log file, there is communication for about 4 minutes and
>> > then the node state goes down.
>> > Any other ideas?
>> > ________________________________
>> > From: Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk>
>> > <mailto:ole.h.niel...@fysik.dtu.dk>
>> > Sent: Wednesday, July 5, 2017 7:07:15 PM
>> > To: slurm-dev
>> > Subject: [slurm-dev] Re: SLURM ERROR! NEED HELP
>> >
>> > On 07/05/2017 11:40 AM, Felix Willenborg wrote:
>> >> In my network I encountered that managed switches were
>> >> preventing necessary network communication between the nodes,
>> >> on which SLURM relies. You should check if you're using managed
>> >> switches to connect nodes to the network and, if so, whether
>> >> they're blocking communication on Slurm ports.
>> >
>> > Managed switches should permit layer 2 traffic just like
>> > unmanaged switches! We only have managed Ethernet switches, and
>> > they work without problems.
>> >
>> > Perhaps you meant that Ethernet switches may perform some
>> > firewall functions by themselves?
>> >
>> > Firewalls must be off between Slurm compute nodes as well as the
>> > controller host. See
>> > https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-firewall-for-slurm-daemons
>> > <https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-firewall-for-slurm-daemons>
>> >
>> > /Ole
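As a closing note on the firewall theme, the nmap probe that eventually
exposed the blocked ports can be reproduced roughly as follows; the
default Slurm ports are assumed, and 'controller' is a hypothetical
stand-in for the real controller hostname:

  # From the compute node, probe the controller's slurmctld port
  nmap -p 6817 controller

  # From the controller, probe the compute node's slurmd port
  nmap -p 6818 OBU-N6

A port reported as 'filtered' rather than 'open' points at a firewall
(firewalld, or a stray iptables service, as in this thread) dropping the
traffic.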