Hi Richard,

I think this is a torque+maui configuration issue on your cluster. Can you
please make sure that torque and maui are configured properly? The torque and
maui admin manuals should be easy to find online.

One thing I would try is to watch which log messages are generated on the
server and client sides when a new job is submitted. That should give plenty
of hints about your problem.

Regards,

--
- DongInn

> On May 31, 2016, at 8:40 PM, Richard Young <[email protected]> wrote:
>
> Lahaye
> - No I can't see the ganglia web interface on either the public or private
> interfaces, it says "you have no permission"
> - the admin node is setup as a forwarding dns server and lookups seem to work
> correctly
> - the firewall/iptables services have been stopped, with on an iptables rule
> set from the command line to forward and NAT traffic
> - nscd cache has been turned off
> - munge is running
> - torque/maui packages did get updated, configurations have been checked to
> make certain they were the same as before the update.
>
> Thanks
>
> ---------------------------------------------------------------------
> Richard A. Young
> ICT Services
> Email: [email protected] Phone: (07) 46315557
> Mob: 0437544370 Fax: (07) 46312798
> ---------------------------------------------------------------------
>
> -----Original Message-----
> From: LAHAYE Olivier [mailto:[email protected]]
> Sent: Tuesday, 31 May 2016 6:12 PM
> To: [email protected]
> Subject: Re: [Oscar-users] Jobs not running on reconfigured cluster
>
> Hi Richard,
>
> - Can you see ganglia web interface?
> - Are you using a DNS for your cluster?
> - Are firewalld / iptables services stopped?
> - Is nscd cache reset?
> - is munge running?
> - I'm not using torque/maui anymore, so I can't check on my side to see if
> there are some specific config to check...
> - did the torque / maui packages get updated during the process?
>
> Olivier.
> --
> Olivier LAHAYE
> CEA DRT/LIST/DIR
>
> ________________________________________
> From: Richard Young [[email protected]]
> Sent: Tuesday, 31 May 2016 06:29
> To: '[email protected]'
> Subject: Re: [Oscar-users] Jobs not running on reconfigured cluster
>
> DongInn
> Did check these before but I re-checked as below:
> 1. /etc/hosts are the same across the cluster.
> 2. can ssh to a node and back without any problems or password. The
> known_hosts file has been updated and copied across the cluster.
> 3. checked nagios/nrpe and it is setup to allow the admin node to
> collect details.
> 4. ganglia/gmond is setup to talk to the admin node.
> 5. pbs_server and maui on the admin have been restarted with no reported
> errors in the log files.
> 6. pbs_mom on the nodes has been restarted with no reported errors in
> the log files.
> 7. a search through /etc and /var/lib/torque for the ip-address of the
> server doesn't find anything other old log entries.
> 8. /etc/dhcp/dhcpd.conf has been updated.
> 9. /etc/ntp.conf has been updated across the cluster.
>
> Thanks
>
> ---------------------------------------------------------------------
> Richard A. Young
> ICT Services
> Email: [email protected] Phone: (07) 46315557
> Mob: 0437544370 Fax: (07) 46312798
> ---------------------------------------------------------------------
>
> -----Original Message-----
> From: Kim, DongInn [mailto:[email protected]]
> Sent: Tuesday, 31 May 2016 12:05 PM
> To: Users OSCAR
> Subject: Re: [Oscar-users] Jobs not running on reconfigured cluster
>
> Hi Richard,
>
> I would like to double check the following items if I were you.
>
> 1. /etc/hosts, ssh keys, nagios/nrpe, gmetad/gmond are all synced through all
> the nodes.
> 2. Make sure that the root user can ssh into all the nodes back and forth
> without password.
> 3. All the daemons of the job submission are running on all the nodes:
> (torque-server, torque-mom in the head node and torque-mom in the client
> nodes and maui on the head node)
> I assume that you are using torque as RM and maui as a scheduler.
>
> Regards,
>
> --
> - DongInn
>
>> On May 30, 2016, at 7:25 PM, Richard Young <[email protected]> wrote:
>>
>> I was hoping somebody would be able to help me with the following problem.
>>
>> Recently I have applied updates and done some reconfiguration on a RHEL6.8
>> cluster running Oscar. The major change was changing the ipaddress of the
>> oscar_server, this was required because changes to the network structure.
>> The ipaddress has been applied to /etc/hosts, ssh keys, nagios/nrpe,
>> gmetad/gmond etc. However, I have missed something because no jobs will now
>> run on the cluster. The jobs basically sit in the queue and then get
>> cancelled because they have hit their walltime.
>>
>> Has anybody come across this problem before and be able to supply some
>> insight into how to fix the problem(s).
>>
>> Thanks
>>
>> ---------------------------------------------------------------------
>> Richard A. Young
>> ICT Services
>> HPC Systems Engineer
>> University of Southern Queensland
>> Toowoomba, Queensland 4350
>> Australia
>> Email: [email protected] Phone: (07) 46315557
>> Mob: 0437544370 Fax: (07) 46312798
>> ---------------------------------------------------------------------
>>
>> _____________________________________________________________
>> This email (including any attached files) is confidential and is for the
>> intended recipient(s) only. If you received this email by mistake, please,
>> as a courtesy, tell the sender, then delete this email.
>>
>> The views and opinions are the originator's and do not necessarily reflect
>> those of the University of Southern Queensland. Although all reasonable
>> precautions were taken to ensure that this email contained no viruses at the
>> time it was sent we accept no liability for any losses arising from its
>> receipt.
>>
>> The University of Southern Queensland is a registered provider of education
>> with the Australian Government.
>> (CRICOS Institution Code QLD 00244B / NSW 02225M, TEQSA PRV12081 )
>>
>> _______________________________________________
>> Oscar-users mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/oscar-users
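
Item 7 in Richard's checklist above (searching /etc and /var/lib/torque for the server's IP address) can be sketched in Python as below. This is only an illustrative sketch, not something taken from the thread: the function name `find_files_containing`, the `max_bytes` cutoff, and the example address are hypothetical, and it performs a plain substring match, so a search for "192.0.2.10" would also match "192.0.2.100".

```python
import os


def find_files_containing(root, needle, max_bytes=5_000_000):
    """Return the sorted paths of regular files under root that contain needle.

    Hypothetical helper: files larger than max_bytes and files that cannot
    be read (permissions, races with log rotation) are silently skipped.
    """
    needle_bytes = needle.encode()
    matches = []
    # os.walk does not follow symlinked directories by default.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if not os.path.isfile(path):
                    continue  # skip sockets, FIFOs, broken symlinks
                if os.path.getsize(path) > max_bytes:
                    continue  # skip very large files
                with open(path, "rb") as handle:
                    if needle_bytes in handle.read():
                        matches.append(path)
            except OSError:
                continue  # unreadable file: ignore and move on
    return sorted(matches)
```

Run as root on the head node, a call such as `find_files_containing("/var/lib/torque", "192.0.2.10")` (with the placeholder replaced by the address of interest) would list every file that still mentions that address, which makes it easier to separate old log entries from configuration files.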
_______________________________________________ Oscar-users mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/oscar-users
