Hi Richard,

I think that this is a torque+maui configuration issue on your cluster.

Can you please make sure that torque and maui are both configured correctly?
The torque and maui admin manuals should be easy to find online.
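
Since the server's IP address changed, one quick sanity check is whether the
server name still resolves to the new address on every node. Below is a
minimal Python sketch of that check, not a definitive tool. The function name
resolves_to is made up for illustration, and the host name and address in the
usage note are placeholders to replace with your own values.

```python
import socket


def resolves_to(hostname, expected_ip, resolver=socket.gethostbyname):
    """Return (ok, actual) for a forward lookup of hostname.

    ok is True only when the lookup succeeds and matches expected_ip.
    actual is the resolved address, or the error text if the lookup failed.
    The resolver argument is only there so the lookup can be swapped out.
    """
    try:
        actual = resolver(hostname)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        return False, str(exc)
    return actual == expected_ip, actual
```

For example, running resolves_to("oscar_server", "10.0.0.1") on the head node
and on a compute node (with your real server name and new address) should
return True as the first element on both.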

One thing I would try is to look at the log messages that are generated on
the server and client sides when a new job is submitted.
They should give some hints about your problem.
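
To go through those logs quickly, something like the following Python sketch
could help. It walks a directory tree and reports every line that contains a
given string, for example a job id or the old IP address of the server. The
function name find_lines is made up for illustration, and the log directory
in the usage note is an assumption based on the /var/lib/torque path you
mentioned, so adjust it to wherever your logs actually live.

```python
import os


def find_lines(root, needle):
    """Yield (path, line_number, line) for each line under root containing needle."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            try:
                # errors="replace" keeps the scan going on odd bytes in a log
                with open(path, "r", errors="replace") as handle:
                    for number, line in enumerate(handle, start=1):
                        if needle in line:
                            yield path, number, line.rstrip("\n")
            except OSError:
                # Unreadable file (permissions, vanished log): skip it
                continue
```

For example, list(find_lines("/var/lib/torque", "<job id>")) right after a
test submission would show which log files mention that job and where.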

Regards,

--
- DongInn



> On May 31, 2016, at 8:40 PM, Richard Young <richard.yo...@usq.edu.au> wrote:
> 
> Lahaye
> - No I can't see the ganglia web interface on either the public or private 
> interfaces, it says "you have no permission"
> - the admin node is setup as a forwarding dns server and lookups seem to work 
> correctly
> - the firewall/iptables services have been stopped, with only an iptables 
> rule set from the command line to forward and NAT traffic
> - nscd cache has been turned off
> - munge is running
> - torque/maui packages did get updated; the configurations have been 
> checked to make certain they were the same as before the update.
> 
> Thanks
> 
> ---------------------------------------------------------------------
> Richard A. Young
> ICT Services
> Email: richard.yo...@usq.edu.au   Phone: (07) 46315557
> Mob:   0437544370          Fax:   (07) 46312798
> ---------------------------------------------------------------------
> 
> -----Original Message-----
> From: LAHAYE Olivier [mailto:olivier.lah...@cea.fr]
> Sent: Tuesday, 31 May 2016 6:12 PM
> To: oscar-users@lists.sourceforge.net
> Subject: Re: [Oscar-users] Jobs not running on reconfigured cluster
> 
> Hi Richard,
> 
> - Can you see ganglia web interface?
> - Are you using a DNS for your cluster?
> - Are firewalld / iptables services stopped?
> - Has the nscd cache been reset?
> - Is munge running?
> - I'm not using torque/maui anymore, so I can't check on my side to see if 
> there is some specific config to check...
> - Did the torque / maui packages get updated during the process?
> 
> Olivier.
> --
>   Olivier LAHAYE
>   CEA DRT/LIST/DIR
> 
> ________________________________________
> From: Richard Young [richard.yo...@usq.edu.au]
> Sent: Tuesday, 31 May 2016 06:29
> To: 'oscar-users@lists.sourceforge.net'
> Subject: Re: [Oscar-users] Jobs not running on reconfigured cluster
> 
> DongInn
> I did check these before, but I have re-checked them as below:
> 1.      /etc/hosts are the same across the cluster.
> 2.      can ssh to a node and back without any problems or password. The 
> known_hosts file has been updated and copied across the cluster.
> 3.      checked nagios/nrpe and it is setup to allow the admin node to 
> collect details.
> 4.      ganglia/gmond is setup to talk to the admin node.
> 5.      pbs_server and maui on the admin have been restarted with no reported 
> errors in the log files.
> 6.      pbs_mom on the nodes has been restarted with no reported errors in 
> the log files.
> 7.      a search through /etc and /var/lib/torque for the old ip-address of 
> the server doesn't find anything other than old log entries.
> 8.      /etc/dhcp/dhcpd.conf has been updated.
> 9.      /etc/ntp.conf has been updated across the cluster.
> 
> Thanks
> 
> ---------------------------------------------------------------------
> Richard A. Young
> ICT Services
> Email: richard.yo...@usq.edu.au   Phone: (07) 46315557
> Mob:   0437544370          Fax:   (07) 46312798
> ---------------------------------------------------------------------
> 
> -----Original Message-----
> From: Kim, DongInn [mailto:di...@indiana.edu]
> Sent: Tuesday, 31 May 2016 12:05 PM
> To: Users OSCAR
> Subject: Re: [Oscar-users] Jobs not running on reconfigured cluster
> 
> Hi Richard,
> 
> If I were you, I would double-check the following items.
> 
> 1. /etc/hosts, ssh keys, nagios/nrpe, gmetad/gmond are all synced across all 
> the nodes.
> 2. Make sure that the root user can ssh into all the nodes back and forth 
> without a password.
> 3. All the job-submission daemons are running on all the nodes:
>    (torque-server and torque-mom on the head node, torque-mom on the client 
> nodes, and maui on the head node)
>    I assume that you are using torque as the RM and maui as the scheduler.
> 
> Regards,
> 
> --
> - DongInn
> 
> 
> 
>> On May 30, 2016, at 7:25 PM, Richard Young <richard.yo...@usq.edu.au> wrote:
>> 
>> I was hoping somebody would be able to help me with the following problem.
>> 
>> Recently I applied updates and did some reconfiguration on a RHEL6.8 
>> cluster running Oscar. The major change was changing the IP address of the 
>> oscar_server, which was required because of changes to the network 
>> structure. The new IP address has been applied to /etc/hosts, ssh keys, 
>> nagios/nrpe, gmetad/gmond etc. However, I must have missed something, 
>> because no jobs will run on the cluster now. The jobs basically sit in the 
>> queue and then get cancelled because they have hit their walltime.
>> 
>> Has anybody come across this problem before, and can anyone offer some 
>> insight into how to fix it?
>> 
>> Thanks
>> 
>> ---------------------------------------------------------------------
>> Richard A. Young
>> ICT Services
>> HPC Systems Engineer
>> University of Southern Queensland
>> Toowoomba, Queensland 4350
>> Australia
>> Email: richard.yo...@usq.edu.au   Phone: (07) 46315557
>> Mob:   0437544370          Fax:   (07) 46312798
>> ---------------------------------------------------------------------
>> 
>> 
>> 
>> _____________________________________________________________
>> This email (including any attached files) is confidential and is for the 
>> intended recipient(s) only. If you received this email by mistake, please, 
>> as a courtesy, tell the sender, then delete this email.
>> 
>> The views and opinions are the originator's and do not necessarily reflect 
>> those of the University of Southern Queensland. Although all reasonable 
>> precautions were taken to ensure that this email contained no viruses at the 
>> time it was sent we accept no liability for any losses arising from its 
>> receipt.
>> 
>> The University of Southern Queensland is a registered provider of education 
>> with the Australian Government.
>> (CRICOS Institution Code QLD 00244B / NSW 02225M, TEQSA PRV12081 )
>> 
>> 
>> ----------------------------------------------------------------------
>> -------- What NetFlow Analyzer can do for you? Monitors network
>> bandwidth and traffic patterns at an interface-level. Reveals which
>> users, apps, and protocols are consuming the most bandwidth. Provides
>> multi-vendor support for NetFlow, J-Flow, sFlow and other flows. Make
>> informed decisions using capacity planning reports.
>> https://ad.doubleclick.net/ddm/clk/305295220;132659582;e
>> _______________________________________________
>> Oscar-users mailing list
>> Oscar-users@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/oscar-users

Attachment: signature.asc
Description: Message signed with OpenPGP using GPGMail

