xCAT service node specs:
CPU - 2 x Dual-Core AMD Opteron(tm) Processor 2212
Memory - 6 GB (2 x 1 GB and 2 x 512 MB per CPU)
Diskful - shared /install, shared /tftpboot
useflowcontrol is disabled. The xCAT master was at 2.8.2 and was then
upgraded to 2.8.3.
The xCAT DB contains 7851 nodes in the nodelist table.
I also ran a few timing tests, running updateflag.awk from a
compute node against each service node and the master.
One service node takes about 3 times longer than the other 2 service
nodes.
Using the xCAT master (cmx04-hc) is roughly 7 times faster than using
even the faster service nodes.
[root@c103n69 xcatpost]# time ./updateflag.awk servicefarm03-hc 3002 "installstatus booted"
real 0m10.175s
user 0m0.000s
sys 0m0.002s
[root@c103n69 xcatpost]# time ./updateflag.awk servicefarm02-hc 3002 "installstatus booted"
real 0m3.679s
user 0m0.001s
sys 0m0.001s
[root@c103n69 xcatpost]# time ./updateflag.awk servicefarm01-hc 3002 "installstatus booted"
real 0m3.653s
user 0m0.002s
sys 0m0.000s
[root@c103n69 xcatpost]# time ./updateflag.awk master 3002 "installstatus booted"
real 0m0.491s
user 0m0.000s
sys 0m0.002s
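As a rough check on the ratios, here is a minimal Python sketch. The dictionary only restates the "real" values from the runs above; the function name slowdown_vs is made up for illustration.

```python
# Wall-clock ("real") times in seconds, copied from the runs above.
real_seconds = {
    "servicefarm03-hc": 10.175,
    "servicefarm02-hc": 3.679,
    "servicefarm01-hc": 3.653,
    "master": 0.491,
}


def slowdown_vs(baseline, timings):
    """Return how many times slower each target is than the baseline target."""
    base = timings[baseline]
    return {name: t / base for name, t in timings.items() if name != baseline}


ratios = slowdown_vs("master", real_seconds)
# Print the targets from fastest to slowest relative to the master.
for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
    print(f"{name}: {ratio:.1f}x slower than master")
```

With these four samples, servicefarm01-hc and servicefarm02-hc come out at about 7.4x and 7.5x the master's time, and servicefarm03-hc at about 20.7x. These are single runs, so the numbers are only indicative.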
All of the tests were done while the service and master nodes were
idle. On every service node, the CPU usage of the Install Monitor
process jumped to 99% for the entire duration of the call.
This did not seem to happen on the master node (but then again, the
call finished in under a second).
There are also instances where the Install Monitor process stops
responding entirely on a service node. This causes the updateflag.awk
command to hang indefinitely on the compute nodes.
Issuing "service xcatd reload" appears to restart the Install
Monitor, and it starts responding again. However, at that point the
hung updateflag.awk processes on the compute nodes still had to be
killed.
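One way to keep a compute node from hanging indefinitely in this situation would be to run the status update under a time limit. The following is only a sketch in Python and not something xCAT ships; the helper name run_with_timeout and the 30-second limit are assumptions chosen for illustration.

```python
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run cmd; return its exit code, or None if it was killed after timeout_s seconds."""
    try:
        # subprocess.run kills the child and raises TimeoutExpired on timeout.
        return subprocess.run(cmd, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        return None
```

A call such as run_with_timeout(["./updateflag.awk", "servicefarm03-hc", "3002", "installstatus booted"], 30) would then return None instead of blocking forever if the Install Monitor never answers. This only bounds the wait on the compute node; it does not address why the service node stops responding.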
I have looked into enabling the useflowcontrol option, since the docs
say it is enabled by default for new 2.8.3 installs. (Ours was
upgraded from 2.8.2, so it was not enabled.)
No errors pointing to an issue are present in any logs.
Once useflowcontrol is enabled in the site table, will xcatd have to
be restarted/reloaded on the master and service nodes?
On 3/11/2014 7:32 AM, Lissa Valletta wrote:
This is happening with only 50 nodes? How much memory and how many
CPUs do you have on the service nodes? Are the service nodes diskful
installs? What is the setting for the site attribute useflowcontrol,
if it is set at all? We have many systems installing a lot more
nodes than 50 without this problem.
What is the OS and xCAT level on the service node and management
node?
Can you monitor /var/log/messages on the management node during this
to see if you are seeing errors from xcatd on the service node?
Lissa K. Valletta
8-3/B10
Poughkeepsie, NY 12601
(tie 293) 433-3102
From: Russell Jones <russell-l...@jonesmail.me>
To: xcat-user@lists.sourceforge.net
Date: 03/10/2014 05:25 PM
Subject: Re: [xcat-user] Additional performance issues
Hi Wang,
As a follow-up to this, are there any additional performance tips
or things we can do to help with this performance issue? I
don't think that the problem we are seeing is caused by the
"getpostscripts" request, as getpostscript.awk doesn't seem to
connect to port 3002, the port that the "Install Monitor" uses.
As a reminder, the problem we are seeing is the "Install Monitor"
process on the service node being pegged at 100% CPU, and nodes
hanging before showing their login console, *after* the
postscripts have already finished. When nodes finally start
showing their login console, the "Install Monitor" process starts
lowering its CPU usage.
After disabling the site.nodestatus feature we seem to have
resolved the performance issue. However, we would like to use
this feature if we can, as long as it doesn't cause such a huge
drop in service node performance.
Thanks!
On 3/2/2014 8:59 PM, Xiao Peng Wang wrote:
I cannot see any impact from setting 'site.nodestatus=n' except
that the status of the node won't be updated, so you can try
using it to solve your issue.
But I think your issue may be caused by 'getpostscripts' running
when the diskless nodes boot. You can try setting
site.precreatemypostscripts=1 to improve the performance of the
getmypostscript operation.
I suggest trying them one at a time and then both together to see
the result.
Thanks
Best Regards
----------------------------------------------------------------------
Wang Xiaopeng (王晓朋)
IBM China System Technology Laboratory
Tel: 86-10-82453455
Email: w...@cn.ibm.com
Address: 28,ZhongGuanCun Software Park,No.8 Dong Bei Wang West
Road, Haidian District Beijing P.R.China 100193
From: Russell Jones <russell-l...@jonesmail.me>
To: xcat-user@lists.sourceforge.net
Date: 2014/03/02 13:18
Subject: [xcat-user] Additional performance issues
Hi all,
We are seeing pretty consistently that when 50+ diskless nodes are
booted against the same single service node, they all hang for
around 5-10 minutes after postscripts run, before showing their
login console, while the service node is chugging away using a
constant 100% of a single core. The process on the service node
that is at fault is the "install monitor".
This seems to be tied to the site.nodestatus option. Other than
disabling this option (and fixing the diskful reinstall loop bug
with it that I reported earlier), is there a way of improving the
responsiveness of a service node when it's updating the nodes'
status?
Would we lose anything from disabling this option besides the
"status" column not being updated?
Thanks!
_______________________________________________
xCAT-user mailing list
xCAT-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/xcat-user