Re: [Server-devel] Ejabberd CPU/RAM Spike - Crashes

2009-12-21 Thread Martin Langhoff
On Sun, Dec 20, 2009 at 12:57 PM, Martin Langhoff
martin.langh...@gmail.com wrote:
 Yep, I am interested in getting to the bottom of this.

I think I have an initial assessment of the situation.

Clearly, the mnesia DB got corrupted somehow. Because of that...

 - the init script cannot stop ejabberd normally...

 - killall -9 beam kills the beam processes, which get restarted right
away (such is the magic of erlang's engine failsafe design) by
epmd...

 - Moodle's cronjob talks to ejabberd every 5 minutes. When ejabberd
is broken, you get a pileup of php scripts trying to run ejabberdctl
again and again.

so your attempts to follow my instructions (stop ejabberd, remove
corrupt DB, start it again) didn't succeed.

Now it's up in a pristine state, and I am monitoring it...



m
-- 
 martin.langh...@gmail.com
 mar...@laptop.org -- School Server Architect
 - ask interesting questions
 - don't get distracted with shiny stuff  - working code first
 - http://wiki.laptop.org/go/User:Martinlanghoff
___
Server-devel mailing list
Server-devel@lists.laptop.org
http://lists.laptop.org/listinfo/server-devel


Re: [Server-devel] Ejabberd CPU/RAM Spike - Crashes

2009-12-21 Thread Martin Langhoff
On Mon, Dec 21, 2009 at 3:14 PM, Martin Langhoff
martin.langh...@gmail.com wrote:
 Now it's up in a pristine state, and I am monitoring it...

Ok - the problem seems related to Moodle's control of ejabberd
presence service. The sync between Moodle and ejabberd data (in
mnesia) was taking too long, and a second Moodle sync process would
start... and then a 3rd... and then...

This led to errors that should be benign (an error reported in the
logs, but not leading to a functional problem) -- because ejabberd's
internals are all about supporting things that happen concurrently.
But! something inside ejabberd isn't liking the concurrency.

I've added a big lock around the process, so from now on Moodle
processes won't overlap in this sync. This means that your server is
now running a lightly patched Moodle -- I will release this as a new
rpm soon.
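
The effect of such a lock can be illustrated outside Moodle. The sketch below is not the actual Moodle patch (which is not shown in this thread); it is a shell analogue that uses an atomic `mkdir` as the lock, so a job started while a previous run is still going exits instead of piling up. The function name `run_locked` and the lock path are hypothetical.

```shell
#!/bin/sh
# Run a command only if no other instance holds the lock.
# mkdir is atomic: exactly one caller can create the directory.
run_locked() {
    lockdir=$1
    shift
    if mkdir "$lockdir" 2>/dev/null; then
        "$@"
        status=$?
        rmdir "$lockdir"
        return "$status"
    fi
    echo "previous run still active, skipping" >&2
    return 75
}

# Demo with a throwaway lock path; a cron job would wrap its sync command
# the same way (command and path would be site-specific).
run_locked "${TMPDIR:-/tmp}/demo-sync.$$.lock" echo "sync ran"
```

One caveat with this approach: if the job is killed before it reaches `rmdir`, the lock directory is left behind and has to be removed by hand.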

According to ps_mem.py, beam started at 14MB and has now grown to 16MB;
this is with no users connected. In normal operation (once users
connect), I would expect it to grow to ~40MB.

cheers,


m


Re: [Server-devel] Ejabberd CPU/RAM Spike - Crashes

2009-12-21 Thread Devon Connolly
Ok then.  Thanks a lot for the assistance.  Things seem to be back to  
normal.  I will look closer tomorrow when the kids are here.




Re: [Server-devel] Ejabberd CPU/RAM Spike - Crashes

2009-12-20 Thread Martin Langhoff
On Sat, Dec 19, 2009 at 7:32 PM, Devon Connolly dev...@gmail.com wrote:

  - Is there any disk anomaly? (Reboot forcing a fsck?)

 Not that I've noticed.

Ok, but can you try doing a reboot that forces fsck? As follows:

 touch /forcefsck
 reboot

or

  shutdown -Fr now

 Verify checked out on the ejabberd-xs package.

There might be something with the erlang binaries?

 There isn't much sense in reposting the results of the script, as the
 results are essentially the same.  As ejabberd is crashing, I cannot kill
 it to reapply the domain change.  I can set you up an ssh account so you
 can get a look at what is going on.  Perhaps you will see something I am
 overlooking.  Let me know and I will send you the info.

Yep, I am interested in getting to the bottom of this. You'll see a
private email from me soon.

cheers,



m


[Server-devel] Ejabberd CPU/RAM Spike - Crashes

2009-12-19 Thread Devon Connolly
Here is another example after it has been running all night.

http://pastebin.com/m11537281

As you can see, these runaway beam processes vary greatly in their RAM
usage.  Also, they are always using 100% of the cpu.

I will try to clear the DB now and see what happens.





Re: [Server-devel] Ejabberd CPU/RAM Spike - Crashes

2009-12-19 Thread Devon Connolly
Changing the domain, I still get the following error when it tries (and
fails) to shut down ejabberd.

___
Crash dump was written to: erl_crash.dump
Kernel pid terminated (application_controller)
({application_start_failure,kernel,{shutdown,{kernel,start,[normal,[]]}}})
{error_logger,{{2009,12,19},{12,19,16}},Protocol: ~p: register error:
~p~n,[inet_tcp,{{badmatch,{error,duplicate_name}},[{inet_tcp_dist,listen,1},{net_kernel,start_protos,4},{net_kernel,start_protos,3},{net_kernel,init_node,2},{net_kernel,init,1},{gen_server,init_it,6},{proc_lib,init_p_do_apply,3}]}]}
{error_logger,{{2009,12,19},{12,19,16}},crash_report,[[{pid,0.20.0},{registered_name,net_kernel},{error_info,{exit,{error,badarg},[{gen_server,init_it,6},{proc_lib,init_p_do_apply,3}]}},{initial_call,{net_kernel,init,['Argument__1']}},{ancestors,[net_sup,kernel_sup,0.8.0]},{messages,[]},{links,[#Port0.84,0.17.0]},{dictionary,[{longnames,false}]},{trap_exit,true},{status,running},{heap_size,610},{stack_size,23},{reductions,505}],[]]}
{error_logger,{{2009,12,19},{12,19,16}},supervisor_report,[{supervisor,{local,net_sup}},{errorContext,start_error},{reason,{'EXIT',nodistribution}},{offender,[{pid,undefined},{name,net_kernel},{mfa,{net_kernel,start_link,[[ejabberdctl,shortnames]]}},{restart_type,permanent},{shutdown,2000},{child_type,worker}]}]}
{error_logger,{{2009,12,19},{12,19,16}},supervisor_report,[{supervisor,{local,kernel_sup}},{errorContext,start_error},{reason,shutdown},{offender,[{pid,undefined},{name,net_sup},{mfa,{erl_distribution,start_link,[]}},{restart_type,permanent},{shutdown,infinity},{child_type,supervisor}]}]}
{error_logger,{{2009,12,19},{12,19,16}},crash_report,[[{pid,0.7.0},{registered_name,[]},{error_info,{exit,{shutdown,{kernel,start,[normal,[]]}},[{application_master,init,4},{proc_lib,init_p_do_apply,3}]}},{initial_call,{application_master,init,['Argument__1','Argument__2','Argument__3','Argument__4']}},{ancestors,[0.6.0]},{messages,[{'EXIT',0.8.0,normal}]},{links,[0.6.0,0.5.0]},{dictionary,[]},{trap_exit,true},{status,running},{heap_size,233},{stack_size,23},{reductions,123}],[]]}
{error_logger,{{2009,12,19},{12,19,16}},std_info,[{application,kernel},{exited,{shutdown,{kernel,start,[normal,[]]}}},{type,permanent}]}
{Kernel pid
terminated,application_controller,{application_start_failure,kernel,{shutdown,{kernel,start,[normal,[]]

Crash dump was written to: erl_crash.dump
Kernel pid terminated (application_controller)
({application_start_failure,kernel,{shutdown,{kernel,start,[normal,[]]}}})
__

Beam is still consuming 100% of the cpu after a few minutes.  I'm going to
leave that script running to see what it does over the next few hours.

I imagine I now have to re-register all XO's?





Re: [Server-devel] Ejabberd CPU/RAM Spike - Crashes

2009-12-19 Thread Martin Langhoff
On Sat, Dec 19, 2009 at 1:31 PM, Devon Connolly dev...@gmail.com wrote:
 Beam is still consuming 100% of the cpu after a few minutes.  I'm going to
 leave that script running to see what it does over the next few hours.

That's really abnormal.

 - Is there any disk anomaly? (Reboot forcing a fsck?)

 - Is there any problem in the binaries? If you run rpm with the
'verify' options, it'll check that no binaries have been corrupted
on-disk... It's normal to see some config files changed, but no
binaries should be different from the rpms.
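
As a sketch of that check: `rpm -V` prints one line per file that differs from the package database, and marks configuration files with a `c` between the flags and the path. The filter below hides those expected config changes so that only altered non-config files (such as binaries) remain. The package name `ejabberd-xs` is the one mentioned in this thread; checking an `erlang` package as well is an assumption about how the runtime is packaged. The helper name `non_config_changes` is hypothetical.

```shell
#!/bin/sh
# Drop lines for config files (marked "c") from `rpm -V` output on stdin.
non_config_changes() {
    awk '$2 != "c"'
}

# Usage on the server (not run here; the second package name is an assumption):
#   rpm -V ejabberd-xs erlang | non_config_changes
```

No output from the pipeline would mean that no packaged non-config file differs from what the rpm installed.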

 I imagine I now have to re-register all XO's?

Nope. The DB gets rebuilt automagically for you, 100%, on XS-0.6 .

cheers,



m




Re: [Server-devel] Ejabberd CPU/RAM Spike - Crashes

2009-12-18 Thread Devon Connolly

 Don't reinstall. If possible, let's try to debug this. If you're going
 to give up, just

 1 - Backup /var/lib/ejabberd -- just tar it up
 2 - Use the 'domain_config' script to change the domain -- this will
 re-generate the ejabberd mnesia database. What I'd do: change it to
 'foo.com' and then back to the right domain.

I'd like to debug but I only have about a week left here so I need the
server to be stable before I leave.  I can debug for a while, but as we
approach the holidays, I may need to throw in the towel.

 I assume you have the different APs in different channels, and
 generally avoid channel 1 (as that's where XOs engage in 'mesh' by
 default...)...


What we really need is an RF site survey.  Unfortunately, there is nobody  
around that can.  They are on different channels but I am forced to use  
all 3 channels in such a small space.  We also have some rude neighbors  
that decided to amplify their WIFI on channel 6 essentially blanketing the  
school with interference on that channel.  So I have 1 AP on 6, 2 on  
channel 1, and 2 on channel 11.

Anyway, back on topic...  Here is that script slightly modified running on  
a fresh boot.  I'm going to leave this looping and post the file to  
pastebin.  Here is an initial output after only like 10 minutes.  It will  
get more interesting over time.  I'll paste another later this afternoon.

http://pastebin.com/m3426a094


Re: [Server-devel] Ejabberd CPU/RAM Spike - Crashes

2009-12-18 Thread Martin Langhoff
On Fri, Dec 18, 2009 at 1:37 PM, Devon Connolly dev...@gmail.com wrote:
 Anyway, back on topic...  Here is that script slightly modified running on
 a fresh boot.  I'm going to leave this looping and post the file to
 pastebin.  Here is an initial output after only like 10 minutes.  It will
 get more interesting over time.  I'll paste another later this afternoon.

outrageous. beam should have only ~40MB in use, total.

if you 'clear' the mnesia db as i suggested (keep a copy for
forensics!), does it get better?



m


Re: [Server-devel] Ejabberd CPU/RAM Spike - Crashes

2009-12-17 Thread Devon Connolly
XS Version: 0.6
1 GB Physical Ram, 2GB Swap
154 XO's Registered, Any number connected when the problem happens, 0-XX
The XS is controlling dhcp but nothing out of the ordinary as far as  
leases are concerned.
No Active Antenna

# /home/idmgr/list_registration
http://pastebin.com/m762076bb

# ejabberdctl stats registeredusers
154

# ejabberdctl connected-users

032a8890f8a9731cfc611580524176a1f8f6c...@schoolserver.notredame.sn/Telepathy
0a0c7fd971cdd25851ba34c9df66ef1845900...@schoolserver.notredame.sn/Telepathy
1c058ff553b654a3d808a3ffe95aadf4de841...@schoolserver.notredame.sn/Telepathy
26b8669a3e9387ac726296de07deced5aaf49...@schoolserver.notredame.sn/Telepathy
2f596cc8d6977519411f5c8fcc65e751e8bd3...@schoolserver.notredame.sn/Telepathy
909785500a4fc5e14fe9f1cd7657e7ac34440...@schoolserver.notredame.sn/Telepathy
9b2102f9af673393c9faa1f3565bd28773f48...@schoolserver.notredame.sn/Telepathy
b4e5426593e58970c1b5dafa2adb39e4c3e59...@schoolserver.notredame.sn/Telepathy
b7b58f3b01f49c8c652ddaedffd6faeef555b...@schoolserver.notredame.sn/Telepathy
efb20aece0870421fc0f3facc58653bdac922...@schoolserver.notredame.sn/Telepathy
f9b21026d27589b02b894e221e5531cd1edd1...@schoolserver.notredame.sn/Telepathy

# olpc-netstatus
//The XO's are using gabble

After leaving it on all night, load averages hit 30.  It was
unresponsive, and any calls to ejabberdctl yielded the following error:

#ejabberdctl --node ejabb...@schoolserver connected-users
__
{error_logger,{{2009,12,17},{10,0,25}},Protocol: ~p: register error:  
~p~n,[inet_tcp,{{badmatch,{error,duplicate_name}},[{inet_tcp_dist,listen,1},{net_kernel,start_protos,4},{net_kernel,start_protos,3},{net_kernel,init_node,2},{net_kernel,init,1},{gen_server,init_it,6},{proc_lib,init_p_do_apply,3}]}]}
{error_logger,{{2009,12,17},{10,0,25}},crash_report,[[{pid,0.20.0},{registered_name,net_kernel},{error_info,{exit,{error,badarg},[{gen_server,init_it,6},{proc_lib,init_p_do_apply,3}]}},{initial_call,{net_kernel,init,['Argument__1']}},{ancestors,[net_sup,kernel_sup,0.8.0]},{messages,[]},{links,[#Port0.84,0.17.0]},{dictionary,[{longnames,false}]},{trap_exit,true},{status,running},{heap_size,610},{stack_size,23},{reductions,506}],[]]}
{error_logger,{{2009,12,17},{10,0,25}},supervisor_report,[{supervisor,{local,net_sup}},{errorContext,start_error},{reason,{'EXIT',nodistribution}},{offender,[{pid,undefined},{name,net_kernel},{mfa,{net_kernel,start_link,[[ejabberdctl,shortnames]]}},{restart_type,permanent},{shutdown,2000},{child_type,worker}]}]}
{error_logger,{{2009,12,17},{10,0,25}},supervisor_report,[{supervisor,{local,kernel_sup}},{errorContext,start_error},{reason,shutdown},{offender,[{pid,undefined},{name,net_sup},{mfa,{erl_distribution,start_link,[]}},{restart_type,permanent},{shutdown,infinity},{child_type,supervisor}]}]}
{error_logger,{{2009,12,17},{10,0,25}},crash_report,[[{pid,0.7.0},{registered_name,[]},{error_info,{exit,{shutdown,{kernel,start,[normal,[]]}},[{application_master,init,4},{proc_lib,init_p_do_apply,3}]}},{initial_call,{application_master,init,['Argument__1','Argument__2','Argument__3','Argument__4']}},{ancestors,[0.6.0]},{messages,[{'EXIT',0.8.0,normal}]},{links,[0.6.0,0.5.0]},{dictionary,[]},{trap_exit,true},{status,running},{heap_size,233},{stack_size,23},{reductions,123}],[]]}
{error_logger,{{2009,12,17},{10,0,26}},std_info,[{application,kernel},{exited,{shutdown,{kernel,start,[normal,[]]}}},{type,permanent}]}
{Kernel pid  
terminated,application_controller,{application_start_failure,kernel,{shutdown,{kernel,start,[normal,[]]

Crash dump was written to: erl_crash.dump
Kernel pid terminated (application_controller)  
({application_start_failure,kernel,{shutdown,{kernel,start,[normal,[]]}}})


Re: [Server-devel] Ejabberd CPU/RAM Spike - Crashes

2009-12-17 Thread Martin Langhoff
On Thu, Dec 17, 2009 at 11:35 AM, Devon Connolly dev...@gmail.com wrote:
 XS Version: 0.6
 1 GB Physical Ram, 2GB Swap

Ok - the RAM is on the low side for an XS but should handle 150 ok.

 # ejabberdctl connected-users
...
I counted 12 lines in the output of connected-users. That should not
cause trouble.

 After leaving it on all night, load averages hit 30

 - Did you also leave XOs running connected to it, or were XOs
completely disconnected?

 - Are you perhaps using an AP that does its own DHCP? One way to
check for certain is to connect an XO, and then grep /var/lib/dhcpd/
(or is it /var/spool/dhcpd/ ?) for the MAC address of the XO
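
That check could look like the sketch below. It assumes the ISC dhcpd lease format, where each lease records the client as `hardware ethernet <MAC>;`. The lease directory is passed in as an argument because its location is uncertain; the function name `lease_for_mac` is hypothetical.

```shell
#!/bin/sh
# Print lease-file lines mentioning a given MAC address (case-insensitive).
# A non-zero exit status means no lease for that MAC was found under the
# directory, which would suggest some other DHCP server answered the XO.
lease_for_mac() {
    mac=$1
    dir=$2
    grep -riF "hardware ethernet $mac" "$dir"
}

# Usage (MAC and path are placeholders):
#   lease_for_mac 00:17:c4:aa:bb:cc /var/lib/dhcpd/
```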

 {error_logger,{{2009,12,17},{10,0,25}},Protocol: ~p: register error:

That crash dump is because it cannot spawn the new thread/process --
there's no hint in it of who/what is hogging them.
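
A possible additional lead, offered as a hedged note: `duplicate_name` is the error the Erlang port mapper (epmd) gives when a node name is already registered, and the `mfa` entry in the trace shows the name being requested is `ejabberdctl`. That would be consistent with an earlier `ejabberdctl` call still hanging around. A sketch for listing the names epmd currently knows about, assuming `epmd` is on the PATH; the helper name `registered_nodes` is hypothetical.

```shell
#!/bin/sh
# Extract node names from `epmd -names` output read on stdin.
# Lines of interest look like: "name <node> at port <port>".
registered_nodes() {
    awk '$1 == "name" { print $2 }'
}

# Usage on the server:
#   epmd -names | registered_nodes
```

If `ejabberdctl` shows up in that list while no `ejabberdctl` command is supposed to be running, a stale invocation is probably still holding the name.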

Seems that ejabberd is consuming all resources (network handles, RAM)
over time, even with no usage or very light usage. This is unexpected.
We did a lot of load-testing of ejabberd, with many clients
connecting, sending msgs, disconnecting over a period of time and we
never saw such resource leaks.

What we saw was memory usage growing a bit with connects/disconnects,
and a GC trimming it down periodically. Memory and CPU use was
reasonably stable over time, within that see-saw.

Is there anything else that could be odd or non-standard in your
setup? Are you in a VM? Is eth0 on the XS configured via dhcp with a
short lease? Is there anything in the network between the XOs and the
XS?

cheers,



m


Re: [Server-devel] Ejabberd CPU/RAM Spike - Crashes

2009-12-17 Thread Martin Langhoff
On Thu, Dec 17, 2009 at 1:12 PM, Martin Langhoff
martin.langh...@gmail.com wrote
 On Thu, Dec 17, 2009 at 11:35 AM, Devon Connolly dev...@gmail.com wrote:
 XS Version: 0.6
 1 GB Physical Ram, 2GB Swap

 Ok - the RAM is on the low side for an XS but should handle 150 ok.

 # ejabberdctl connected-users
 ...
 I counted 12 lines in the output of connected-users. That should not
 cause trouble.

Also - can you get your hands on ps_mem.py, and run it when the
machine is getting into trouble? I want to correlate the output of
ps_mem.py for ejabberd vs the number of connected users, run something
like this on a console

while true; do (echo `date -u`; vmstat; ps_mem.py | grep ejabberd;
ejabberdctl connected-users | wc -l) >> mylog ; sleep 60 ; done;

untested, may need tweaking to work properly. If you run it during the
day and also during the night, will be most interesting.
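
A slightly more defensive variant of that loop, as a sketch: each sample is written even when `ejabberdctl` or `ps_mem.py` fails, so the log keeps growing after the node has crashed. It greps for `beam` and `erl` as well as `ejabberd`, on the assumption that ps_mem.py may list the process under the Erlang VM's name. The function name `log_sample` is hypothetical.

```shell
#!/bin/sh
# Append one timestamped sample to the log file given as $1.
log_sample() {
    {
        date -u
        vmstat 2>/dev/null || echo "vmstat: unavailable"
        ps_mem.py 2>/dev/null | grep -E 'ejabberd|beam|erl' \
            || echo "ps_mem: no matching process"
        # Count connected users; report a failure instead of aborting.
        if users=$(ejabberdctl connected-users 2>/dev/null); then
            echo "connected: $(printf '%s\n' "$users" | grep -c .)"
        else
            echo "connected: ejabberdctl failed"
        fi
    } >> "$1"
}

# Usage: sample once a minute until interrupted.
#   while true; do log_sample mylog; sleep 60; done
```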

cheers,


m


Re: [Server-devel] Ejabberd CPU/RAM Spike - Crashes

2009-12-17 Thread Devon Connolly
The server had an uptime of about 50 days before this occurred.  There were
no problems and nothing has changed in the 2 or so days since this problem
began.  Like I had said previously, it seems to have occurred since reflashing
and re-registering a student's XO, but I believe that to be a coincidence.

 - Are you perhaps using an AP that does its own DHCP? One way to
 check for certain is to connect an XO, and then grep /var/lib/dhcpd/
 (or is it /var/spool/dhcpd/ ?) for the MAC address of the XO

We are using 5 wireless AP's.  4 of which are Linksys WRT54G's running
DD-WRT and one is a D-Link modem/AP combo.  DHCP is deactivated on all of
the above.


 - Did you also leave XOs running connected to it, or were XOs
 completely disconnected?

I believe all XO's were disconnected.  It is possible some were left
connected while in their charging cabinets, but doubtful.

Is there anything else that could be odd or non-standard in your
setup? Are you in a VM? Is eth0 on the XS configured via dhcp with a
short lease? Is there anything in the network between the XOs and the
XS?

Nothing non-standard really.  eth0 is fixed.  Although, this server came
pre-installed from the folks involved with the Give One Get One program in
Rwanda.  I'm not sure what was modified from the stock server install.  I am
debating reinstalling the server from scratch.

I haven't been paying as much attention to the server lately as I should.
As it had been running for about 50 days, I only checked in with the school
periodically.  There were problems but mainly in relation to the presence
service and reliably connecting 30 - 100 laptops to the network at one
time.  I attribute this behavior to the Linksys AP's as they only seem to
handle about 20 connections per AP reliably.  There is also a good amount of
wireless interference to contend with; however, the server was working
well.  As it is a bit under-powered, load averages generally stay within the
1.2-1.5 range.

As I write this, the server has an uptime of about 9 hours.  Load averages
have reached 25 across the board.  The dump files have consumed over a gig
of space filling up the root partition.
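
To keep an eye on how much space the dumps take, something like the following sketch could be run from time to time. The directory `/var/log/ejabberd` and the `erl_crash*.dump` naming come from the dump excerpt and error output posted elsewhere in this thread; the function name `dump_usage` is hypothetical.

```shell
#!/bin/sh
# Print the number of Erlang crash dumps in a directory and their total size in KB.
dump_usage() {
    dir=$1
    count=$(find "$dir" -maxdepth 1 -name 'erl_crash*.dump' | wc -l | tr -d ' ')
    kb=$(find "$dir" -maxdepth 1 -name 'erl_crash*.dump' -exec du -k {} + \
         | awk '{ s += $1 } END { print s + 0 }')
    echo "$count dumps, $kb KB"
}

# Usage: dump_usage /var/log/ejabberd
```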

while true; do (echo `date -u`; vmstat; ps_mem.py | grep ejabberd;
ejabberdctl connected-users | wc -l) >> mylog ; sleep 60 ; done;

Tried the script at night with the high load, and it cannot complete as the
ejabberd node has since crashed.  ejabberdctl yields the following error:

_
RPC failed on the node ejabb...@schoolserver: {'EXIT',
    {badarg,
        [{ets,lookup,[hooks,{ejabberd_ctl_process,global}]},
         {ejabberd_hooks,run_fold,4},
         {ejabberd_ctl,process,1},
         {rpc,'-handle_call/3-fun-0-',5}]}}
__

Individually issuing the commands:
# vmstat
Thu Dec 17 20:07:19 UTC 2009
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
25  0 705768  63912 123132 239040   53   92   153   711 1089  539 61 38  0  1  0

# ps_mem.py | grep ejabberd

No output

I've included a screenshot of htop for your viewing pleasure.

http://omploader.org/vMzBvZQ/htop_screen.jpg

I'll give you more relevant info tomorrow.




Re: [Server-devel] Ejabberd CPU/RAM Spike - Crashes

2009-12-17 Thread Martin Langhoff
On Thu, Dec 17, 2009 at 9:32 PM, Devon Connolly dev...@gmail.com wrote:
 The server had an uptime of about 50 days before this occurred.  There were
 no problems and nothing has changed in the 2 or so days since this problem
 began.  Like had said previously, it seems to have occurred since reflashing
 and re-registering a student's XO, but I believe that to be a coincidence.

Hmmm, maybe something's gone wonky on the mnesia DB.

 We are using 5 wireless AP's.  4 of which are Linksys WRT54G's running
 DD-WRT and one is a D-Link modem/AP combo.  DHCP is deactivated on all of
 the above.

Good.

 - Did you also leave XOs running connected to it, or were XOs
 completely disconnected?

 I believe all XO's were disconnected.  It is possible some were left
 connected while in their charging cabinets, but doubtful.

Ok. Then ejabberd is getting messed up all on its own...

 Nothing non-standard really.  eth0 is fixed.

good

 Although, this server came
 pre-installed from the folks involved with the Give One Get One program in
 Rwanda.  I'm not sure what was modified from the stock server install.  I am
 debating reinstalling the server from scratch.

Don't reinstall. If possible, let's try to debug this. If you're going
to give up, just

1 - Backup /var/lib/ejabberd -- just tar it up
2 - Use the 'domain_config' script to change the domain -- this will
re-generate the ejabberd mnesia database. What I'd do: change it to
'foo.com' and then back to the right domain.
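
Step 1 could be done along these lines; a sketch only. The archive destination and the function name `backup_dir` are assumptions, and it is probably best to run it with ejabberd stopped so the mnesia files are not changing underneath tar.

```shell
#!/bin/sh
# Archive a directory into a timestamped tarball and print the tarball's path.
backup_dir() {
    src=$1
    dest=$2
    out="$dest/$(basename "$src")-$(date -u +%Y%m%d-%H%M%S).tar.gz"
    tar -czf "$out" -C "$(dirname "$src")" "$(basename "$src")" && echo "$out"
}

# Usage (destination is a placeholder):
#   backup_dir /var/lib/ejabberd /root
```

Keeping the tarball around also preserves the suspect database for later forensics.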

 I attribute this behavior to the Linksys AP's as they only seem to
 handle about 20 connections per AP reliably.

yeah. we've seen that plenty.

  There is also a good amount of
 wireless interference to contend with; however, the server was working
 well.

I assume you have the different APs in different channels, and
generally avoid channel 1 (as that's where XOs engage in 'mesh' by
default...)...


while true; do (echo `date -u`; vmstat; ps_mem.py | grep ejabberd;
ejabberdctl connected-users | wc -l) >> mylog ; sleep 60 ; done;

 Tried the script at night with the high load, and it cannot complete as the
 ejabberd node has since crashed.  ejabberdctl yields the following error:

Can you restart ejabberd and try that script?


 # ps_mem.py | grep ejabberd

 No output

Did you download ps_mem.py, and make it executable? (google the name
if needed) If so, you might want to grep for erl instead.

 I've included a screenshot of htop for your viewing pleasure.
 http://omploader.org/vMzBvZQ/htop_screen.jpg

ejabberd sure looks busy there...



m


Re: [Server-devel] Ejabberd CPU/RAM Spike - Crashes

2009-12-16 Thread Martin Langhoff
Hi Devon,

Sure we can debug this. Lots of questions for you

 - version of XS?

 - How much physical RAM?

 - Number of XOs registered, and in use on the network when the problem happens

 - Output of the commands suggested in
http://wiki.laptop.org/go/XS_Techniques_and_Configuration#Presence_Service_.28ejabberd.29_Troubleshooting

 - Is there anything in the network that may be forcing lots of dhcpd
lease reassigns? Is the XS controlling dhcp for the XOs?

 - Are you by any chance using our old (and now unsupported) 'Active
Antenna' on the XS?

cheers,


m

On Wed, Dec 16, 2009 at 8:28 PM, Devon Connolly dev...@gmail.com wrote:
 I'm having some issues with ejabberd after re-flashing and re-registering a
 student's XO. No other changes were made to the server; however, the beam
 process has begun to constantly use 100% cpu while the ram usage swells to
 over 1GB and then proceeds to eat the 2GB swap.  This continues until the
 load average of the server reaches ~14,14,14 at which time the server
 becomes unresponsive.

 Multiple erl crash logs are being created (about 5-10 per minute) in
 /var/log/ejabberd.  A brief excerpt:

 erl_crash_20091216-124645.dump
 _
 =erl_crash_dump:0.1
 Wed Dec 16 12:46:47 2009
 Slogan: Kernel pid terminated (application_controller)
 ({application_start_failure, kernel, {shutdown, {kernel, start, [normal,
 []]}}})
 System version: Erlang (BEAM) emulator version 5.6.5 [source]
 [async-threads:0] [hipe][kernel-poll:false]

 --
 Anyway, each of these crash dump files is thousands of lines long.  Any ideas
 for debugging this?

 Thanks
