Hi Mike,
On 06/18/14 21:26, Mike Hanby wrote:
The last file system scan ended when the database crashed due to
memory overconsumption. When I run a manual scan (sudo
/usr/sbin/robinhood -L stdout --scan --once), memory consumption climbs
steadily until all memory and swap are consumed and the database dies
(see attached graph). I tried bumping the system RAM from 4G to 16G, but
the issue persists; it just takes longer to consume all memory (approx.
3 hours). The robinhood VM is running its own MySQL server.
I can't see in the "ps auxf" or top output what's using the memory.
Everything shows as using , only that free shows it being consumed and
eventually the database dies. (See attached graph image).
The following output is with 10G of 16G consumed:
USER   PID %CPU %MEM     VSZ    RSS TTY STAT START   TIME COMMAND
root  1561  0.0  0.0  108204   1196 ?   S    Jun13   0:00 /bin/sh /usr/bin/mysqld_safe --datadir=/var/lib/mysql --socket=/var/lib/mysql/mysql.sock --pid-file=/var/run/mysqld/mysqld.pid --basedir=/usr --user=mysql
mysql 1695  3.5  0.8 1606356 145168 ?   Sl   Jun13 261:09  \_ /usr/libexec/mysqld --basedir=/usr --datadir=/var/lib/mysql --user=mysql --log-error=/var/lib/mysql/robinhood.uabgrid.uab.edu.err --pid-file=/var/run/mysqld/mysqld.pid --socket=/var/lib/mysql/mysql.sock --port=3306
root  3914 20.3  0.0  748796  11164 ?   Ssl  Jun16 587:18 /usr/sbin/robinhood -d -f /etc/robinhood.d/tmpfs/scratch.conf -p /var/run/rbh.scratch
$ free -m
             total       used       free     shared    buffers     cached
Mem:         15949      15623        326          0        164       5447
-/+ buffers/cache:      10011       5938
Swap:         3967         17       3950
Indeed, Robinhood and MySQL don't use a lot of memory.
The 'free' output shows about 10 GB in use beyond buffers and cache, and
no listed process accounts for it, so I suspect a memory management
issue in the Lustre client (e.g. inode caching?).
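
As a rough cross-check of those figures, the arithmetic can be sketched in Python. The numbers are copied from the 'free -m' and 'ps' output quoted above; the variable names are only illustrative, and the three processes listed are assumed to be the main userspace consumers.

```python
# Figures copied from the quoted "free -m" output (values in MB).
used_mb = 15623
buffers_mb = 164
cached_mb = 5447

# Memory in use that is neither buffers nor page cache.
used_excl_cache_mb = used_mb - buffers_mb - cached_mb

# Resident set sizes (kB) of the three processes shown in the ps output.
rss_kb = {"mysqld_safe": 1196, "mysqld": 145168, "robinhood": 11164}
process_rss_mb = sum(rss_kb.values()) / 1024

# What remains is not explained by those processes or by buffers/cache.
unaccounted_mb = used_excl_cache_mb - process_rss_mb

print(f"used excluding buffers/cache: {used_excl_cache_mb} MB")
print(f"RSS of listed processes:      {process_rss_mb:.0f} MB")
print(f"not accounted for:            {unaccounted_mb:.0f} MB")
```

'free' itself prints 10011 rather than 10012 for the second line, presumably because it works from kilobyte values before rounding. Either way, roughly 10 GB is held by something that does not appear as process memory.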
You can try "cat /proc/slabinfo". It shows allocated kernel structures,
so it may indicate the guilty object.
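
If the raw file is hard to read, a small script can rank the caches. This is only a sketch: it assumes the slabinfo 2.1 column layout (name, active_objs, num_objs, objsize, ...) and estimates each cache as num_objs * objsize, which ignores per-slab overhead. The function name parse_slabinfo is made up for this example, and reading /proc/slabinfo normally needs root.

```python
def parse_slabinfo(text):
    """Return (cache_name, approx_bytes) pairs, largest first.

    Assumes the slabinfo 2.1 layout, where the first four columns are
    name, active_objs, num_objs and objsize.
    """
    caches = []
    for line in text.splitlines():
        # Skip the version line, the column header and blank lines.
        if not line.strip() or line.startswith(("slabinfo", "#")):
            continue
        fields = line.split()
        if len(fields) < 4:
            continue
        name = fields[0]
        num_objs = int(fields[2])
        objsize = int(fields[3])
        caches.append((name, num_objs * objsize))
    caches.sort(key=lambda cache: cache[1], reverse=True)
    return caches


if __name__ == "__main__":
    try:
        with open("/proc/slabinfo") as f:
            text = f.read()
    except OSError as exc:
        print(f"cannot read /proc/slabinfo: {exc}")
    else:
        # Show the ten largest caches by estimated size.
        for name, approx_bytes in parse_slabinfo(text)[:10]:
            print(f"{name:<28} {approx_bytes / 2**20:10.1f} MiB")
```

A cache at the top of this list that keeps growing during the scan would be the first candidate to look at.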
Maybe I just need to wipe the database and start over.
I don't think your issue has something to do with MySQL.
And here's the latest from the logs:
2014/06/18 09:25:51 [3914/1] STATS | ==================== Dumping stats at 2014/06/18 09:25:51 =====================
2014/06/18 09:25:51 [3914/1] STATS | ======== General statistics =========
2014/06/18 09:25:51 [3914/1] STATS | Daemon start time: 2014/06/16 14:10:04
2014/06/18 09:25:51 [3914/1] STATS | Started modules: log_reader
2014/06/18 09:25:51 [3914/1] STATS | ChangeLog reader #0:
2014/06/18 09:25:51 [3914/1] STATS | fs_name = lustre
2014/06/18 09:25:51 [3914/1] STATS | mdt_name = MDT0000
2014/06/18 09:25:51 [3914/1] STATS | reader_id = cl1
2014/06/18 09:25:51 [3914/1] STATS | records read = 9468782
2014/06/18 09:25:51 [3914/1] STATS | interesting records = 5737340
2014/06/18 09:25:51 [3914/1] STATS | suppressed records = 3731442
2014/06/18 09:25:51 [3914/1] STATS | records pending = 0
2014/06/18 09:25:51 [3914/1] STATS | last received = 2014/06/18 09:25:19
2014/06/18 09:25:51 [3914/1] STATS | last read record time = 2014/06/18 09:25:18.258686
2014/06/18 09:25:51 [3914/1] STATS | last read record id = 35953796
2014/06/18 09:25:51 [3914/1] STATS | last pushed record id = 35953796
2014/06/18 09:25:51 [3914/1] STATS | last committed record id = 35953796
2014/06/18 09:25:51 [3914/1] STATS | last cleared record id = 35953796
2014/06/18 09:25:51 [3914/1] STATS | read speed = 6.00 record/sec (0.05 incl. idle time)
2014/06/18 09:25:51 [3914/1] STATS | processing speed ratio = 0.98
2014/06/18 09:25:51 [3914/1] STATS | ChangeLog stats:
2014/06/18 09:25:51 [3914/1] STATS | MARK: 0, CREAT: 1882493, MKDIR: 99656, HLINK: 26, SLINK: 49497, MKNOD: 334, UNLNK: 1045468
2014/06/18 09:25:51 [3914/1] STATS | RMDIR: 1737, RENME: 634549, RNMTO: 0, OPEN: 0, CLOSE: 2236913, LYOUT: 0, TRUNC: 0
2014/06/18 09:25:51 [3914/1] STATS | SATTR: 3426566, XATTR: 1, HSM: 0, MTIME: 9250, CTIME: 82292, ATIME: 0
2014/06/18 09:25:51 [3914/1] STATS | ==== EntryProcessor Pipeline Stats ===
2014/06/18 09:25:51 [3914/1] STATS | Idle threads: 8
2014/06/18 09:25:51 [3914/1] STATS | Id constraints count: 0 (hash min=0/max=0/avg=0.0)
2014/06/18 09:25:51 [3914/1] STATS | Stage             | Wait | Curr | Done |   Total | ms/op |
2014/06/18 09:25:51 [3914/1] STATS | 0: GET_FID        |    0 |    0 |    0 |       0 |  0.00 |
2014/06/18 09:25:51 [3914/1] STATS | 1: GET_INFO_DB    |    0 |    0 |    0 | 6374430 |  3.60 |
2014/06/18 09:25:51 [3914/1] STATS | 2: GET_INFO_FS    |    0 |    0 |    0 | 4690135 |  8.62 |
2014/06/18 09:25:51 [3914/1] STATS | 3: REPORTING      |    0 |    0 |    0 | 3610533 |  0.01 |
2014/06/18 09:25:51 [3914/1] STATS | 4: PRE_APPLY      |    0 |    0 |    0 | 4269877 |  0.00 |
2014/06/18 09:25:51 [3914/1] STATS | 5: DB_APPLY       |    0 |    0 |    0 | 4269877 |  8.25 | 51.81% batched (avg batch size: 9.3)
2014/06/18 09:25:51 [3914/1] STATS | 6: CHGLOG_CLR     |    0 |    0 |    0 | 6374430 |  0.07 |
2014/06/18 09:25:51 [3914/1] STATS | 7: RM_OLD_ENTRIES |    0 |    0 |    0 |       0 |  0.00 |
2014/06/18 09:25:51 [3914/1] STATS | DB ops: get=3252948/ins=1017541/upd=2592992/rm=659344
This looks sane.
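
For what it's worth, a few consistency checks on those counters can be sketched as below. The values are copied from the stats dump above; the choice of checks is my own reading of what "sane" means here, not something Robinhood defines.

```python
# Counters copied from the ChangeLog reader section of the stats dump.
records_read = 9468782
interesting = 5737340
suppressed = 3731442
pending = 0
record_ids = {
    "last read": 35953796,
    "last pushed": 35953796,
    "last committed": 35953796,
    "last cleared": 35953796,
}

# Per-operation counters from the "ChangeLog stats" lines.
op_counts = {
    "MARK": 0, "CREAT": 1882493, "MKDIR": 99656, "HLINK": 26,
    "SLINK": 49497, "MKNOD": 334, "UNLNK": 1045468, "RMDIR": 1737,
    "RENME": 634549, "RNMTO": 0, "OPEN": 0, "CLOSE": 2236913,
    "LYOUT": 0, "TRUNC": 0, "SATTR": 3426566, "XATTR": 1, "HSM": 0,
    "MTIME": 9250, "CTIME": 82292, "ATIME": 0,
}

# Assumed sanity criteria (illustrative, not an official definition).
checks = {
    "read = interesting + suppressed": records_read == interesting + suppressed,
    "no records pending": pending == 0,
    "read/pushed/committed/cleared ids agree": len(set(record_ids.values())) == 1,
    "per-op counters sum to records read": sum(op_counts.values()) == records_read,
}

for label, ok in checks.items():
    print(f"{'OK  ' if ok else 'FAIL'} {label}")
```

All four checks pass on this dump: the reader is fully caught up and nothing is queued in the pipeline.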
Regards,
Thomas
_______________________________________________
robinhood-support mailing list
https://lists.sourceforge.net/lists/listinfo/robinhood-support