Hi Carmelo,
* An overview of your robinhood pipeline would help determine where it
spends time:
grep STATS /var/log/robinhood.daint.log
* Running "sar -d" on your robinhood host for the DB storage disk, is the
last column (%util) close to 100%?
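To make that check repeatable, here is a minimal sketch. It assumes the usual sysstat layout, where the header row carries a DEV column and %util is the last field. The function name busy_devices and the 90% threshold are only illustrative, they are not part of sysstat.

```shell
busy_devices() {
    # $1: %util threshold (default 90). Reads `sar -d` output on stdin
    # and prints "device %util" for every sample at or above the threshold.
    awk -v limit="${1:-90}" '
        /^Average:/ { next }                        # skip the summary rows
        {
            for (i = 1; i <= NF; i++)
                if ($i == "DEV") { dev = i; next }  # header row: remember the device column
        }
        dev && NF > dev && $NF + 0 >= limit { print $dev, $NF }
    '
}
```

Typical use would be "sar -d | busy_devices 90". With "sar -d -p" the devices should show up under names like sda instead of dev8-0.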
* If your DB disk backend is not fast, I suggest you use DB request
batching. In robinhood config (EntryProcessor block):
max_batch_size = 1000;
Or you can try increasing processing threads, in EntryProcessor block:
nb_threads = 32 ;
See:
https://github.com/cea-hpc/robinhood/wiki/tmpfs_admin_guide#entry-processor-pipeline-options
* Looking at "top", what is the memory and CPU usage of mysqld and
robinhood under full load?
* Regarding the DB tunings:
- you can increase innodb_buffer_pool_size (e.g. to 100G, as your host
memory is 132GB), so you will cache more of your DB data.
- try with "innodb_flush_log_at_trx_commit = 2" instead of 0.
- you can add the following tunings:
performance_schema
innodb_additional_mem_pool_size = 16M
innodb_file_per_table = 1
innodb_flush_method=O_DIRECT
innodb_write_io_threads = 32
innodb_read_io_threads = 32
# how many IOPS can your DB storage handle?
innodb_io_capacity=50000
innodb_thread_concurrency = 0
innodb_log_files_in_group = 4
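Before changing any of these, it can help to confirm what the [mysqld] section currently sets. The following is only a sketch: the helper name cnf_value is made up for illustration, and it assumes a plain key = value file without !include directives. It treats dashes and underscores in option names as equivalent, as MySQL does.

```shell
cnf_value() {
    # $1: option name (underscore form), $2: path to my.cnf.
    # Prints the value set in the [mysqld] section, if any.
    awk -v want="$1" '
        /^[ \t]*\[/ { in_mysqld = ($0 ~ /^[ \t]*\[mysqld\]/); next }  # section header
        !in_mysqld || /^[ \t]*[#;]/ { next }                            # other section or comment
        {
            eq = index($0, "=")
            if (!eq) next                        # bare flag such as large-pages
            key = substr($0, 1, eq - 1); val = substr($0, eq + 1)
            gsub(/[ \t]/, "", key); gsub(/-/, "_", key)
            gsub(/^[ \t]+|[ \t]+$/, "", val)
            if (key == want) print val
        }
    ' "$2"
}
```

For example, "cnf_value innodb_buffer_pool_size /etc/my.cnf" (the path is an assumption about your install). The value the running server actually uses can be compared with: mysql -e "SHOW VARIABLES LIKE 'innodb_buffer_pool_size'".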
Regards,
Thomas
On 04/20/15 16:04, [email protected] wrote:
Subject: Robinhood: tuning advice request
From: "Carmelo Ponti (CSCS)" <[email protected]>
Date: 04/20/15 15:51
Dear all
We installed robinhood 2.5.4 on a dual-socket server (8-core E5-2650 v2
CPUs @ 2.60GHz) with 132 GB of RAM and we want to manage a 2.7 PB
Lustre filesystem based on a SONEXION 1600, 24 x SSU storage. Currently
the file system is 62% full and the Lustre Changelog is activated.
Our policies are to purge all files older than 30 days and remove all
empty directories.
Last Friday I restarted everything from scratch because the Lustre
Changelog was too full and robinhood could not manage it anymore. First
I stopped robinhood, then I recreated the mysql DB, and then I started
robinhood again as follows:
# lfs changelog_clear snx11026-MDT0000 cl1 0 ; /etc/init.d/robinhood start
with /etc/sysconfig/robinhood configured as follows:
RBH_OPT="--scan --read-log --purge --rmdir"
At the same time I monitored the number of Changelog entries with
ganglia by counting lines:
# /usr/bin/lfs changelog snx11026-MDT0000 | wc -l
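The same line count can be sampled periodically to see whether the backlog is growing or draining. This is a rough sketch only: the function names backlog_rate and sample_backlog are hypothetical, and dumping a multi-million-record changelog on every sample is itself expensive, so the interval should be generous.

```shell
backlog_rate() {
    # $1: earlier record count, $2: later record count, $3: seconds between samples.
    # Prints the change in records per second (negative when the backlog shrinks).
    awk -v a="$1" -v b="$2" -v dt="$3" 'BEGIN { printf "%.1f\n", (b - a) / dt }'
}

sample_backlog() {
    # $1: MDT device (e.g. snx11026-MDT0000), $2: sampling interval in seconds.
    prev=$(lfs changelog "$1" | wc -l)
    while sleep "$2"; do
        cur=$(lfs changelog "$1" | wc -l)
        echo "$(date '+%F %T') backlog=$cur rate=$(backlog_rate "$prev" "$cur" "$2")/s"
        prev=$cur
    done
}
```

For example, "sample_backlog snx11026-MDT0000 300" would print one line every five minutes. A rate that stays positive means robinhood is consuming records more slowly than the MDT produces them.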
The scan lasted more than 24h and the number of lines in the Lustre
Changelog grew to 5 million, then shrank and stabilized at 50k - 100k.
The scan finished Sunday morning without errors:
2015/04/19 00:37:17 robinhood@daintrbh01[3579/22] FS_Scan | Full scan
of /scratch/daint completed, 26424073 entries found (0 errors). Duration
= 120700.99s
After that the Lustre Changelog started to grow again, so I decided to
restart robinhood without --scan:
RBH_OPT="--read-log --purge --rmdir"
Everything worked perfectly until this morning, when users started to
work very intensively and the Lustre Changelog grew to 8M entries. The
load average is constantly around 39.34, 37.81, 35.30 and the situation
is getting worse and worse.
Is the HW we are using too small for our file system size and usage?
Could you please advise me on some tuning?
Thank you in advance
Carmelo Ponti
Here are some further details about our setup:
OS version
----------
CentOS release 6.4 (Final)
SONEXION Lustre version
-----------------------
$ cat /proc/fs/lustre/version
lustre: 2.1.0
kernel: patchless_client
build:
jenkins-163-gf03b2cb-CHANGED-2.6.32-220.7.1.el6.lustre.4026.x86_64
LUSTRE client version
---------------------
# cat /proc/fs/lustre/version
lustre: 2.5.3
kernel: patchless_client
build: 2.5.3-RC1--PRISTINE-2.6.32-431.5.1.el6.x86_64
MYSQL version
-------------
# mysql --version
mysql Ver 14.14 Distrib 5.5.40, for Linux (x86_64) using readline 5.1
MYSQL my.cnf
------------
[mysqld]
large-pages
datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock
user=mysql
symbolic-links=0
max_connections= 128
innodb_flush_log_at_trx_commit = 0
innodb_buffer_pool_size= 30G
innodb_max_dirty_pages_pct= 15
innodb_thread_concurrency= 32
innodb_log_file_size= 100M
innodb_log_buffer_size= 50M
innodb_data_file_path= ibdata1:1G:autoextend
innodb_lock_wait_timeout=120
table-open-cache= 2000
sort-buffer-size= 32M
read-buffer-size= 16M
read-rnd-buffer-size= 4M
thread-cache-size= 128
query-cache-size= 40M
query-cache-limit= 1M
tmp-table-size= 16M
[mysqld_safe]
log-error=/var/log/mysqld.log
pid-file=/var/run/mysqld/mysqld.pid
/etc/sysctl.conf
----------------
# Kernel sysctl configuration file for Red Hat Linux
#
# For binary values, 0 is disabled, 1 is enabled. See sysctl(8) and
# sysctl.conf(5) for more details.
# Controls IP packet forwarding
net.ipv4.ip_forward = 0
# Controls source route verification
net.ipv4.conf.default.rp_filter = 2
# Do not accept source routing
net.ipv4.conf.default.accept_source_route = 0
# Controls the System Request debugging functionality of the kernel
kernel.sysrq = 0
# Controls whether core dumps will append the PID to the core filename.
# Useful for debugging multi-threaded applications.
kernel.core_uses_pid = 1
# Controls the use of TCP syncookies
net.ipv4.tcp_syncookies = 1
# Disable netfilter on bridges.
#net.bridge.bridge-nf-call-ip6tables = 0
#net.bridge.bridge-nf-call-iptables = 0
#net.bridge.bridge-nf-call-arptables = 0
# Controls the maximum size of a message, in bytes
kernel.msgmnb = 65536
# Controls the default maxmimum size of a mesage queue
kernel.msgmax = 65536
# Controls the maximum shared segment size, in bytes
kernel.shmmax = 50000000000
# Controls the maximum number of shared memory segments, in pages
kernel.shmall = 32534377267
vm.nr_hugepages = 50000
vm.hugetlb_shm_group = 27
kernel.sched_autogroup_enabled = 0
kernel.core_pattern = /tmp/core-%e-%s-%u-%g-%p-%t
rbh_daily.conf
--------------
##########################################
# Robinhood configuration file template #
##########################################
# Global configuration
General
{
fs_path = "/scratch/daint" ;
lock_file = "/var/locks/robinhood.lock" ;
stay_in_fs = TRUE ;
check_mounted = TRUE ;
}
# Log configuration
Log
{
debug_level = EVENT ;
log_file = "/var/log/robinhood.daint.log" ;
report_file = "/var/log/robinhood_reports.daint.log" ;
alert_file = "/var/log/robinhood_alerts.daint.log" ;
stats_interval = 1min ;
batch_alert_max = 5000 ;
alert_show_attrs = FALSE ;
log_procname = TRUE;
log_hostname = TRUE;
log_module = TRUE;
}
# List Manager configuration
ListManager
{
commit_behavior = autocommit ;
connect_retry_interval_min = 1 ;
connect_retry_interval_max = 30 ;
user_acct = enabled ;
group_acct = enabled ;
MySQL
{
server = "" ;
db = "" ;
user = "" ;
password_file = "" ;
engine = InnoDB ;
}
}
# Policies configuration
db_update_policy
{
md_update = on_event_periodic(1sec,1min) ;
path_update = on_event ;
}
# Entry Processor configuration
EntryProcessor
{
Alert Too_many_entries_in_directory
{
type == directory
and
dircount > 900000
}
Alert Large_file
{
type == file
and
size > 200GB
}
nb_threads = 16 ;
max_pending_operations = 100000 ;
max_batch_size = 1;
match_classes = TRUE;
detect_fake_mtime = FALSE;
}
# FS Scan configuration
FS_Scan
{
min_scan_interval = 12h ;
max_scan_interval = 1d ;
nb_threads_scan = 16 ;
scan_retry_delay = 1h ;
scan_op_timeout = 1h ;
exit_on_timeout = TRUE ;
spooler_check_interval = 1min ;
nb_prealloc_tasks = 256 ;
Ignore
{
# ignore ".snapshot" and ".snapdir" directories (don't scan them)
type == directory
and
( name == ".snapdir" or name == ".snapshot" )
}
}
ChangeLog
{
MDT
{
mdt_name = "MDT0000" ;
reader_id = "cl1" ;
}
# clear changelog every 1024 records:
batch_ack_count = 1024 ;
force_polling = ON ;
polling_interval = 1s ;
queue_max_size = 1000 ;
queue_max_age = 5s ;
queue_check_interval = 1s ;
}
Purge_Policies
{
policy default
{
condition { last_access > 30d }
}
}
Purge_Parameters
{
nb_threads_purge = 16 ;
post_purge_df_latency = 1min ;
}
Purge_Trigger
{
trigger_on = global_usage ;
high_watermark_pct = 59% ;
low_watermark_pct = 40% ;
check_interval = 24h ;
alert_high = TRUE;
notify_hw = TRUE;
alert_lw = TRUE;
}
rmdir_policy {
age_rm_empty_dirs = 30d ;
}
rmdir_parameters {
runtime_interval = 12h ;
nb_threads_rmdir = 8 ;
}
--
----------------------------------------------------------------------
Carmelo Ponti System Engineer CSCS Swiss Center for Scientific
Computing Via Trevano 131 Email: [email protected] CH-6900 Lugano
http://www.cscs.ch Phone: +41 91 610 82 15/Fax: +41 91 610 82 82
----------------------------------------------------------------------
_______________________________________________
robinhood-support mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/robinhood-support