Sorry for replying late, I answered inline.
On 10/21/18 6:00 AM, Andreas Dilger wrote:
It would be useful to post information like this on wiki.lustre.org so they can
be found more easily by others. There are already some ZFS tunings there (I
don't have the URL handy, just on a plane), so it might be useful to include
some information about the hardware and workload to give context to what this
is tuned for.
Even more interesting would be to see if there is a general set of tunings that
people agree should be made the default? It is even better when new users
don't have to seek out the various tuning parameters, and instead get good
performance out of the box.
A few comments inline...
On Oct 19, 2018, at 17:52, Riccardo Veraldi <riccardo.vera...@cnaf.infn.it>
wrote:
On 10/19/18 12:37 PM, Mohr Jr, Richard Frank (Rick Mohr) wrote:
On Oct 17, 2018, at 7:30 PM, Riccardo Veraldi <riccardo.vera...@cnaf.infn.it>
wrote:
anyway, especially on the OSSes, you may eventually need to tune some ZFS module
parameters, in particular raising the vdev_write and vdev_read max values above their
defaults. You may also disable the ZIL, change redundant_metadata to "most", and set
atime off.
I could send you a list of parameters that in my case work well.
Riccardo,
Would you mind sharing your ZFS parameters with the mailing list? I would be
interested to see which options you have changed.
This is what worked for me on my high-performance cluster:
options zfs zfs_prefetch_disable=1
This matches what I've seen in the past - at high bandwidth under concurrent
client load the prefetched data on the server is lost, and just causes needless
disk IO that is discarded.
options zfs zfs_txg_history=120
options zfs metaslab_debug_unload=1
#
options zfs zfs_vdev_scheduler=deadline
options zfs zfs_vdev_async_write_active_min_dirty_percent=20
#
options zfs zfs_vdev_scrub_min_active=48
options zfs zfs_vdev_scrub_max_active=128
#
options zfs zfs_vdev_sync_write_min_active=8
options zfs zfs_vdev_sync_write_max_active=32
options zfs zfs_vdev_sync_read_min_active=8
options zfs zfs_vdev_sync_read_max_active=32
options zfs zfs_vdev_async_read_min_active=8
options zfs zfs_vdev_async_read_max_active=32
options zfs zfs_top_maxinflight=320
options zfs zfs_txg_timeout=30
This is interesting. Is this actually setting the maximum TXG age up to 30s?
yes, I think the default is 5 seconds.
options zfs zfs_dirty_data_max_percent=40
options zfs zfs_vdev_async_write_min_active=8
options zfs zfs_vdev_async_write_max_active=32
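As a quick sanity check, the values the zfs module actually picked up can be read back at runtime under /sys/module/zfs/parameters (paths assume ZFS on Linux), for example:

# read back a couple of the tuned values
cat /sys/module/zfs/parameters/zfs_txg_timeout
cat /sys/module/zfs/parameters/zfs_vdev_async_write_max_active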
##############
These are the ZFS attributes that I changed on the OSSes:
zfs set mountpoint=none $ostpool
zfs set sync=disabled $ostpool
zfs set atime=off $ostpool
zfs set redundant_metadata=most $ostpool
zfs set xattr=sa $ostpool
zfs set recordsize=1M $ostpool
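To confirm these properties took effect on the OST dataset, something like this should do (assuming $ostpool is set as above):

zfs get sync,atime,redundant_metadata,xattr,recordsize $ostpool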
The recordsize=1M is already the default for Lustre OSTs.
Did you disable multimount, or just not include it here? That is fairly
important for any multi-homed ZFS storage, to prevent multiple imports.
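For what it's worth, on ZFS 0.7+ the usual guard against double imports on multi-homed storage is the pool-level multihost (MMP) property; a minimal sketch, assuming each server already has a unique non-zero hostid configured:

# enable MMP on the OST pool and verify it
zpool set multihost=on $ostpool
zpool get multihost $ostpool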
#################
These are the ko2iblnd parameters for FDR Mellanox IB interfaces:
options ko2iblnd timeout=100 peer_credits=63 credits=2560 concurrent_sends=63
ntx=2048 fmr_pool_size=1280 fmr_flush_trigger=1024 ntx=5120
You have ntx= in there twice...
Yes, it is a mistake, I typed it twice.
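Since ntx= appears twice in the options line above, it may be worth reading back which value the module actually ended up with once it is loaded, e.g.:

cat /sys/module/ko2iblnd/parameters/ntx
cat /sys/module/ko2iblnd/parameters/peer_credits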
If this provides a significant improvement for FDR, it might make sense to add in
machinery to lustre/conf/{ko2iblnd-probe,ko2iblnd.conf} to have a new alias
"ko2iblnd-fdr" set these values on Mellanox FDB IB cards by default?
I found it works better with FDR.
Anyway, most of the tunings I did were taken here and there from reading what other
people did. So mostly from here:
* https://lustre.ornl.gov/lustre101-courses/content/C1/L5/LustreTuning.pdf
* https://www.eofs.eu/_media/events/lad15/15_chris_horn_lad_2015_lnet.pdf
* https://lustre.ornl.gov/ecosystem-2015/documents/LustreEco2015-Tutorial2.pdf
And by the way, the most effective tweaks came from reading Rick Mohr's
advice in LustreTuning.pdf. Thanks Rick!
############
These are the ksocklnd parameters:
options ksocklnd sock_timeout=100 credits=2560 peer_credits=63
##############
These are other parameters that I tweaked:
echo 32 > /sys/module/ptlrpc/parameters/max_ptlrpcds
echo 3 > /sys/module/ptlrpc/parameters/ptlrpcd_bind_policy
This parameter is marked as obsolete in the code.
Yes, I should fix my configuration and use the new parameters.
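If these are meant to survive a reboot, the same values can also go into a modprobe file instead of being echoed at boot; a sketch, assuming /etc/modprobe.d/ptlrpc.conf (and keeping in mind that ptlrpcd_bind_policy is marked obsolete, as noted above):

# /etc/modprobe.d/ptlrpc.conf
options ptlrpc max_ptlrpcds=32 ptlrpcd_bind_policy=3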
lctl set_param timeout=600
lctl set_param ldlm_timeout=200
lctl set_param at_min=250
lctl set_param at_max=600
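These lctl settings do not persist across reboots; if they are meant to be permanent, one option on Lustre 2.5+ is to set them once from the MGS with the -P flag, e.g.:

lctl set_param -P timeout=600
lctl set_param -P at_max=600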
###########
Also, I run this script at boot time to redistribute the hard drive IRQ assignments
across all CPUs; this is not needed for kernels > 4.4:
#!/bin/sh
# numa_smp.sh - spread a device's IRQs round-robin over a range of CPUs
device=$1   # device name as it appears in /proc/interrupts
cpu1=$2     # first CPU of the range
cpu2=$3     # last CPU of the range
cpu=$cpu1
grep "$device" /proc/interrupts | awk '{print $1}' | sed 's/://' | while read int
do
    echo $cpu > /proc/irq/$int/smp_affinity_list
    echo "echo CPU $cpu > /proc/irq/$int/smp_affinity_list"
    if [ "$cpu" = "$cpu2" ]
    then
        cpu=$cpu1
    else
        cpu=$((cpu+1))
    fi
done
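For reference, the script takes the device name as it appears in /proc/interrupts plus the first and last CPU of the range to spread over; a hypothetical invocation (device name and CPU numbers are only placeholders) would be:

./numa_smp.sh mpt3sas0 0 11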
Cheers, Andreas
---
Andreas Dilger
Principal Lustre Architect
Whamcloud
_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org