I could easily see that being the case, especially with Micron as a common thread, but it appears that I am on the latest FW for both the SATA and the NVMe drives:
> $ sudo ./msecli -L | egrep 'Device|FW'
> Device Name : /dev/sda
> FW-Rev : D0MU027
> Device Name : /dev/sdb
> FW-Rev : D0MU027
> Device Name : /dev/sdc
> FW-Rev : D0MU027
> Device Name : /dev/sdd
> FW-Rev : D0MU027
> Device Name : /dev/sde
> FW-Rev : D0MU027
> Device Name : /dev/sdf
> FW-Rev : D0MU027
> Device Name : /dev/sdg
> FW-Rev : D0MU027
> Device Name : /dev/sdh
> FW-Rev : D0MU027
> Device Name : /dev/sdi
> FW-Rev : D0MU027
> Device Name : /dev/sdj
> FW-Rev : D0MU027
> Device Name : /dev/nvme0
> FW-Rev : 0091634

D0MU027 and 1634 are the latest FW releases reported by Micron, current as of 04/12/2017 and 12/07/2016, respectively. It could be that the current FW doesn't play nice, so that's on the table. But for now, it's a thread that can't be pulled any further.

Appreciate the feedback,

Reed

> On Jul 6, 2017, at 1:18 PM, Peter Maloney <peter.malo...@brockmann-consult.de> wrote:
>
> Hey,
>
> I have some SAS Micron S630DC-400 which came with firmware M013, which did the same or worse (takes very long... 100% blocked for about 5 min for 16 GB trimmed), and which work just fine with firmware M017 (4 s for 32 GB trimmed). So maybe you just need an update.
>
> Peter
>
> On 07/06/17 18:39, Reed Dier wrote:
>> Hi Wido,
>>
>> I came across this ancient ML entry with no responses and wanted to follow up with you to see if you recalled any solution to this.
>> Copying the ceph-users list so that any replies are preserved for the archives.
>>
>> I have a couple of boxes with 10x Micron 5100 SATA SSDs, journaled on Micron 9100 NVMe SSDs; ceph 10.2.7; Ubuntu 16.04 with the 4.8 kernel.
>>
>> I have now noticed twice that SSDs were flapping because fstrim was eating up 100% of the io.
>> It eventually righted itself after a little less than 8 hours.
>> The noout flag was set, so it didn't create any unnecessary rebalance or whatnot.
>>
>> The timeline shows that only one OSD ever went down at a time, but they seemed to go down in a rolling fashion during the fstrim session.
>> You can actually see in the OSD graph all 10 OSDs on this node go down one by one over time.
>>
>> And the OSDs were going down because of:
>>
>>> 2017-07-02 13:47:32.618752 7ff612721700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7ff5ecd0c700' had timed out after 15
>>> 2017-07-02 13:47:32.618757 7ff612721700 1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7ff608d9e700' had timed out after 60
>>> 2017-07-02 13:47:32.618760 7ff612721700 1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7ff608d9e700' had suicide timed out after 180
>>> 2017-07-02 13:47:32.624567 7ff612721700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(const ceph::heartbeat_handle_d*, const char*, time_t)' thread 7ff612721700 time 2017-07-02 13:47:32.618784
>>> common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")
>>
>> I am curious whether you were able to nice it or do something similar to mitigate this issue?
>> Oddly, I have similar machines with Samsung SM863a drives and Intel P3700 journals that do not appear to be affected by the fstrim load issue, despite having identical weekly cron jobs enabled. Only the (newer) Micron drives have had these issues.
>>
>> Appreciate any pointers,
>>
>> Reed
>>
>>> Wido den Hollander wido at 42on.com
>>> Tue Dec 9 01:21:16 PST 2014
>>>
>>> Hi,
>>>
>>> Last Sunday I got a call early in the morning that a Ceph cluster was having some issues. Slow requests and OSDs marking each other down.
>>>
>>> Since this is a 100% SSD cluster, I was a bit confused and started investigating.
>>>
>>> It took me about 15 minutes to see that fstrim was running and was utilizing the SSDs 100%.
>>>
>>> On Ubuntu 14.04 there is a weekly cron job which executes fstrim-all. It detects all mountpoints which can be trimmed and starts to trim them.
>>>
>>> On the Intel SSDs used here, it caused them to become 100% busy for a couple of minutes. That was enough for them to no longer respond to heartbeats, thus timing out and being marked down.
>>>
>>> Luckily we had the "out interval" set to 1800 seconds on that cluster, so no OSD was marked as "out".
>>>
>>> fstrim-all does not execute fstrim with an ionice priority. From what I understand, but haven't tested yet, running fstrim with ionice -c Idle should solve this.
>>>
>>> It's weird that this issue didn't come up earlier on that cluster, but after killing fstrim all problems were resolved and the cluster ran happily again.
>>>
>>> So watch out for fstrim on early Sunday mornings on Ubuntu!
>>>
>>> --
>>> Wido den Hollander
>>> 42on B.V.
>>> Ceph trainer and consultant
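For what it's worth, Wido's ionice suggestion above would amount to something like the following replacement for the stock weekly cron job. This is a minimal sketch, assuming the Ubuntu 16.04 layout where /etc/cron.weekly/fstrim simply runs fstrim --all; the filesystem types listed are an assumption, so adjust them for your own mounts, and note that the idle I/O class only takes effect under the CFQ scheduler:

    #!/bin/sh
    # Sketch of /etc/cron.weekly/fstrim (not the stock script).
    # Trim each mounted filesystem one at a time in the idle I/O
    # class (ionice -c3), so trim requests yield to OSD traffic
    # instead of saturating every SSD at once.
    for fs in $(findmnt -rno TARGET -t ext4,xfs); do
        ionice -c3 fstrim "$fs" || true
    done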
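Likewise, the "out interval" Wido mentions corresponds to the mon_osd_down_out_interval option; a sketch of setting it, with 1800 simply mirroring the value he describes:

    # ceph.conf, [mon] section: a down OSD is not marked "out"
    # until it has been down for 1800 seconds
    mon osd down out interval = 1800

    # or injected at runtime on a live cluster:
    ceph tell mon.* injectargs '--mon-osd-down-out-interval 1800'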
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com