On 03/13/2017 11:07 AM, Dan van der Ster wrote:
On Sat, Mar 11, 2017 at 12:21 PM, <cephmailingl...@mosibi.nl> wrote:
The next and biggest problem we encountered had to do with the CRC errors on
the OSD map. On every map update, the OSDs that had not been upgraded yet got
that CRC error and asked the monitor for a full OSD map instead of just a delta
update. At first we did not understand what exactly happened. We ran the
upgrade per node using a script; that script watches the state of the
cluster, and when the cluster is healthy again, we upgrade the next host. Every
time we started the script (skipping the already upgraded hosts) the first
host(s) upgraded without issues and then we got blocked I/O on the cluster. The
blocked I/O went away within a minute or two (not measured). After investigation
we found out that the blocked I/O happened when nodes were asking the monitor
for a (full) OSD map, which briefly resulted in a fully saturated network link
on our monitor.
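
For readers who want to picture the health-gated, per-host upgrade loop
described above, here is a minimal sketch. It is not the script used in the
report. It assumes the `ceph` CLI is on the PATH and that `ceph health` prints
a line starting with HEALTH_OK when the cluster is healthy. `upgrade_host` is
a hypothetical callable that you would supply yourself, and `rolling_upgrade`,
`cluster_is_healthy`, and `poll_seconds` are illustrative names only.

```python
import subprocess
import time


def cluster_is_healthy():
    """Return True when `ceph health` reports HEALTH_OK."""
    out = subprocess.run(
        ["ceph", "health"], capture_output=True, text=True, check=True
    ).stdout
    return out.startswith("HEALTH_OK")


def rolling_upgrade(hosts, upgrade_host, is_healthy=cluster_is_healthy,
                    poll_seconds=10, sleep=time.sleep):
    """Upgrade hosts one at a time, waiting for a healthy cluster each time.

    upgrade_host is a hypothetical, caller-supplied function that performs
    the actual package upgrade and daemon restart on a single host.
    """
    done = []
    for host in hosts:
        # Block until the cluster has recovered from the previous step.
        while not is_healthy():
            sleep(poll_seconds)
        upgrade_host(host)
        done.append(host)
    return done
```

Note that, as the report itself shows, a health gate like this did not prevent
the blocked I/O: the full-map requests hit the monitor regardless of whether
the cluster reported healthy before each host was upgraded.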
Thanks for the detailed upgrade report. I wanted to zoom in on this
CRC/fullmap issue because it could be quite disruptive for us when we
upgrade from hammer to jewel.
I've read various reports that the foolproof way to avoid the full
map DoS would be to upgrade all OSDs to jewel before the mons.
Did anyone have success with that workaround? I'm cc'ing Bryan because
he knows this issue very well.
With https://github.com/ceph/ceph/pull/13131 merged into 10.2.6, this issue
shouldn't be a problem (at least we don't see it anymore).
--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovh.com/us/