On 03/13/2017 11:07 AM, Dan van der Ster wrote:
On Sat, Mar 11, 2017 at 12:21 PM, <cephmailingl...@mosibi.nl> wrote:

The next and biggest problem we encountered had to do with the CRC errors on 
the OSD map. On every map update, the OSDs that had not been upgraded yet got that 
CRC error and asked the monitor for a full OSD map instead of just a delta 
update. At first we did not understand what exactly happened. We ran the 
upgrade per node using a script; that script watches the state of the 
cluster and, when the cluster is healthy again, upgrades the next host. Every 
time we started the script (skipping the already upgraded hosts) the first 
host(s) upgraded without issues and then we got blocked I/O on the cluster. The 
blocked I/O went away within a minute or 2 (not measured). After investigation 
we found out that the blocked I/O happened when nodes were asking the monitor 
for a (full) OSD map, which briefly resulted in a fully saturated network link 
on our monitor.
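
For readers who want to picture the per-node procedure described above, here is a minimal sketch of such a loop. It is not the poster's actual script: the function names, the polling interval and the timeout are all assumptions made for illustration, and the upgrade step itself is left as a caller-supplied callable. The only Ceph-specific piece is the `ceph health` command, whose output starts with HEALTH_OK when the cluster is healthy.

```python
import subprocess
import time
from typing import Callable, Iterable, List


def ceph_health_ok() -> bool:
    """Return True when `ceph health` reports HEALTH_OK (needs the ceph CLI)."""
    out = subprocess.run(
        ["ceph", "health"], capture_output=True, text=True, check=True
    ).stdout
    return out.startswith("HEALTH_OK")


def rolling_upgrade(
    hosts: Iterable[str],
    upgrade_host: Callable[[str], None],      # hypothetical: upgrades one host
    cluster_healthy: Callable[[], bool] = ceph_health_ok,
    poll_seconds: float = 10.0,               # assumed polling interval
    timeout_seconds: float = 3600.0,          # assumed per-host limit
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Upgrade hosts one at a time, waiting for a healthy cluster in between."""
    upgraded: List[str] = []
    for host in hosts:
        upgrade_host(host)
        deadline = time.monotonic() + timeout_seconds
        # Do not move on to the next host until the cluster has recovered.
        while not cluster_healthy():
            if time.monotonic() >= deadline:
                raise TimeoutError(f"cluster not healthy after upgrading {host}")
            sleep(poll_seconds)
        upgraded.append(host)
    return upgraded
```

Note that a loop like this only gates on overall cluster health. It would not by itself notice the monitor's network link saturating while the not-yet-upgraded OSDs request full maps.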


Thanks for the detailed upgrade report. I wanted to zoom in on this
CRC/fullmap issue because it could be quite disruptive for us when we
upgrade from hammer to jewel.

I've read various reports that the foolproof way to avoid the full
map DoS would be to upgrade all OSDs to jewel before the mons.
Did anyone have success with that workaround? I'm cc'ing Bryan because
he knows this issue very well.

With https://github.com/ceph/ceph/pull/13131 merged into 10.2.6, this issue shouldn't be a problem (at least we don't see it anymore).
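
If it helps when planning, a small check like the following can tell whether a reported version string is at least 10.2.6. This is only a sketch: the helper names are made up for illustration, and it compares version numbers only, so it says nothing about whether the change was backported to other release branches.

```python
import re
from typing import Tuple

# Release named in the message above as containing the merged pull request.
FIXED_IN: Tuple[int, int, int] = (10, 2, 6)


def parse_ceph_version(text: str) -> Tuple[int, int, int]:
    """Extract the first x.y.z triple from a string such as 'ceph version 10.2.6 (...)'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return (major, minor, patch)


def is_at_least(text: str, minimum: Tuple[int, int, int] = FIXED_IN) -> bool:
    """Hypothetical helper: True if the parsed version is >= minimum."""
    return parse_ceph_version(text) >= minimum
```

The input would typically be the output of `ceph --version` on each node.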

--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovh.com/us/
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
