On 18/12/2018 18:28, Oliver Freyermuth wrote:
> We have yet to observe these hangs; we've been running this setup with
> ~5 VMs and ~10 disks for about half a year now, with daily snapshots.
> But all of these VMs have very "low" I/O, since we put anything
> I/O-intensive on bare metal (but with automated provisioning, of
> course).
> So I'll chime in on your question, especially since there may be VMs
> on our cluster in the future whose guest OS is not running an agent.
> Since we have not observed this yet, I'll also ask: what's your
> "scale"? Hundreds of VMs/disks? Hourly snapshots? I/O-intensive VMs?
5 hosts, 15 VMs, daily snapshots. I/O is variable (customer workloads);
usually not that high, but it can easily peak at 100% when certain
things happen. We don't have great I/O performance (RBD over 1 Gbps
links to HDD OSDs).
I'm poring through monitoring graphs now, and I think the issue this
time around was simply too much dirty data in a guest's page cache. The
VM that failed spent 3 minutes flushing writes to disk before its I/O
was quiesced, at around 100 IOPS (actual data throughput was low,
though, so these were small writes). That exceeded our timeout, and
things went south from there.
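
If the bottleneck really is accumulated dirty pages, one mitigation
inside a Linux guest is to cap how much dirty data the kernel may
buffer, which bounds the amount a freeze-triggered sync has to flush.
A minimal sketch (requires root; the byte values are arbitrary
examples, not recommendations):

    # Cap the guest's dirty page cache so a freeze-triggered sync has
    # a bounded amount of data to flush. Values are examples only.
    from pathlib import Path

    # Start background writeback once 64 MiB of dirty data piles up...
    Path("/proc/sys/vm/dirty_background_bytes").write_text("67108864\n")
    # ...and throttle writers outright beyond 256 MiB of dirty data.
    Path("/proc/sys/vm/dirty_bytes").write_text("268435456\n")

Setting the *_bytes variants disables the corresponding *_ratio
sysctls; to persist across reboots, the same keys would go into
/etc/sysctl.conf.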
I wasn't sure whether fsfreeze did a full sync to disk, but given the
I/O behavior I'm seeing, that seems to be the case. Unfortunately,
coming up with an upper bound for the freeze time seems tricky now. I'm
increasing our timeout to 15 minutes; we'll see if the problem recurs.
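
For what it's worth, fsfreeze(8) is essentially a thin wrapper around
the FIFREEZE ioctl, and the kernel syncs the filesystem as part of the
freeze before blocking new writes, which is why freeze time scales with
the amount of dirty data. A minimal Python sketch of the same calls
(requires root; the mountpoint is a placeholder):

    # What fsfreeze(8) boils down to: the FIFREEZE/FITHAW ioctls.
    # FIFREEZE syncs the filesystem first, so it blocks for as long
    # as there is dirty data left to flush.
    import fcntl
    import os

    FIFREEZE = 0xC0045877  # _IOWR('X', 119, int) from <linux/fs.h>
    FITHAW = 0xC0045878    # _IOWR('X', 120, int)

    fd = os.open("/mnt/data", os.O_RDONLY)  # placeholder mountpoint
    try:
        fcntl.ioctl(fd, FIFREEZE, 0)  # returns once the fs is frozen
        # ... take the snapshot here ...
    finally:
        fcntl.ioctl(fd, FITHAW, 0)
        os.close(fd)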
Given this, it makes even more sense to just avoid the freeze if at all
reasonable. As far as I can tell, there's no real way to guarantee that
an fsfreeze will complete in a "reasonable" amount of time.
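
One way to fail safe, as a sketch: wrap the freeze in a hard timeout
and fall back to a crash-consistent snapshot when it expires. This
assumes libvirt's virsh domfsfreeze/domfsthaw; the domain name and
timeout are placeholders, and take_snapshot is a hypothetical stand-in
for the actual RBD snapshot step:

    # Sketch: try an agent freeze with a hard timeout; on timeout,
    # thaw (the agent may still complete the freeze after we give up)
    # and settle for a crash-consistent snapshot instead.
    import subprocess

    def freeze_with_timeout(domain: str, timeout_s: float) -> bool:
        try:
            subprocess.run(["virsh", "domfsfreeze", domain],
                           check=True, timeout=timeout_s)
            return True
        except (subprocess.TimeoutExpired,
                subprocess.CalledProcessError):
            subprocess.run(["virsh", "domfsthaw", domain], check=False)
            return False

    def take_snapshot(domain: str, consistent: bool) -> None:
        # Hypothetical placeholder for the actual snapshot step.
        print(f"snapshot {domain} (consistent={consistent})")

    if freeze_with_timeout("vm01", 900):  # placeholder domain/timeout
        take_snapshot("vm01", consistent=True)
        subprocess.run(["virsh", "domfsthaw", "vm01"], check=True)
    else:
        take_snapshot("vm01", consistent=False)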
--
Hector Martin (hec...@marcansoft.com)
Public Key: https://marcan.st/marcan.asc