> On Aug. 12, 2014, 11:16 a.m., daan Hoogland wrote:
> > c4b78c3aaa8df20c8e892b9d5108d8f34f96ed0c on 4.4
> > 37baddd7212717f259c33b3bb75720d718b92d2c on master

- daan


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/24598/#review50304
-----------------------------------------------------------


On Aug. 12, 2014, 11:21 a.m., Joris van Lieshout wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/24598/
> -----------------------------------------------------------
> 
> (Updated Aug. 12, 2014, 11:21 a.m.)
> 
> 
> Review request for cloudstack, Alex Huang, anthony xu, daan Hoogland, edison su, Kishan Kavala, Min Chen, Sanjay Tripathi, and Hugo Trippaers.
> 
> 
> Bugs: CLOUDSTACK-7319
>     https://issues.apache.org/jira/browse/CLOUDSTACK-7319
> 
> 
> Repository: cloudstack-git
> 
> 
> Description
> -------
> 
> We noticed that the dd process was way too aggressive on Dom0, causing all kinds of problems on a XenServer with medium workloads.
> ACS uses the dd command to copy incremental snapshots to secondary storage. This process is too heavy on Dom0 resources, impacts DomU performance, and can even lead to domain freezes (including Dom0) of more than a minute. We've found that this is because the Dom0 kernel caches the read and write operations of dd.
> Some of the issues we have seen as a consequence of this are:
> - DomU performance degradation/freezes
> - OVS freezing and not forwarding any traffic
>   - including LACPDUs, resulting in the bond going down
> - keepalived heartbeat packets between RRVMs not being sent/received, resulting in flapping RRVM master state
> - broken snapshot copy processes
> - the XenServer heartbeat script reaching its timeout and fencing the server
> - poolmaster connection loss
> - ACS marking the host as down and fencing the instances even though they are still running on the original host, resulting in the same instance running on two hosts in one cluster
> - VHD corruption as a result of some of the issues mentioned above
> We've developed a patch for the XenServer script /etc/xapi.d/plugins/vmopsSnapshot that adds the direct flag to both the input and output files (iflag=direct oflag=direct).
> Our tests have shown that Dom0 load during snapshot copy is way lower.
> 
> We believe Hotfix 4 for XS62 SP1 contains a similar fix, but for the sparse dd process used for the first copy of a chain.
> 
> http://support.citrix.com/article/CTX140417
> 
> == begin quote ==
> Copying a virtual disk between SRs uses unbuffered I/O to avoid polluting the pagecache in the Control Domain (dom0). This reduces the dom0 vCPU overhead and allows the pagecache to work more effectively for other operations.
> == end quote ==
> 
> 
> Diffs
> -----
> 
>   scripts/vm/hypervisor/xenserver/vmopsSnapshot 5fd69a6
> 
> Diff: https://reviews.apache.org/r/24598/diff/
> 
> 
> Testing
> -------
> 
> We are running this fix in our beta and prod environments (both using ACS 4.3.0) with great success.
> 
> 
> Thanks,
> 
> Joris van Lieshout
> 
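For readers following along, the change described above boils down to opening dd's input and output with O_DIRECT. A minimal sketch of the effect (file names and block size are made up for illustration and are not taken from vmopsSnapshot):

```shell
#!/bin/sh
# Sketch only: shows dd with O_DIRECT on both ends, as the patch does for
# the snapshot copy. Paths and block size here are hypothetical.
SRC=src.img
DST=dst.img

# Create a small test source (4 MiB of zeroes).
dd if=/dev/zero of="$SRC" bs=1M count=4 2>/dev/null

# Buffered copy (pre-patch behaviour): every read and write goes through
# the Dom0 pagecache, evicting hot data and loading Dom0 under pressure:
#   dd if="$SRC" of="$DST" bs=1M

# Unbuffered copy (patched behaviour): iflag=direct/oflag=direct open the
# files with O_DIRECT, bypassing the pagecache. O_DIRECT is not supported
# on every filesystem (e.g. tmpfs), hence the buffered fallback here.
dd if="$SRC" of="$DST" bs=1M iflag=direct oflag=direct 2>/dev/null \
  || dd if="$SRC" of="$DST" bs=1M 2>/dev/null

cmp -s "$SRC" "$DST" && echo "copy ok"
```

Note that O_DIRECT requires the transfer size to be aligned to the underlying block size, which a bs that is a multiple of 512 satisfies; the actual patch applies the flags inside the plugin's existing dd invocation rather than in a standalone script like this.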