Well, you've made me curious.
Here is a simple test. I've just run it on XCP 1.6:
1. Run fio (any config will do, just to create some load - in my case it
was 16 concurrent operations).
2. Put the domain into the paused state (xl pause) - to catch it with
some IO in flight.
3. Wait for 150 seconds or more.
4. Resume the domain (a rough sketch of the commands is below).
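Something like this (just a sketch - the domain name, the disk and the
fio parameters are placeholders, adjust to your setup):

    # inside the domU: generate steady IO load against a spare disk
    fio --name=pausetest --filename=/dev/xvdb --rw=randwrite \
        --direct=1 --iodepth=16 --numjobs=1 --runtime=600 --time_based

    # on the dom0: freeze the guest while IO is in flight
    xl pause testvm
    sleep 150        # longer than the usual 120 s IO timeout
    xl unpause testvm

    # then watch the guest dmesg/console for IO errors or RCU stalls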
Amazingly, I got some nasty lags after the resume, but no IO errors. I
repeated that operation a few times:
1. No problems.
2. An RCU stall trace: [5040301.442715] INFO: rcu_sched_state detected
stall on CPU 0 (t=50594 jiffies)
3. No problems.
That's really strange, because I have seen IO errors a few times on
virtual machines without any sign of problems on the dom0.
I'll research that topic more next week and report the results here.
On 22.06.2013 02:55, Nathan March wrote:
On 6/21/2013 1:16 AM, George Shuklin wrote:
I'm not talking about the dom0, mostly, but about the domU kernel. If an
IO takes more than 120 seconds, it will be treated as an 'io timeout'.
And that timeout is hardcoded (no /sys or /proc variables).
If you are getting an IO timeout in less than 2 minutes - that is a
different question.
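(A quick way to see whether a guest disk exposes a tunable request
timeout at all - just a sketch; SCSI disks usually have such a knob,
xvd* blkfront disks normally do not:)

    # list block devices and whether they have a sysfs timeout attribute
    for dev in /sys/block/*; do
        if [ -f "$dev/device/timeout" ]; then
            echo "$(basename "$dev"): $(cat "$dev/device/timeout") s"
        else
            echo "$(basename "$dev"): no timeout attribute"
        fi
    done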
Hi George,
Sorry if I'm misunderstanding, but I don't believe it's a domU issue,
as I've run identical virtual machines on our existing xen cluster and
can take storage away from the dom0 for over 45 minutes without a
problem. If the domU kernel were responsible for timing out the IO
requests, I'd be seeing some sort of kernel error on my domUs in this
situation. Instead they just hang waiting for the IO and gracefully
recover once it comes back (albeit with very, very high load averages
as requests back up). I've done no patching/changes to our existing
systems to get it to work like this; it just ended up that way. We're
running stock 3.2.28 dom0s and 2.6.32.60 domUs, so having to hack a
domU kernel on XCP to achieve the same thing seems strange?
That being said, it is a 120s timeout that I'm hitting (the "NFS" line
is me echoing to /dev/kmsg when I pull connectivity, for easy
timestamping)
dom0:
[ 2594.069594] NFS
[ 2609.574285] nfs: server 10.1.26.1 not responding, timed out
[ 2717.464716] end_request: I/O error, dev tda, sector 18882056
domU:
[82688.790260] NFS
[82812.678888] end_request: I/O error, dev xvda, sector 18882056
So here the dom0 is timing out and the I/O error is returned back to
the domU and then it goes read only.
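(For reference, the markers above were produced roughly like this - a
sketch; 10.1.26.1 is the filer IP from the log, and cutting the link
with iptables stands in for physically pulling connectivity:)

    # on the dom0 and inside the domU: drop a timestamped marker into dmesg
    echo NFS > /dev/kmsg

    # cut connectivity to the NFS server
    iptables -I OUTPUT -d 10.1.26.1 -j DROP

    # then wait for the 120s timeout to fire and check the log
    dmesg | tail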
If I manually unmount + remount the SR on the dom0 with "-o hard", I
would expect the timeout to go away, since NFS is no longer returning
the timeout back to XCP. Instead I see the same 120s timeouts, which
makes me think this timeout is coming from some other layer?
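(Roughly what I mean by the remount - a sketch; the SR mount point
under /var/run/sr-mount/<sr-uuid> and the export path are placeholders
for my setup:)

    # check how the SR is currently mounted (soft vs hard, timeo/retrans)
    mount | grep nfs

    # remount with hard semantics so the NFS client retries forever
    # instead of returning EIO once its timeout expires
    umount /var/run/sr-mount/<sr-uuid>
    mount -t nfs -o hard,timeo=600,retrans=2 \
        10.1.26.1:/export/sr /var/run/sr-mount/<sr-uuid>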
Thanks!
- Nathan
_______________________________________________
Xen-api mailing list
Xen-api@lists.xen.org
http://lists.xen.org/cgi-bin/mailman/listinfo/xen-api