Here's a KB which covers how to change the *XenServer* HA setting (not the
CloudStack one): http://support.citrix.com/article/CTX139166.  It would be
good to check /var/log/xha.log to see if any issues were logged there.
Also note that with HA you always want to have your hosts NTP-synced.  With
the default timeout being 30 seconds, I'd start by verifying with your
NetApp admins how long the head was actually offline.  I'd also look into
any network config issues (assuming you've bonded your storage network).
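
To make the timing concrete, here's a minimal sketch of when a storage
outage would trip HA fencing (the 30 s figure is the default timeout
mentioned above; the outage durations are illustrative, not measured):

```python
# Sketch: does a storage outage outlast the XenServer HA timeout?
# 30 s is the default HA timeout referenced above; the outage
# durations below are illustrative examples, not measured values.
HA_TIMEOUT_S = 30

def host_fences(outage_s, ha_timeout_s=HA_TIMEOUT_S):
    """A host self-fences once its heartbeat SR has been
    unreachable for longer than the HA timeout."""
    return outage_s > ha_timeout_s

print(host_fences(12))   # quick head failover: False (host survives)
print(host_fences(180))  # slow takeover: True (host fences)
```

This is why the actual offline duration of the head matters more than
whether the failover itself was "normal".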

-tim

On Sun, Feb 15, 2015 at 4:28 PM, Adriano Paterlini <paterl...@usp.br> wrote:

> Yiping,
>
> We do have a production environment with similar configuration, you can
> check some parameters and logs.
>
> First of all, a XenServer NFS timeout will occur every time the NFS server
> takes more than 13.3 (40.0/3.0) seconds to answer NFS read or write calls;
> this is defined as SOFTMOUNT_TIMEOUT in /opt/xensource/sm/nfs.py. There are
> some XenServer forum discussions about changing this parameter; my
> conclusion is that it's not recommended, since the consequence would be
> virtual machines going into read-only mode unless VM parameters are also
> modified (Linux defaults are usually 30 seconds). NFS timeouts are shown in
> /var/log/kern.log.
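>
> The arithmetic behind that 13.3-second figure can be sketched as follows
> (a hedged illustration: `soft`, `timeo`, and `retrans` are standard NFS
> mount options, but confirm the actual values in /opt/xensource/sm/nfs.py
> on your own hosts):

```python
# Sketch of the soft-mount timeout arithmetic discussed above.
# With 'soft' NFS mounts, each RPC attempt waits 'timeo' deciseconds
# and is retried 'retrans' times before the client gives up.
def softmount_options(total_budget_s=40.0, retrans=3):
    per_try_s = total_budget_s / retrans     # ~13.33 s per attempt
    timeo_deciseconds = int(per_try_s * 10)  # 'timeo' is in tenths of a second
    return {"soft": True, "timeo": timeo_deciseconds, "retrans": retrans}

opts = softmount_options()
print(opts["timeo"], opts["retrans"])  # 133 3
```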
>
> However, the timeout itself does not cause the host reboot; the reboot is
> probably due to CloudStack HA storage fencing, just as Tim mentioned.
> Storage fencing is enforced by the script /opt/cloud/bin/xenheartbeat.sh;
> you can check its log entries to confirm whether that was the case. If it
> really was, you can adjust the CloudStack global settings
> xenserver.heartbeat.interval and xenserver.heartbeat.timeout to accommodate
> planned maintenance and even automatic storage-side HA. You should check
> with NetApp for recommended values for your environment; takeover/giveback
> delays may vary with controller version and even current controller load,
> and NetApp documentation mentions 180 seconds as the maximum delay. Also
> check that the script is running correctly (ps aux | grep heartbeat); it
> should take 3 parameters, and if not you may be affected by
> https://issues.apache.org/jira/browse/CLOUDSTACK-7184.
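>
> As a sanity check on those two settings, here is a small sketch (the
> default timeout value below is an assumption; read the real values from
> your global settings before relying on this):

```python
# Sketch: will a planned NetApp takeover fit inside CloudStack's
# XenServer heartbeat window? xenheartbeat.sh reboots a host once the
# heartbeat file on the SR has gone stale for longer than the timeout.
# The 120 s default below is an assumption; check your global settings.
def survives_takeover(takeover_s, heartbeat_timeout_s=120):
    return takeover_s < heartbeat_timeout_s

print(survives_takeover(180))       # NetApp's documented worst case: False
print(survives_takeover(180, 240))  # with a raised timeout: True
```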
>
> Hope the comments help your decision.
>
>
> Regards,
> Adriano
>
>
> On Sun, Feb 15, 2015 at 12:38 PM, <cyr...@usp.br> wrote:
>
> > FYI
> >
> > Sent from my iPhone
> >
> > Begin forwarded message:
> >
> > *From:* Yiping Zhang <yzh...@marketo.com>
> > *Date:* February 15, 2015 at 2:00:05 AM GMT-2
> > *To:* "users@cloudstack.apache.org" <users@cloudstack.apache.org>
> > *Subject:* *Re: Cloudstack + XenServer 6.2 + NetApp in production*
> > *Reply-To:* <users@cloudstack.apache.org>
> >
> > Tim,
> >
> > Thanks for the reply.
> >
> > In our case, the NetApp cluster as a whole did not fail.  The NetApp
> > cluster failover happened because the Operations team was performing
> > scheduled maintenance; this is normal behavior. To the best of my
> > knowledge, a NetApp head failover should take anywhere from 10-15 seconds.
> >
> > As you guessed correctly, our XenServer resource pool does have HA
> > enabled, and the HA shared SR is indeed on the same NetApp cluster as the
> > primary storage SR.  Though I am not sure if enabling xen pool HA is the
> > cause of the XenServer's rebooting under this particular scenario.
> >
> > I am not sure if I understand your statement that "In that case, HA would
> > detect the storage failure and fence the XenServer host".  Can you
> > elaborate a little more on this statement?
> >
> > Thanks again,
> >
> > Yiping
> >
> >
> > On 2/14/15, 6:26 AM, "Tim Mackey" <tmac...@gmail.com> wrote:
> >
> > Yiping,
> >
> > The specific problem covered by that note was solved a long time ago.
> > Timeouts can be caused by a number of things, and if the entire NetApp
> > cluster went offline, the XenServer host would be impacted.  Since you are
> > experiencing a host reboot when this happens, I suspect you have XenServer
> > HA enabled with the heartbeat on the same NetApp cluster.  In that case,
> > HA would detect the storage failure and fence the XenServer host.
> >
> > The solution here would be to understand why your NetApp cluster failed
> > during scheduled maintenance. Something in your configuration has created
> > a single point of failure. If you've enabled HA, I would also like to
> > understand why you've chosen to do that.  Going slightly commercial for a
> > second, I would also advise you to look into a commercial support contract
> > for your production XenServer hosts. That team is going to be able to go
> > deeper, and much quicker, when production issues arise than this list.
> > NetApp and XenServer are used in a very large number of deployments, so if
> > there is something wrong they'll be more likely to know. For example,
> > there could be a set of XenServer or OnTap patches to help sort this out.
> >
> > -tim
> >
> > On Fri, Feb 13, 2015 at 7:36 PM, Yiping Zhang <yzh...@marketo.com> wrote:
> >
> > Hi, all:
> >
> > I am wondering if anyone is running their CloudStack in production
> > deployments with XenServer 6.2 + NetApp clusters?
> >
> > Recently, in our non-production deployment (RHEL 6.6 + CS 4.3.0 +
> > XenServer 6.2 cluster + NetApp cluster), all our XenServer hosts rebooted
> > automatically because of an NFS timeout when our NetApp cluster failover
> > happened during scheduled filer maintenance. My Google search turned up
> > this Citrix hotfix: http://support.citrix.com/article/CTX135623 for
> > XenServer 6.0.2, and this post about XenServer 6.2:
> > http://www.gossamer-threads.com/lists/xen/devel/320020 .
> >
> > Obviously the problem still exists for XenServer 6.2, and we are very
> > concerned about going to production deployment based on this technology
> > stack.
> >
> > If anyone has a similar setup, please share your experiences.
> >
> > Thanks,
> >
> > Yiping
>
>
> --
> Adriano Arantes Paterlini
> Systems Analyst
> Centro de Tecnologia da Informação - CeTI-SP
> Superintendência de Tecnologia da Informação - STI
> Universidade de São Paulo
>
> Phone: +55 (11) 3091-0494
>
> Av. Professor Luciano Gualberto, 71, tv. 3
> Cidade Universitária - São Paulo / SP
>
