Howdy,
I just wanted to take a moment to thank several people who put in quite a few
hours today putting together this advisory:
Avati - for committing the fixes to the source tree in the first place
Vikas - for helping me run down the github commits and issue tracker notes
AB - for explaining to me in some detail how the race conditions happened
...and also for correcting my initial mistakes :)
These types of advisories are new for us, and we're open to your suggestions on
the best way to communicate this information going forward.
Please do not hesitate to contact me if you should need assistance.
Thanks,
John Mark Walker
Gluster Community Guy
From: gluster-users-boun...@gluster.org [gluster-users-boun...@gluster.org] on
behalf of Craig Carl [cr...@gluster.com]
Sent: Friday, June 24, 2011 3:50 PM
To: Gluster Users
Subject: Re: [Gluster-users] UPDATED: Enomaly users, please read -
All -
A much more detailed update from engineering is below. The range of affected
versions is bigger than I thought; I apologize for not getting that right
earlier. Please let me know if you have any questions.
VERSIONS AFFECTED: All current GlusterFS releases up to and including 3.1.5 and 3.2.1
SEVERITY: For Enomaly users, a conflict results in denial of service but no
loss of data. We have not observed the same race conditions in non-Enomaly
environments.
CAUSE: Gluster and Enomaly incompatibility
There is an incompatibility between GlusterFS and Enomaly in how they perform
directory operations. Enomaly's agents monitor node failures and migrate VMs
automatically. These distributed agents communicate with each other by
performing directory operations, such as mkdir. These directory operations
trigger a race condition in GlusterFS, locking up a storage node. The Enomaly
agents then get confused and propagate the error across the site, even for a
single node failure.
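To make the mechanism above concrete, here is a minimal sketch of the
mkdir-as-signal pattern such agents typically rely on. The directory name and
messages are hypothetical, not Enomaly's actual protocol; the point is that
POSIX mkdir is atomic, so exactly one agent "wins":

```shell
#!/bin/sh
# Sketch of mkdir-style inter-agent signalling (paths are hypothetical).
# mkdir either creates the directory or fails with EEXIST, atomically,
# so agents can use it both as a message and as a mutual-exclusion primitive.
SIGNAL_DIR="${TMPDIR:-/tmp}/agent-signal.$$"

if mkdir "$SIGNAL_DIR" 2>/dev/null; then
    echo "signal posted"            # this agent created the marker first
else
    echo "signal already present"   # another agent beat us to it
fi

rmdir "$SIGNAL_DIR"                 # consuming the signal removes the marker
```

On a shared GlusterFS mount, every agent on every hypervisor issues these same
operations against the same bricks, which is what exposes the race described
above.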
A race condition related to changing GFIDs was fixed in 3.1.5 -
http://blog.gluster.com/2011/06/glusterfs-3-1-5-now-available/ - but this is
only a partial fix for the behavior described; the race can still occur.
After fixing the initial outage, if any node fails, you will see the issue
again. Upgrading GlusterFS to 3.1.5 and restarting GlusterFS and Enomaly is a
temporary fix. A permanent solution requires the Gluster 3.1.6 or 3.2.2 release
(coming soon, see Solution below).
Other possible race conditions are fixed in the current source tree, subject to
further testing.
SOLUTION:
This issue has been fixed in our source repository (https://github.com/gluster)
and will be released soon with 3.1.6 and 3.2.2. If you'd like to help test the
current fixes, please contact us before you do anything foolish (read: use in
production). Users who test these patches in their non-critical, development
environments and send us feedback will each get a Gluster t-shirt, maybe even a
hat!!
We will send out another alert as soon as both releases are GA.
--
Thanks,
Craig Carl
Senior Systems Engineer | Gluster
408-829-9953 | San Francisco, CA
http://www.gluster.com/gluster-for-aws/
Craig Carl <cr...@gluster.com>
June 24, 2011 1:57 PM
All -
Gluster has identified a serious issue that affects anyone hosting VM images
for the Enomaly Elastic Cloud Platform with Gluster. This issue is limited to
Gluster versions <=3.1.4. We strongly encourage anyone using Enomaly and
Gluster to upgrade to Gluster version 3.1.5 or higher.
What causes the failure -
Use a distribute-replicate volume.
Mount using either NFS or the GlusterFS native client.
Fail one of replica nodes.
** Production is unaffected at this point.
Restart the failed node.
** All the virtual machines fail.
** The ecpagent service on each hypervisor will constantly restart.
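For reference, the failure scenario above can be sketched with the standard
gluster CLI. The hostnames, brick paths, volume name, and mount point below are
examples only, not taken from any affected site:

```shell
# Sketch of the affected setup (example names; run on a real test cluster only).
# 1. A distribute-replicate volume: two replica pairs distributed across 4 nodes.
gluster volume create vmvol replica 2 \
    node1:/export/brick node2:/export/brick \
    node3:/export/brick node4:/export/brick
gluster volume start vmvol

# 2. Mount via the native client (mounting over NFS reproduces it as well).
mount -t glusterfs node1:/vmvol /mnt/vmvol

# 3. Power off one replica node: production keeps running.
# 4. Power it back on: self-heal kicks in, the ecpagent race triggers,
#    and the VMs fail as described above.
```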
Root cause -
Enomaly uses a locking mechanism on top of standard POSIX locking to make sure
that a VM never starts on two servers at the same time. When an Enomaly server
starts a VM, it writes a file (randomstuff.tmp) to the directory. The
combination of self-heal and a race between the ecpagents on the hypervisors
results in the VMs failing to start. No data is lost or damaged.
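A lock-file handshake of the kind described can be sketched as follows. The
file name and messages here are hypothetical (Enomaly's actual file name and
layout are not shown in this advisory); the sketch uses the shell's noclobber
option so that only one starter can create the file:

```shell
#!/bin/sh
# Sketch of a create-the-lock-file handshake (LOCKFILE name is hypothetical).
LOCKFILE="${TMPDIR:-/tmp}/vm-start.$$.tmp"

# 'set -C' (noclobber) makes the > redirection fail if the file already
# exists, so only one would-be starter wins the race to create it.
if ( set -C; echo "$$" > "$LOCKFILE" ) 2>/dev/null; then
    echo "lock acquired, safe to start the VM"
    rm -f "$LOCKFILE"   # release once the VM is up
else
    echo "lock held elsewhere, refusing to start"
fi
```

The scheme assumes create-if-absent behaves atomically on the shared mount;
when self-heal replays directory state after a node restart, that assumption
breaks, which matches the failure mode described above.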
Again, this issue is specific to Enomaly. Enomaly users should immediately
upgrade to a version of Gluster >=3.1.5.
http://download.gluster.com/pub/gluster/glusterfs/
___
Gluster-users mailing list
Gluster-users@gluster.org
http://gluster.org/cgi-bin/mailman/listinfo/gluster-users