One of the Red Hat QE engineers (Nag Pavan) found a day-1 bug in entry self-heal: a file with good data can be replaced by a file with bad data when renames and self-heal interact in a particular way.

Sample steps (from the bz):

1) Create a plain replica volume with 2 bricks. Start the volume and mount it.
2) mkdir dir && mkdir newdir && touch dir/file1
3) Bring the first brick down.
4) echo abc > dir/file1
5) Bring the first brick back up and quickly bring the second brick down before self-heal can be triggered.
6) mv dir/file1 newdir/file2   <<--- note that this is an empty file
7) Bring the second brick back up.

If entry self-heal of 'dir' happens first, it deletes file1, which holds the content 'abc', on the second brick. When the heal of 'newdir' then happens, it creates an empty file, and the data in the file is lost. The same can be achieved using 'link' + 'unlink' as well.

The main reason for this problem is that afr entry self-heal currently does not fully take link counts into account before deleting the final link of an inode, so it always does an unlink, recreates the file, and then does a data heal. In this corner case the unlink happens on the good copy of the file, and we either lose data or get stale data, depending on what data is present on the sink file.

The solution we are proposing is the following:

1) Posix will maintain a hidden directory '.glusterfs/anoninode' (we could call it lost+found as well), which afr/ec will use to keep 'inodes' until their names are resolved.
2) When afr or ec needs to heal a directory and a 'name' has to be deleted, but the inode is still present on the other bricks, it renames the file to 'anoninode/<gfid-of-file/dir>' instead of doing unlink/rmdir on it.
3) For files:
   a) Both afr and ec already have logic to do 'link' instead of creating a new file if the gfid already exists on the brick. So when a name is resolved, they do exactly what they do now.
   b) Self-heal daemon will periodically crawl the first level of the 'anoninode' directory and delete the 'inodes' (represented as files with the gfid string as their name) whenever the link count is > 1. It will also delete the files if the gfid ceases to exist on the other bricks.
4) For directories:
   a) Both afr and ec need to 'rename' 'anoninode/dir-gfid' to the name it resolves to as part of entry self-heal, instead of doing 'mkdir'.
   b) If the self-heal daemon crawl detects that a directory has been deleted on the other bricks, it has to scan the entries inside the deleted directory and move them into 'anoninode' if the gfid of the file/directory exists on the other bricks. Otherwise they can be safely deleted.

Please let us know if you see any issues with this approach.

-- Pranith
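
For anyone who wants to see the file case play out, here is a toy Python model of the sample steps and of points 2) and 3) of the proposal. It is only a sketch under stated assumptions: Brick, entry_heal and shd_sweep are made-up names for illustration, not GlusterFS APIs; a brick is reduced to a name table and an inode table; brick 1 is assumed to be the source for entry heal of both directories; data heal and directories are not modeled.

```python
ANON = ".glusterfs/anoninode"  # hidden directory name taken from the proposal


class Brick:
    """Toy brick (hypothetical): path -> gfid names, gfid -> data inodes."""

    def __init__(self):
        self.names = {}
        self.inodes = {}

    def create(self, path, gfid, data=""):
        self.names[path] = gfid
        self.inodes.setdefault(gfid, data)

    def link_count(self, gfid):
        return sum(1 for g in self.names.values() if g == gfid)

    def unlink(self, path):
        gfid = self.names.pop(path)
        if self.link_count(gfid) == 0:
            del self.inodes[gfid]  # last link gone, so the data is destroyed

    def rename(self, old, new):
        self.names[new] = self.names.pop(old)

    def read(self, path):
        return self.inodes[self.names[path]]


def entry_heal(source, sink, directory, use_anoninode):
    """Make the sink's entries under `directory` match the source's."""
    prefix = directory + "/"
    src = {p: g for p, g in source.names.items() if p.startswith(prefix)}
    snk = {p: g for p, g in sink.names.items() if p.startswith(prefix)}
    for path, gfid in snk.items():      # names that exist only on the sink
        if path in src:
            continue
        if use_anoninode and gfid in source.inodes:
            sink.rename(path, f"{ANON}/{gfid}")  # park the inode, keep its data
        else:
            sink.unlink(path)                    # current behaviour
    for path, gfid in src.items():      # names that exist only on the source
        if path in snk:
            continue
        if gfid in sink.inodes:
            sink.names[path] = gfid     # 'link' to the inode already on the brick
        else:
            sink.create(path, gfid)     # brand-new empty file


def shd_sweep(brick, other):
    """Drop parked names that are resolved or whose gfid is gone elsewhere."""
    for path, gfid in list(brick.names.items()):
        if path.startswith(ANON + "/") and (
            brick.link_count(gfid) > 1 or gfid not in other.inodes
        ):
            brick.unlink(path)


def run(use_anoninode):
    b1, b2 = Brick(), Brick()
    for b in (b1, b2):
        b.create("dir/file1", "gfid-1")     # step 2: empty file on both bricks
    b2.inodes["gfid-1"] = "abc"             # steps 3-4: write reaches brick 2 only
    b1.rename("dir/file1", "newdir/file2")  # steps 5-6: rename reaches brick 1 only
    entry_heal(b1, b2, "dir", use_anoninode)     # step 7: 'dir' is healed first
    entry_heal(b1, b2, "newdir", use_anoninode)
    shd_sweep(b2, b1)
    return b2


print(repr(run(False).read("newdir/file2")))  # '' : the 'abc' copy was unlinked
print(repr(run(True).read("newdir/file2")))   # 'abc' : the inode survived the heal
```

With use_anoninode=False the only copy of 'abc' is unlinked during the heal of 'dir', so the heal of 'newdir' can only create an empty file. With use_anoninode=True the inode is parked under its gfid, linked back in as newdir/file2, and the sweep then removes the parked name because the link count is 2.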

_______________________________________________
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel