Re: [Gluster-devel] bad file access (bit-rot + AFR)

Raghavendra Bhat Mon, 29 Jun 2015 21:51:48 -0700

On 06/27/2015 03:28 PM, Venky Shankar wrote:

On 06/27/2015 02:32 PM, Raghavendra Bhat wrote:
Hi,
There is a patch that is submitted for review to deny access to objects which are marked as bad by the scrubber (i.e. the data of the object might have been corrupted in the backend).
http://review.gluster.org/#/c/11126/10
http://review.gluster.org/#/c/11389/4
The above 2 patch sets solve the problem of denying access to the bad objects (they have passed regression and received a +1 from Venky). But in our testing we found that there is a race window (depending upon the scrubber frequency the race window can be larger) where there is a possibility of the self-heal daemon healing the contents of the bad file before the scrubber can mark it as bad.
I am not sure whether there is a chance of hitting this issue when the data truly gets corrupted in the backend. But in our testing, to simulate backend corruption, we modify the contents of the file directly in the backend. In this case, before the scrubber can mark the object as bad, the self-heal daemon kicks in and heals the contents of the bad file onto the good copy. Or, if a client accesses the file before the scrubber marks it as bad, AFR finds that there is a mismatch in metadata (since we modified the contents of the file in the backend) and does data and metadata self-healing, thus copying the contents of the bad copy to the good copy. From then onwards, clients accessing that object always get bad data.
I understand from Ravi (ranaraya@) that AFR-v2 would choose the "biggest" file as the source, provided that the afr xattrs are "clean" (AFR-v1 would give back EIO). If a file is modified directly on the brick but its size is left unchanged, contents can be served from either copy. For self-heal to detect anomalies, there needs to be verification (checksum/signature) at each stage of its operation. But this might be too heavy on the I/O side. We could still cache mtime [but update on client I/O] after pre-check, but this still would not catch bit flips (unless a filesystem scrub is done).
Thoughts?

Yes. Even if self-heal verifies the checksum only once, just before healing the file, the time taken to verify it might be large if the file is large. It might affect self-heal performance.
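
To make the cost concrete, here is a rough sketch of what a pre-heal check could look like. This is not GlusterFS code: the function names, the (path, stored_signature) pairs and the use of SHA-256 are assumptions for illustration only. The point is that every candidate source has to be read in full before it can be trusted, which is where the extra I/O comes from.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Checksum a file in chunks; memory stays flat but I/O is O(file size)."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def pick_heal_source(replicas):
    """Return the first replica whose content still matches its signature.

    `replicas` is a hypothetical list of (path, stored_signature) pairs,
    standing in for the brick copies and the signature recorded earlier.
    Returns None when no copy verifies, i.e. there is no safe heal source.
    """
    for path, stored_signature in replicas:
        if sha256_of(path) == stored_signature:
            return path
    return None
```

With a check like this, a copy that was modified in the backend without a size change would be rejected as a source, but only at the price of a full read of each candidate.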


Regards,
Raghavendra Bhat

Pranith? Do you have any solution for this? Venky and I are trying to come up with one.
But does this issue block the above patches in any way? (Those 2 patches are still needed to deny access to objects once they are marked as bad by the scrubber.)
Regards,
Raghavendra Bhat
_______________________________________________
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel
