On Wed, Dec 14, 2016 at 1:48 PM, Xavier Hernandez <xhernan...@datalab.es> wrote:
> There's another issue with the patch that Ashish sent.
>
> The original problem is that a setattr on a symbolic link gets transformed
> to a regular file while the fop is being executed. Even if we apply
> Ashish's patch to avoid the assert, the setattr fop will still succeed and
> incorrectly change the attributes of a gluster special file that shouldn't
> change.
>
> I think that's a bigger problem that needs to be addressed globally.
>
> I'm sure this is not an easy solution, but probably the best way would be
> to have distinct inodes for the gluster link files and the original file.
> This way most of these problems should be solved.

Is there any reason why there is a difference in the type of the file on the
hashed/cached subvols? Could we have the same type of file on both dht
subvolumes? That would avoid unlinking the regular file and recreating it
with the actual type of the file.

> Xavi
>
> On 12/14/2016 09:02 AM, Xavier Hernandez wrote:
>> On 12/14/2016 06:10 AM, Raghavendra Gowdappa wrote:
>>>
>>> ----- Original Message -----
>>>> From: "Pranith Kumar Karampuri" <pkara...@redhat.com>
>>>> To: "Ashish Pandey" <aspan...@redhat.com>
>>>> Cc: "Gluster Devel" <gluster-devel@gluster.org>, "Shyam Ranganathan"
>>>> <srang...@redhat.com>, "Nithya Balachandran" <nbala...@redhat.com>,
>>>> "Xavier Hernandez" <xhernan...@datalab.es>, "Raghavendra Gowdappa"
>>>> <rgowd...@redhat.com>
>>>> Sent: Tuesday, December 13, 2016 9:29:46 PM
>>>> Subject: Re: 1402538 : Assertion failure during rebalance of symbolic
>>>> links
>>>>
>>>> On Tue, Dec 13, 2016 at 2:45 PM, Ashish Pandey <aspan...@redhat.com>
>>>> wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> We have been seeing an issue where rebalancing symbolic links leads
>>>>> to an assertion failure in an EC volume.
>>>>>
>>>>> The root cause is that while migrating a symbolic link to another
>>>>> subvolume, DHT creates a link file (with attributes .........T).
>>>>> This file is a regular file.
>>>>> Now, during migration, a setattr comes in for this link and, because
>>>>> of a possible race, posix_stat returns the stats of this "T" file.
>>>>> In ec_manager_setattr, we receive the callbacks and check the type of
>>>>> the entry. If it is a regular file, we try to get its size, and if it
>>>>> is not there, we raise an assert.
>>>>> So, basically, we are checking the size of the link (which will not
>>>>> have a size) that has been returned as a regular file, and we hit the
>>>>> assert when this condition becomes true.
>>>>>
>>>>> Now, this looks like a problem with rebalance and is difficult to fix
>>>>> at this point (as per the discussion). We have an alternative fix in
>>>>> EC, but that would be more of a hack than an actual fix. We should
>>>>> not modify EC to deal with an individual issue that lies in another
>>>>> translator.
>>>>
>>> I am afraid dht doesn't have a better way of handling this. While DHT
>>> maintains the abstraction (of a symbolic link) to the layers above, the
>>> layers below it cannot be shielded from seeing details like a linkto
>>> file, etc.
>>
>> That's ok, and I think it's the right thing to do. From the point of
>> view of EC, it's irrelevant how the file is seen by upper layers. It
>> only cares about the files below it.
>>
>>> If the concern really is that the file is changing its type in the
>>> span of a single fop, we can probably explore the option of locking
>>> (or other synchronization mechanisms) to prevent migration from taking
>>> place while a fop is in progress.
>>
>> That's the real problem. Some operations receive an inode referencing a
>> symbolic link on input, but the iatt structures from the callback
>> reference a regular file. It's even worse because it's an asynchronous
>> race, so some of the bricks may return a regular file and some may
>> return a symbolic link.
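A rough sketch of the two checks being discussed may help. This is a
hypothetical model of the described logic only (the type names, struct, and
helpers below are illustrative, not the actual ec_manager_setattr code):

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative model; not actual ec code. */
typedef enum { IA_IFREG, IA_IFLNK } ia_type_t;

typedef struct {
    ia_type_t ia_type;    /* type reported by posix_stat in the callback */
    bool      have_size;  /* ec only records a size for regular files    */
} cbk_answer_t;

/* Current behavior as described: trust the callback's type. A symlink that
 * raced with migration comes back as a regular "T" file with no recorded
 * size, so the size check fails and the assert fires. */
bool ec_setattr_size_ok(ia_type_t input_type, const cbk_answer_t *cbk)
{
    (void)input_type;             /* input type is ignored today */
    if (cbk->ia_type == IA_IFREG)
        return cbk->have_size;    /* the GF_ASSERT-style condition */
    return true;
}

/* Ashish's workaround as described in the thread: trust the inode type at
 * the input of the fop instead of the callback's, so a symlink on input
 * never reaches the size check. */
bool ec_setattr_size_ok_patched(ia_type_t input_type, const cbk_answer_t *cbk)
{
    if (input_type == IA_IFREG)
        return cbk->have_size;
    return true;
}
```

With a racing linkto file (regular type, no recorded size) and a symlink on
input, the first check fails while the patched one passes, which matches why
the patch silences the assert without fixing the underlying race.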
>> If there are more than redundancy bricks returning a different type,
>> the most probable result will be an I/O error caused by the
>> inconsistent answers.
>>
>> Ashish wrote a patch to check the type of the inode at the input
>> instead of relying on the answers. While this could avoid the assertion
>> issued by ec, it doesn't solve the race, leaving room for the I/O
>> errors I mentioned earlier.
>>
>>> But, I assume there will be performance penalties for that too.
>>
>> Yes. I don't see any other way to really solve this problem. A lock is
>> needed.
>>
>> In ec we already have a problem that will need an additional lock on
>> rmdir, unlink and rename to avoid some races. This change will also
>> need support from the locks xlator to avoid granting locks on deleted
>> inodes. If dht is using one of these operations to replace the symbolic
>> link with the gluster link file, I think this change could solve the
>> I/O errors, but I'm not sure we could completely solve the problem.
>>
>> I'm not sure how dht does the transformation from a symbolic link to a
>> gluster link file, but if it involves more than one fop from the point
>> of view of ec, there's nothing ec can do to solve the problem. If
>> another client accesses the file, ec can return any intermediate state.
>> DHT should take some lock to do all the operations atomically and avoid
>> problems on other clients.
>>
>> I think the mid-term approach to completely solve the problem without a
>> performance impact would be to implement some kind of transaction
>> mechanism that reuses lock requests. This would allow, among other
>> things, multiple atomic operations to be performed by different
>> xlators, sharing the locks instead of requiring each xlator to take an
>> inodelk on its own.
>>
>> Xavi
>>
>>>>> Now the question is: how do we proceed with this? Any suggestions?
>>>>
>>>> Raghavendra/Nithya,
>>>>     Could one of you explain the difficulties in fixing this issue
>>>> in DHT, so that Xavi is also caught up on why we should add this
>>>> change in EC in the short term?
>>>>
>>>>> Details on this bug can be found here:
>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1402538
>>>>>
>>>>> ----
>>>>> Ashish
>>>>
>>>> --
>>>> Pranith

--
Pranith
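To make Xavi's point about more-than-redundancy divergent answers concrete,
here is a hedged sketch under assumed semantics (an EC volume with `total`
bricks and `redundancy` redundancy needs at least total - redundancy matching
answers to combine a result); the helper below is hypothetical, not ec's
actual combine logic:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Illustrative model of answer combining; not actual ec code. */
typedef enum { IA_IFREG, IA_IFLNK } ia_type_t;

/* Group the per-brick answers by file type. If the largest consistent
 * group is smaller than total - redundancy (i.e. more than `redundancy`
 * bricks disagree), no combinable result exists and the fop fails with
 * EIO. */
int ec_combine_types(const ia_type_t *answers, size_t total, size_t redundancy)
{
    size_t reg = 0, lnk = 0;

    for (size_t i = 0; i < total; i++) {
        if (answers[i] == IA_IFREG)
            reg++;
        else
            lnk++;
    }

    size_t best = reg > lnk ? reg : lnk;  /* largest consistent group */
    return (best >= total - redundancy) ? 0 : -EIO;
}
```

For a 4+2 volume (total 6, redundancy 2), one stray "T"-file answer still
leaves 5 matching answers and the fop can proceed, but a 3/3 split during
the race leaves only 3 consistent answers (fewer than the 4 required), giving
the inconsistent-answers I/O error Xavi describes.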
_______________________________________________
Gluster-devel mailing list
Gluster-devel@gluster.org
http://www.gluster.org/mailman/listinfo/gluster-devel